Cloudkibo / CloudKibo

CloudKibo
0 stars 0 forks source link

Error Recovery #483

Closed sojharo closed 7 years ago

sojharo commented 7 years ago

ASAK Sojharo & Sumaira

We recently had two issue with iOS that are related to error recovery.

1) The device tried to do registration with the hub and for some reason the registration failed may be due to network we really do not know. But we did not checked for failure or success. Later notification were not getting delivered.

2) A notification was sent to iOS and it made the API request to get the data. From the log it says server sent the data. Some how client did not get the data. The current clinet login assumes data would be returned successfully. We do not have any error recovery logic.

The above two issues will happen more when we have more device using the system.

Sojharo can you take the lead and update the design document so we have one method of recovery for both devices. Sumaira is already working on a solution. Please work with her and let's come up with a one solution.

Please move this email to GitHub

Allah Hafiz

Jawaid

sumairasaeed commented 7 years ago

Walaikumassalaam. Acknowledged.

On Wed, Dec 21, 2016 at 8:34 PM, Sojharo notifications@github.com wrote:

Assigned #483 https://github.com/Cloudkibo/CloudKibo/issues/483 to @sumairasaeed https://github.com/sumairasaeed.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Cloudkibo/CloudKibo/issues/483#event-901754143, or mute the thread https://github.com/notifications/unsubscribe-auth/AKbhp7VxqQULM4LkxjWTI9KXlyp4Auqmks5rKUcDgaJpZM4LTEch .

jekram commented 7 years ago

@sojharo Let's get the design done first

sojharo commented 7 years ago

Yes sir. I have taken this task for today

sojharo commented 7 years ago

1) I carefully studied the implementation code for push notification given to us by Azure. This logic was given to us by Azure for both iOS and Android application to integrate push notifications with our application. From inspecting the code in detail, I found that it registers with notification hub each time application is started. It has service which runs in the background to register with notification hub and this service keeps running all the time. Also, I found there was a logic to refresh the registration token if token is expired or is compromised. As the background service starts each time the application starts and keeps running, it automatically cover the failure cases. In Android cases, we inform on the UI if registration of token is failed. This happens when there is no Internet or we are behind IBA proxy. Nevertheless, the logic tries again when we start application next time. As far as I have studied the code, I think the implementation logic given to us by Azure is fail-proof and automatically registers and user has nothing to do. I think on iOS sumaira needs to look into the implementation code given by Azure and then check it with iOS current implementation to see if something has been changed in this part. Here is my registration code given by Azure which I used as it is. We can see it handles all use cases:

    protected void onHandleIntent(Intent intent) {
        SharedPreferences sharedPreferences = PreferenceManager.getDefaultSharedPreferences(this);
        String resultString = null;
        String regID = null;
        String storedToken = null;

        try {
            DatabaseHandler db = new DatabaseHandler(getApplicationContext());
            user = db.getUserDetails();

            String FCM_token = FirebaseInstanceId.getInstance().getToken();
            Log.d(TAG, "FCM Registration Token: " + FCM_token);

            // Storing the registration id that indicates whether the generated token has been
            // sent to your server. If it is not stored, send the token to your server,
            // otherwise your server should have already received the token.
            if (((regID=sharedPreferences.getString("registrationID", null)) == null)){
                NotificationHub hub = new NotificationHub(NotificationSettings.HubName,
                        NotificationSettings.HubListenConnectionString, this);
                Log.d(TAG, "Attempting a new registration with NH using FCM token : " + FCM_token);
                //regID = hub.register(FCM_token).getRegistrationId();

                String phone = user.get("phone").substring(1);

                regID = hub.register(FCM_token, user.get("phone").substring(1)).getRegistrationId();

                // If you want to use tags...
                // Refer to : https://azure.microsoft.com/en-us/documentation/articles/notification-hubs-routing-tag-expressions/
                // regID = hub.register(token, "tag1", "tag2").getRegistrationId();

                resultString = "New NH Registration Successfully - RegId : " + regID;
                Log.d(TAG, resultString);
                sharedPreferences.edit().putString("registrationID", regID ).apply();
                sharedPreferences.edit().putString("FCMtoken", FCM_token ).apply();
            }
            // Check if the token may have been compromised and needs refreshing.
            else if ((storedToken=sharedPreferences.getString("FCMtoken", "")) != FCM_token) {

                NotificationHub hub = new NotificationHub(NotificationSettings.HubName,
                        NotificationSettings.HubListenConnectionString, this);
                Log.d(TAG, "NH Registration refreshing with token : " + FCM_token);
                //regID = hub.register(FCM_token).getRegistrationId();

                String phone = user.get("phone").substring(1);

                regID = hub.register(FCM_token, user.get("phone").substring(1)).getRegistrationId();

                // If you want to use tags...
                // Refer to : https://azure.microsoft.com/en-us/documentation/articles/notification-hubs-routing-tag-expressions/
                // regID = hub.register(token, "tag1,tag2").getRegistrationId();

                resultString = "New NH Registration Successfully - RegId : " + regID;
                Log.d(TAG, resultString);

                sharedPreferences.edit().putString("registrationID", regID ).apply();
                sharedPreferences.edit().putString("FCMtoken", FCM_token ).apply();
            }
            else {
                resultString = "Previously Registered Successfully - RegId : " + regID;
            }
        } catch (Exception e) {
            Log.e(TAG, resultString="Failed to complete token refresh", e);
            // If an exception happens while fetching the new token or updating our registration data
            // on a third-party server, this ensures that we'll attempt the update at a later time.
        }

        // Notify UI that registration has completed.
        if (MainActivity.isVisible) {
            //MainActivity.mainActivity.ToastNotify2(resultString);
            Utility.sendLogToServer(resultString+ " FOR NUMBER: "+ user.get("phone").substring(1));
        }
    }

2) I have revisited the logic for this. We have two cases here. Case 1 If this is partial chat sync (correction: one to one chat), then our logic is clear and server doesn't assume any message as delivered until client gives the confirmation for that message by calling updateStatus API. So, if the client didn't get message from server, then client should not call the updateStatus API and try again to sync the messages from server. One problem might be that client didn't check if the message is properly delivered or not and called the updateStatus API to let server know that this message is delivered to client. Case 2 If this is during one to one chat (correction: partial sync chat), and notification was received by client for chat message and client tried to fetch the chat message, at that time server marks the chat as delivered. However, it is the responsibility of client to check if server has given the chat message in response or not or due to network failure the message was not received by client which was sent by server. The client has the chat message id from the notification and client can use this id to call the server again to fetch the same message. Possible error can happen when server sent the required message to client but client didn't receive it due to sudden network problem. The easiest solution is, as client already has the chat id, to try again to fetch the chat message with the chat id. If multiple tries are failed then client should store the chat id in sqlite table to try again later.

@sumairasaeed kindly review the analysis on both problems and their solutions. Once agreed, I would move them to design document. For problem no 2, I think your issue is case 1 and i have provided the solution for that. For problem no 1, I suggest revisit the notification implementation code given by Azure as it contains all the recovery logics even for when token is compromised.

jekram commented 7 years ago

@sojharo Thanks for the detailed response.

sumairasaeed commented 7 years ago

Thanks @sojharo . I will also look at Azure documentation for iOS to re-check if I have missed any step.

Our iOS problem case is "Case 2". I had one point to discuss.

@sojharo You commented this on Case2: " client tried to fetch the chat message, at that time server marks the chat as delivered. However, it is the responsibility of client to check if server has given the chat message in response or not "

This design logic implemented on server, wrongly sends the status update of "delivered" to sender. Because there are chances that the fetch request fails and receiver do not get chat message. Sender will wrongly assume that message was "Delivered" as server sends "delivered" notification to sender even if chat fetch request failed at receiver's end. Please correct me if I misunderstood anything. @sojharo @jekram

Due to this design, the issue # 437 happened on iOS. Now it is very clear. Server wrongly changed status of message as "delivered" when it was actually not delivered as fetch API request had failed. As the status got updated to "delivered" so reciever didnot get that missing message though Partial Sync when internet was back.

It will be better that message status should be changed to "delivered" by receiver and not by server. When client successfully fetches and receives the message then it should call updateStatusAPI to send status update as "delivered". This is exactly how we are doing in Case 1. @jekram @sojharo kindly suggest if you agree?

On Fri, Dec 23, 2016 at 2:51 AM, Cloudkibo notifications@github.com wrote:

@sojharo https://github.com/sojharo Thanks for the detailed response.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Cloudkibo/CloudKibo/issues/483#issuecomment-268898513, or mute the thread https://github.com/notifications/unsubscribe-auth/AKbhp6fTFSxvEAoFMiBI9QHyLzGUk1lwks5rKvDbgaJpZM4LTEch .

sojharo commented 7 years ago

During design for group chat, we had discussed this that once the client fetches the data we would change the status to delivered. It would reduce the http requests to server. Instead of two http requests, one for fetching chat and one for confirming server that I have fetched the chat, for sync we have this one http request.

From the comment of Sir, "A notification was sent to iOS...." it means this is case 1, where it was a single chat message sent by someone. In case 1, as I discussed, server doesn't change the status to delivered automatically so if this is the error then it is on iOS side.

As you said the error is case 2, then this is sync logic. It is necessary to change the status of messages to "delivered" when you are doing sync because they are lots of messages. Now for example, there are 3 clients and each of them haven't synced 100 messages from server. Now the sync starts and according to solution you give for calling updateStatusAPI for each message received, 3 clients will call updateStatusAPI 100 times each. Collectively, it would become 300 confirmation requests to server at a single time. It is like a mini DoS attack on server.

This is the reason I have used updateStatusAPI for fetching single chat and not fetching chats using sync. Also, I had discussed once in office that we would be going to reduce requests to server because we don't want to put too much load on server.

I think from the comment of Sir, your problem is not with sync and it is with when there is a notification for chat. Also, in all places in group chat we are using the same logic which is in partial chat sync of one to one, I wonder why the problem didn't occur at that place. Nevertheless, I am going to make a small change in partial chat sync logic. Client will now send the last message date to server and server would give the chat messages after this date. So if server marked the chat as delivered and client had not received it then client can try again.

I just checked https://github.com/Cloudkibo/iOS/issues/437 It is for case 1 and not for case 2. Please see the description of error by sir. The message was sent from one phone and other phone got the chime but didn't show the message. This is case 1 and case 2 is for sync.

sojharo commented 7 years ago

Correction: the case 1 is for fetching a single chat message when notification was received and the case 2 is for doing partial chat sync.

In my first comment I confused case 1 with case 2.

sojharo commented 7 years ago

Server is marking the chat as resolved. Due to which client on iOS is having problems. We mark the chat as resolved to reduce calls to server. I am thinking more on solution. Till then iOS can put a check if message was not received by iOS when server sent it. Were you able to reproduce this error multiple times on iOS?

sojharo commented 7 years ago

Sumaira had made one change on server on this commit:

https://github.com/Cloudkibo/CloudKibo/commit/0333f790e281b14bdf9315d0c1751db6dc83c6b5

This was somehow overwritten by work which I had done. Sumaira said this might be the reason of the problem she is having. I have corrected it in my recent commit.

One other reason that sumaira discussed was that when client sent request to server to get chat message, client might have disconnected from Internet until server responded. We tried lot to reproduce this in office today but it didn't get reproduce. The gap between request and response from server was in microseconds. This might not be reason according to me.

However, I have thought of the solution for this. I also discussed with sumaira that we should add the retry in our network requests. However, instead of coding this, we should leverage the network library i.e. android network library ion has the option to retry. I can give the number of times which it should retry.

jekram commented 7 years ago

@sojharo How do should we avoid overwriting the work? If multiple people are updating the code, so should not use the branch and pull ?

jekram commented 7 years ago

We should leverage the library provided by iOS and Android instead of writing our own

sojharo commented 7 years ago

yes, we should use branch and pull. sumaira had directly written to master branch and not created her own branch and make pull request. I think she also has the same permissions on the cloudkibo repository which I have.

jekram commented 7 years ago

@sumairasaeed Please learn branch and pull of GitHub. When we have multiple people updating repo that is the only way to go.

sumairasaeed commented 7 years ago

yes. acknowledged

On Sat, Dec 24, 2016 at 7:32 AM, Cloudkibo notifications@github.com wrote:

@sumairasaeed https://github.com/sumairasaeed Please learn branch and pull of GitHub. When we have multiple people updating repo that is the only way to go.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Cloudkibo/CloudKibo/issues/483#issuecomment-269065004, or mute the thread https://github.com/notifications/unsubscribe-auth/AKbhp5voRcmGsBYpSAMtKQFSJj6Zzm4Pks5rLIRbgaJpZM4LTEch .

sojharo commented 7 years ago

I have completed the work on this and pushed the code to production. We can now close this. In partial chat sync, I have made it to mark chat as delivered so that we don't have to call separate API to update status. It would reduce network calls to server.

@sumairasaeed please update your part in iOS.

Note: I am not using date as parameter as it was giving duplicate messages and was creating problems.

jekram commented 7 years ago

@sojharo please update the design document to reflect these changes

sojharo commented 7 years ago

G I have updated the document in our REST API swagger docs.

https://github.com/Cloudkibo/KiboChatAPI/blob/master/dist/server/api/KiboChatApi.yaml

jekram commented 7 years ago

Thanks