aws-amplify / amplify-swift

A declarative library for application development using cloud services.
Apache License 2.0
456 stars 198 forks source link

Datastore offline synch behaves differently between simulator and device #1324

Closed testflyjets closed 3 years ago

testflyjets commented 3 years ago

Describe the bug

We were using AutoMerge as our conflict resolution strategy. When the app was offline, if multiple different changes were made to a model and saved to the datastore, it appeared that only the first change was actually synched back to Dynamo once the connection was reestablished. The behavior seemed to indicated that the first change was accepted by the resolver, which caused the version number to increment and the current data (what was in Dynamo + the one new change) to be synched back to the device, overwriting all the other changes that were made offline.

We switched to using a simple custom Lambda for conflict resolution, one that essentially just RESOLVES all changes:

exports.handler = async (event, context, callback) => {
  console.log("AppSync conflict event received {}", JSON.stringify(event, 3));

  if (
    event.resolver.field.startsWith("add") ||
    event.resolver.field.startsWith("create") ||
    event.resolver.field.startsWith("update")
  ) {
    console.log("add/update conflict, accepting changes");
    return {
      action: "RESOLVE",
      item: event.newItem,
    };
  } else if (event.resolver.field.startsWith("delete")) {
    console.log("delete conflict, accepting delete");
    return {
      action: "REMOVE",
    };
  } else {
    console.log("Unexpected conflict {}", event.resolver.field);
    throw new Error("Unknown Resolver");
  }
};

With this strategy in place, running the app in the iOS simulator it works exactly as we'd hoped -- all of the changes that were made while offline all get synched to Dynamo.

However, running the exact same code on an actual device, we are seeing the original behavior, in which only the first change that is made while offline is synched to Dynamo when connectivity is restored.

We also noted what seemed to be odd behavior depending on how "offline" was established on the device. Using Airplane Mode instead of turning Cellular Data off directly seemed to cause the app to show somewhat random synching behavior while just turning off Wifi and Cell Data directly resulted in repeatable behavior.

Steps To Reproduce

Steps to reproduce the behavior:
1. Run the app in the iOS simulator with a simple custom Lambda for conflict resolution.
2. Turn off all network connections on the computer.
3. Make multiple changes to one model record in the app that are saved to the local datastore.
4. Restore the computer's network connection.
5. See that all changes to the model are properly synched to Dynamo.
6. Run the same app on an iOS device. 
7. Turn off WiFi and cellular data on the device and turn off all network connections on the computer.
8. Make multiple changes to one model record in the app.
9. Restore the device's and computer's network connectivity.
10. Observe that only the first change that was saved is synched to Dynamo.

Expected behavior

We expect the application to behave the same in both the iOS simulator and the device.

Amplify Framework Version

1.9.1

Amplify Categories

DataStore

Dependency manager

Swift PM

Swift version

5.3.2

CLI version

4.47.1

Xcode version

12.4 (12D4e)

Relevant log output

N/A

Is this a regression? (i.e. was this working before a version upgrade)

No.

Device

iPhone 12 simulator and iPhone 12 device

iOS Version

14.6

Specific to simulators

No response

Additional context

No response

palpatim commented 3 years ago

Since the Lambda resolvers are invoked by AppSync, I'm not sure what is implicating the Amplify iOS library code here. Are you able to turn on CloudWatch logging for your AppSync API to confirm that your custom resolver is in fact being invoked?

testflyjets commented 3 years ago

Whether the custom resolver runs or not doesn't seem to be the issue here. The issue is that the behavior of AppSync in terms of how changes made offline get synched to the cloud is different when we run the app in the iOS simulator versus when we run it on the device.

What method do you suggest to simulate loss of network connectivity in the simulator that would be similar to turning off WiFi and cellular data on a device?

royjit commented 3 years ago

I was not able to replicate the issue in a sample app. Here are the steps followed:

  1. Created a sample app with Todo model
  2. Added custom lambda resolvers as mentioned in this issue
  3. Ran the app in device and saved a todo model named "Todo 3"
  4. Turned off the wifi and data in the device
  5. Updated the description of the todo multiple times
  6. Turned on the wifi and data
  7. Checked the dynamodb to find the last updated value is present in the dynamo db

Logic used to update multiple times

queryAndUpdate()

func queryAndUpdate(_ num: Int = 0) {

        Amplify.DataStore.query(Todo.self, where: Todo.keys.name == "Todo 3") { result in
            switch result {
            case .success(let todos):
                print("Todo fetched successfully! \(todos)")
                var todo = todos[0]
                todo.description = "Description \(num)"
                update(todo: todo)

            case .failure(let error):
                print("Error fetching todo \(error)")
            }
        }
    }

    func update(todo: Todo) {

        Amplify.DataStore.save(todo) { result in
            switch result {
            case .success:
                print("Todo saved successfully!")
                self.num = self.num + 1
                if (self.num < 30) {
                    self.queryAndUpdate(self.num)
                }
            case .failure(let error):
                print("Error saving todo \(error)")
            }
        }
    }
royjit commented 3 years ago

Few questions to help us debug further:

  1. Are you seeing any error in the Xcode console while saving or syncing?
  2. Any runtime error in lambda resolvers that prevents from saving all the updates?
  3. Can you see if the issue still exists if you use the latest Amplify library 1.12.0. I see few datastore fixes after 1.9.1 which might be related to the issue.
testflyjets commented 3 years ago

Hi, @royjit, thanks for doing these tests. If you are able to share this ToDo app with me I will run it locally and see if I can replicate your results.

As for seeing errors in Xcode, I don't see any exceptions per se. However, I do see some odd behavioral differences between running the app on the device and in the iOS simulator. I added logging of many of the DataStore events to try to understand precisely what is happening in the app around the synchronization of changes to AWS.

For example, in the simulator, when I turn WiFi off on my laptop, the DataStore networkStatus event indicates the loss of connectivity. However, when the network connection is restored the networkStatus event indicating restoration of connectivity is not fired. On the device the networkStatus event is generated and logged immediately on loss and regain of network connectivity. My concern is that Amplify behaves differently if it thinks the device is connected to the network, and the "correct" behavior we're seeing in the simulator is actually a side-effect from a bug!

Also, when connectivity is restored in the simulator, the outboxMutationProcessed event fires for the first update, but the outboxStatus event shows the outbox is NOT empty, presumably because the subsequent changes need to be processed. I then see an additional outboxMutationEnqueued event fire that contains the additional changes that I made.

On the device, after the outboxMutationProcessed event the outboxStatus shows the DataStore outbox is empty. Therefore, the additional outboxMutationEnqueued event never fires, and if the additional changes are never queued for synch back to Dynamo, the data doesn't get updated. Again, this is my interpretation of what I think I'm seeing.

The Lambda resolver only executes when the second set of changes are sent to AWS via the iOS simulator, and there are no errors present. Since the app running on the device doesn't trigger a "second" update there are no conflicts to resolve (my assumption here).

I did update to amplify-ios 1.12.0 to see if this problem went away. Sadly, it did not. I saw the same datastore changes and was hopeful but alas, no such luck.

Please let me know if there's anything else I can do to help you move forward.

palpatim commented 3 years ago

I've been able to reproduce this locally and I think I have an idea what's going on. Will update as soon as I have more information.

testflyjets commented 3 years ago

Awesome! Great news, and we look forward to your next update. I'm in the process of detailing the sequence of events and log messages I see in the two cases (iOS sim vs. device). I was going to attach that to the open AWS Support case, and am happy to provide it here as well. Let me know if that would be helpful.

palpatim commented 3 years ago

I'll see it either way, so whichever is most convenient for you. :)

palpatim commented 3 years ago

I'm still investigating this, but here are my findings so far:

Discussion

The Sync Engine has a cleanup routine it runs during initial startup that sets locally persisted MutationEvents to have an inProcess status of false. This is to recover from cases where the app was terminated while the mutation queue was processing an event. Upon restart, the OutgoingMutationQueue picks up the first not-in-process item in the queue and sends it to the backend.

This is all working as expected and designed.

However, we use the same cleanup routine in no-network situations, and there is a problem with some of our assumptions about the system state at that point. The behavior I'm seeing is:

What's not yet clear is the fix. It could be that we shouldn't clean up in-process events for no-network situations; or that we should change our disposition logic to simply enqueue all mutation events regardless of whether there is any previous event; or some other fix. We are discussing the right way forward and will update when we have more info.

testflyjets commented 3 years ago

Impressive -- thanks for the deep dive, and for the test you wrote for this. Your detailed observations track precisely with the behavior we have observed. We are standing by for a fix or a workaround.

palpatim commented 3 years ago

We believe the right thing to do here is to cancel the in-process, in-memory operation when the subscription receives a network error. We're working on implementing that in a safe way. Updates as we get them.

testflyjets commented 3 years ago

Awesome, thanks!

palpatim commented 3 years ago

We attempted to remedy this with a targeted fix, but it doesn't cover edge cases the way we want. We're now trying to introduce a new "pausing" state that is distinct from the previous one, that safely removes in-process sync operations from the OutgoingMutationQueue. That's a larger change, and we're still in process on it.

palpatim commented 3 years ago

We have something working locally, which you can view on our working branch if you'd like to preview it. Unit tests pass, and we're running through integration tests now to ensure everything works as expected.

testflyjets commented 3 years ago

@palpatim I am trying to use the Swift Package Manager to pull in your hotfix branch. It seems to pull it in correctly when I point it at your branch but I'm seeing an error when I build the code with that version of Amplify. It's complaining about something that appears related logging in the Cognito Auth plugin. Is this perhaps some leftover debug logging?

Screen Shot 2021-07-29 at 9 05 53 AM

Also, the package manager doesn't give me the option to select the plugins per the Amplify documentation here. I'm relatively new to using SPM for managing dependencies -- is there something else I need to do here to try this fix?

testflyjets commented 3 years ago

I commented out the two lines that Xcode was complaining about and the code runs now, FYI.

palpatim commented 3 years ago

Thanks--the latest push to that branch fixes that problem (I made the change in a different target than was active, so didn't see the compile error).

testflyjets commented 3 years ago

No worries. Good news! The fix definitely works on the device based on our testing just now.

testflyjets commented 3 years ago

Quick question: should Amplify Storage work the same way, in that a file upload requested while offline will resume once connectivity is reestablished?

We are calling Storage like this:

let storageOperation = Amplify.Storage.uploadData(key: key, data: imageData, options: options)

We aren't seeing the upload retried once the app is reconnected currently, but it's not clear from the documentation if it should work that way. We are hoping it does, and I figure you'd know.

palpatim commented 3 years ago

I think that Storage is supposed to allow you to queue an upload while offline, and then fulfill it when network resumes, but I'm not 100% sure on that. @brennanMKE, can you clarify expected behavior for Amplify.Storage?

palpatim commented 3 years ago

Regarding the original DataStore issue: the team is reviewing the PR. Assuming all looks good, the fix for this will go out in a patch release.

brennanMKE commented 3 years ago

@testflyjets

You can skip to the recommendation in the last section if you like. This first section offers some context for the AWS SDK and Amplify SDK.

Context

In the underlying implementation you can see it allows for passing in values for allowsCellularAccess and timeoutIntervalForResource. This configuration is used for all network activity run through Transfer Utility. You can see the default for allowsCellularAccess is set in AWSNetworking to allow by default. Without it all of these network tasks would not be allowed unless WiFi is connected. It could be set in amplifyconfiguration.json or awsconfiguration.json.

From the Amplify Storage side you can see the upload options do not include these properties so it is not possible to specify unique behavior per request.

There is a property called waitsForConnectivity which will either fail immediately if there is no access or queue the upload and let it start when a viable network path is available. All requests run by Transfer Utility are created with a background session and so the last line is relevant here.

This property is ignored by background sessions, which always wait for connectivity.

Recommendation

You can take control of this behavior by directly starting or canceling network operations based on connectivity status.

If your Deployment Target is iOS 12.0 or later you could use Apple's Network framework to monitor the network path and start or cancel operations in response to changes. If you need to support earlier versions you could use a package that I prepared called Connectivity which falls back to using the Reachability library. It does not work as well as Apple's modern framework designed for mobile scenarios which were not common when Reachability was first created, but your app metrics likely show a small percentage of your users are using a version of iOS before 12. It will be sufficient for those users while giving the majority of your users the optimal behavior.

That package includes a demo app that you can use to observe changes in your device's network path. You can put the app on an iPhone, disable the cellular radio and walk out of range of your WiFi network to see it change, or just turn on Airplane Mode. One feature you'll likely want to use to benefit your users is isExpensive. It is designed to be intelligent about the current network path. If you are on regular WiFi the value will be false. If you are a cellular connection, which likely involves data rates, it will be true. And if you happen to be using a WiFi hotspot which connects via cellular the value will also be true. This is one feature of the Network framework which is missing from Reachability. This package sets this property based on the network status alone with the Reachability implementation. You can avoid large uploads when your users are not on WiFi and save them both battery life and potentially higher data rates.

When the network path changes from .unsatisfied to .satisfied your network operations can run while if the opposite happens your can choose to cancel those network activities and start them again when a viable path is available. In my experience these interruptions are often momentary, like going through a tunnel or riding in an elevator. And when I get off the elevator it also means I will switch from cellular to WiFi or vice versa. You can make your code behavior exactly how you choose with this framework.

testflyjets commented 3 years ago

@brennanMKE Thanks for the context, it's helpful. So as not to pollute the original DataStore issue with this Storage issue, I'll open a new issue here so we can continue this conversation separately. We are not seeing Storage retry a failed upload when connectivity is restored, and if I'm understanding what you wrote above, it should.

palpatim commented 3 years ago

We've merged the fix for this issue into main, but are still working through some sporadic test failures and fixing a data race that we found as well. If you'd like to test the main branch, it should have the outgoing mutation queue fix applied, but not the data race fix.

palpatim commented 3 years ago

We fixed the issues we found and things seem solid now. We'll get this released to prod as soon as possible.

palpatim commented 3 years ago

We released this to prod in 1.13.1. I'm optimistically marking this as "closing soon", but please let us know if you have any other issues!

testflyjets commented 3 years ago

Great, thanks, we'll update to that production release. We very much appreciate all the work you and your team did on this issue.

On Fri, Aug 6, 2021 at 8:11 AM Tim Schmelter @.***> wrote:

We released this to prod in 1.13.1 https://github.com/aws-amplify/amplify-ios/releases/tag/v1.13.1. Please let us know if you have any other issues.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/aws-amplify/amplify-ios/issues/1324#issuecomment-894327728, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABHGKHJMSOMIR6EI2SHMITT3P3SBANCNFSM5AP5ZA3A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

diegocstn commented 3 years ago

@testflyjets I'll optimistically close this, but please feel free to reopen if v1.13.1 didn't fully solve your issue