firebase / firebase-ios-sdk

Firebase SDK for Apple App Development
https://firebase.google.com
Apache License 2.0

Firebase Storage: Support scheduling uploads while offline #6451

Open willbattel opened 3 years ago

willbattel commented 3 years ago

[REQUIRED] Step 1: Describe your environment

[REQUIRED] Step 2: Describe the problem

Full disclosure: I'm having a hard time reproducing this issue. Additionally, while I am fairly confident that the problem lies in a recent Firebase SDK update, I can't prove it. Therefore, I am raising this issue with the information I have in hopes of getting the opinion(s) of the Firebase iOS team.

Our app allows users to create posts with an image that we upload to Firebase Storage. Until recently we were using SDK Version 6.18.0 without issue. In July we updated our app to use SDK Version 6.28.0 and rolled it out to production after not finding any problems during testing. After doing this, we noticed that some images were unexpectedly not being uploaded. Of the posts being created, approximately 1 in 10 are missing their image, which will become a big problem if we can't quickly resolve it. After some review, we determined that there were no code changes on our end that should have had any impact.

The only changes to our app at this time were dependency updates, of which Firebase was one of three. The second SDK was for Branch and the third for KeychainAccess, both seemingly unrelated. We recently updated from 6.28.0 to 6.30.0 and noticed no effect on image uploading. While updating the SDK, we also threw in a Performance trace to profile the image upload task so that we could see how long the upload process was taking. See the Relevant Code section for the implementation of this trace. As for the results, we now have a few hundred samples and the metrics are interesting.

The images being uploaded have a median size of 574 KB, with a median upload duration of 2.18 seconds as shown below.

[Screenshot: Performance trace showing median image size and upload duration]

The median on WiFi is good, but the 95th percentile is 27.62 seconds which seems unusually long. When looking at non-WiFi, the median is 19 seconds with a 75th percentile of 66 seconds and 95th percentile of 272 seconds. Of course we expect that users with poor service will experience longer uploads, but these numbers seem excessive.

[Screenshot: upload duration percentiles for WiFi vs. non-WiFi]

Finally, we can also look at error rates. Occasionally we see an error in the closure that we record with Crashlytics (50 occurrences in the last 90 days), where the localized description is always "An unknown error occurred, please check the server response." We're not really sure what this means, but it is not a new issue: it was occurring prior to our upgrade from 6.18.0 and is still occurring at a similar frequency in 6.30.0.

Looking at the upload duration split up by success/error, it would appear at first glance that the upload duration where the response was Success is very low. However, it only appears that way because the time scale is skewed- as the 95th percentile for duration with Failure responses is 9,687 seconds (yes, that's close to 3 hours). The 75th percentile for Failure is 3,777 seconds, also unusually long.

[Screenshot: upload duration split by Success vs. Failure result]

The 95th percentile for Success responses is 66 seconds, which demonstrates that even successful uploads can sometimes take a very long time. Knowing this, and knowing that the occasional error response we are getting is much less frequent than the silent upload failures we're more recently experiencing, our current theory is that the image sometimes takes so long to upload that the user exits the app before the operation completes, and the upload may cease when the app resigns its active state and/or terminates.

Unfortunately, we only implemented this trace after we started noticing the problem- so we don't have data from before to compare it to. We can only venture to guess that these numbers we are seeing now are unusual.

Now, I've already checked the Release Notes history to see if there were any interesting notes for the Storage module, but what I am going to do next is study every module to see if downgrading the SDK is an option for us to troubleshoot. If this were consistently reproducible then we could test it easily in a development environment. Instead, because of our difficulty reproducing it, we would need to deploy the downgraded SDK to production. Doing this may be totally fine, but I am wary of causing problems because we'd be reverting half a year of SDK updates with several feature additions and bug fixes that may be relevant to us.

I'm curious if anyone on the Firebase team has any ideas about this. In the meantime, we'll keep poking around and will report if we find anything further. As a workaround and safeguard, we're also going to try caching the image locally until we receive a successful response; a rough sketch of that idea is below.
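For concreteness, here's roughly what we have in mind. This is only a sketch of our own workaround; the PendingUploadCache type, its directory layout, and the helper names are placeholders we made up, not anything from the Firebase SDK:

import Foundation
import FirebaseStorage

// Sketch of the "keep a local copy until Storage confirms success" safeguard.
// PendingUploadCache and its directory layout are our own placeholders.
enum PendingUploadCache {
    static let directory: URL = {
        let dir = FileManager.default
            .urls(for: .applicationSupportDirectory, in: .userDomainMask)[0]
            .appendingPathComponent("PendingUploads", isDirectory: true)
        try? FileManager.default.createDirectory(at: dir, withIntermediateDirectories: true)
        return dir
    }()

    // Persist the image bytes before starting the upload; remove the copy only
    // after Storage reports success, so a failed upload leaves evidence behind.
    static func upload(_ imageData: Data, to storageRef: StorageReference) {
        let localCopy = directory
            .appendingPathComponent(UUID().uuidString)
            .appendingPathExtension("jpg")
        try? imageData.write(to: localCopy, options: .atomic)

        let metadata = StorageMetadata()
        metadata.contentType = "image/jpeg"

        storageRef.putData(imageData, metadata: metadata) { _, error in
            if error == nil {
                // Upload confirmed; the local safety copy is no longer needed.
                try? FileManager.default.removeItem(at: localCopy)
            }
            // On error the file stays on disk so a later retry pass can pick it up.
        }
    }
}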

Steps to reproduce:

No reliable method of reproduction, see Relevant Code for... well... the relevant code.

Relevant Code:

import UIKit
import FirebaseStorage
import FirebasePerformance
import FirebaseCrashlytics

guard let imageData = someUIImage.jpegData(compressionQuality: 0.9) else { return }

let storageRef = Storage.storage().reference().child("foobar.jpg")
let metadata = StorageMetadata()
metadata.contentType = "image/jpeg"

// Trace the upload and record the payload size as a custom metric.
let uploadTrace = Performance.startTrace(name: "uploadImage")
uploadTrace?.setValue(Int64(imageData.count), forMetric: "Size")

storageRef.putData(imageData, metadata: metadata) { (metadata, error) in
    if let error = error {
        // Tag the trace as a failure and report the error to Crashlytics.
        uploadTrace?.setValue("Failure", forAttribute: "Result")
        uploadTrace?.stop()
        Crashlytics.crashlytics().record(error: error)
    }
    else {
        uploadTrace?.setValue("Success", forAttribute: "Result")
        uploadTrace?.stop()
    }
}
ryanwilson commented 3 years ago

Hi @willbattel thanks for the detailed report and sorry for the issue you're seeing.

There weren't too many changes between those two versions and nothing sticks out to me as a potential issue, but it's worth double-checking these to see if anything changed that could affect this.

git diff 6.18.0...CocoaPods-6.28.0 -- FirebaseStorage/CHANGELOG.md
diff --git a/FirebaseStorage/CHANGELOG.md b/FirebaseStorage/CHANGELOG.md
index 4a8af8699..f3edd66e9 100644
--- a/FirebaseStorage/CHANGELOG.md
+++ b/FirebaseStorage/CHANGELOG.md
@@ -1,3 +1,18 @@
+# 3.8.0
+- [changed] Add error for attempt to upload directory (#5750)
+- [changed] Functionally neutral source reorganization. (#5851)
+
+# 3.7.0
+- [fixed] Fixed a crash when listAll() was called at the root location. (#5772)
+- [added] Added a check to FIRStorageUploadTask's `putFile:` to check if the passed in `fileURL` is a directory, and provides a clear error if it is. (#5750)
+
+# 3.6.1
+- [fixed] Fix a rare case where a StorageTask would call its completion callbacks more than
+  once. (#5245)
+
+# 3.6.0
+- [added] Added watchOS support for Firebase Storage. (#4955)

(Leaving CHANGELOG.md out of the above command will show the actual contents of the changes between the two versions.)

Can you share what version of GTMSessionFetcher is being used? Ideally both before and after the upgrade. As far as I recall off the top of my head, Storage uses GTMSessionFetcher under the hood for almost all networking interactions.

One thing that just popped into my head as well: do you have iOS versioning stats for this? Considering iOS 14 beta is still out and growing, I'm wondering if this was introduced in iOS 14 and just happened to line up with your upgrade timing. (Side note: something very well could be wrong inside the Storage SDK and I'm not trying to pass the blame, just trying to see if there are patterns that we can use to help make whatever changes are needed in the SDK 😄 )

We had also seen an issue at one point where some networks were blocking access to Storage (#4913 where it happened on Comcast) and I'm wondering if something similar is going on as well. Either way, that extra info would help and we can try to see if there's anything that could be causing this.

As always, if you happen to find a repro it'll vastly accelerate us being able to track things down. Let's hope we can get your issue fixed soon, thanks!

willbattel commented 3 years ago

Thanks for the comments, Ryan. I'll collect more info and report back later this week.

willbattel commented 3 years ago

Hey @ryanwilson I've done a little more digging.

Can you share what version of GTMSessionFetcher is being used? Ideally both before and after the upgrade. As far as I recall off the top of my head, Storage uses GTMSessionFetcher under the hood for almost all networking interactions.

GTMSessionFetcher was upgraded from 1.3.0 to 1.4.0 during this time. Additionally, we discovered that the problem may have existed in some capacity when we were on SDK version 6.18.0 after upgrading from 6.13.0.

As a means of troubleshooting, to narrow down the possibilities, we are releasing a patch build to production where we have forced GTMSessionFetcher back to 1.3.0. We will monitor the Performance trace to see if there is any difference.

One thing that just popped into my head as well: do you have iOS versioning stats for this? Considering iOS 14 beta is still out and growing, I'm wondering if this was introduced in iOS 14 and just happened to line up with your upgrade timing.

It's a good thought for sure, but this problem has been reported on iOS 11, 12, and 13.

We had also seen an issue at one point where some networks were blocking access to Storage

I wondered about that too- I had read that issue prior to posting this one. The problem is so indiscriminate and inconsistent that I'd be surprised if it were related. That being said, the fact that reproduction is so difficult for our team makes it a possibility- however unlikely.

As always, if you happen to find a repro it'll vastly accelerate us being able to track things down.

This is what's really grinding my gears haha! There are so many occurrences but we can't find a pattern.


There is one more possibility that hadn't occurred to us until recently: the problem may have always existed, but went unnoticed until two things changed on our end. The first change was that our user base grew significantly during this time, so there were more opportunities for the problem to happen. The second, more direct reason, I describe below.

As mentioned in my original comment, we are uploading these images as components on user-generated posts. The post contents are kept in Firestore with a URI field to the image in GCS. Google does not support cross-product transactions, meaning we can't atomically write both the Firestore document and GCS image together- they have to be separate operations.

Prior to this problem being noticed, we had always uploaded the image to GCS and awaited a successful response before writing the metadata to Firestore. This ensured that the image was uploaded when we created the Firestore document. The problem with this flow was the possibility of orphaned images, where after uploading the image the Firestore operation doesn't complete.

As a measure of good data hygiene, we shifted to running the Firestore operations FILO relative to their GCS counterparts (i.e., issuing the Firestore write before the Storage upload). This ensured that in the event of any orphanage, the orphaned data would be on Firestore and therefore much easier to detect and handle. As far as we know, there isn't a very efficient way to find orphaned objects in GCS. On paper, having Firestore run FILO seems like a good idea, because we can be confident that we aren't leaving any abandoned data behind.
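To illustrate the reordering, the new flow looks roughly like this. The "posts" collection, the "imagePath" field, and the path scheme are placeholders, not our real schema:

import Foundation
import FirebaseFirestore
import FirebaseStorage

// Sketch of the Firestore-first flow: the document is written before the image
// upload, so a failed upload leaves an easy-to-find orphan document rather than
// an unreferenced object in GCS.
func createPost(text: String, imageData: Data) {
    let db = Firestore.firestore()
    let postRef = db.collection("posts").document()
    let imagePath = "postImages/\(postRef.documentID).jpg"

    // 1. Write the Firestore document first, referencing where the image will live.
    postRef.setData([
        "text": text,
        "imagePath": imagePath
    ]) { error in
        guard error == nil else { return }

        // 2. Upload the image second; if this silently fails, the document above
        //    points at an image that never arrived.
        let metadata = StorageMetadata()
        metadata.contentType = "image/jpeg"
        Storage.storage().reference(withPath: imagePath)
            .putData(imageData, metadata: metadata) { _, uploadError in
                if let uploadError = uploadError {
                    print("Image upload failed: \(uploadError)")
                }
            }
    }
}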

Of course, this change surely could have brought this issue to light. The problem may have very well always existed, but it would not have been noticed because we were only creating the Firestore document upon successful GCS operations. The problem would have been masked because the Firestore document- which is what represents the post- would not have been created if the upload didn't succeed.

Assuming that this is indeed the defining difference that exposed a problem that had already existed quietly, the question changes to how we address it. If you look at the code snippet in my original comment, we aren't doing anything fancy. It's strange to us that uploads averaging well under a megabyte are failing this frequently. We're not really sure how to make the process more resilient than it already is, other than maybe manually caching the images ourselves and retrying the upload operation periodically until it succeeds- but this would seem redundant to the mechanisms already present in the Firebase Storage SDK.

So the big question is: why are 1 in 10 upload operations failing on what should be a resilient platform? Our users aren't all farmers living in the middle of nowhere. Also, we can practically see the Storage SDK retrying, given the metrics tagged as "Failure" showing close to 3 hours of upload time before it finally gives up.

The only course of action that we would expect to reverse this would be to revert to GCS-first operations, knowing that we are only masking the problem and not truly solving it. We don't see any clear solution- other than Google supporting Firestore+GCS atomicity which we don't see happening any time soon (which is a shame because I think those products are very frequently paired together). We could send both the image and metadata to a Cloud Function to be handled there, but if the Storage SDK can't upload the image then odds are the Cloud Function request won't make it to the server, either.

These questions I've asked are all rhetorical- we don't expect anyone at Firebase to know why this is happening. But if you have any suggestions, we are all ears.

willbattel commented 3 years ago

Speaking further to this idea, there has been some further discussion on our end that I would like to share.

We believe that, while there may or may not be a problem with the Storage SDK, what is really exacerbating the situation we are seeing is the fundamental difference in how the Firestore and Cloud Storage clients work.

Firestore is quite intelligent; its offline persistence feature ensures that the local writes are available to the client for reading regardless of connectivity, and that the writes will eventually propagate to the server when the app is back online. When the client creates a new document in Firestore while internet connection is limited or unavailable, that document will be readable in much the same way as when using a high quality connection. This feature is fantastic...but can lead to friction when paired with other products such as Storage.

Firebase Storage is much simpler than Firestore when it comes to read/write operations. While the Storage SDK is designed to be robust, it does not share the offline persistence mechanism found in Firestore. This means that when an upload fails, that's it. There are no do-overs or eventual consistency/propagation like in Firestore. What's more, there is no concept of "local storage" in the Storage SDK, so if we want to download an object that was very recently uploaded from the same client- even the same session- the Storage SDK will retrieve it from the server in the same manner as if the object had never at any point existed on the client.

In a perfect world, cross-product transactions would be supported and the Storage SDK would keep local copies of recently uploaded files to be used in download operations. Because our world is far from perfect, the best we can do is ask for these nicely and try to make it work as-is. Our current plan is to keep local copies of the images we upload until the operation completes successfully. On each session start we will retry the upload for any remaining images. Then, a new boolean field can be added to the Firestore documents to indicate if the image has been uploaded- so that we can more gracefully handle cases where the image upload is still pending.
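A rough sketch of the retry pass we're planning, assuming each pending file is stored locally under the Firestore document ID it belongs to. The "imageUploaded" flag, the "posts" collection, and the path scheme are our own hypothetical names, not SDK behavior:

import Foundation
import FirebaseFirestore
import FirebaseStorage

// Sketch of the on-launch retry pass: re-upload any cached images whose uploads
// never completed, then mark the corresponding documents as complete.
func retryPendingUploads(in pendingDirectory: URL) {
    let files = (try? FileManager.default.contentsOfDirectory(
        at: pendingDirectory, includingPropertiesForKeys: nil)) ?? []

    for fileURL in files {
        let documentID = fileURL.deletingPathExtension().lastPathComponent
        let storageRef = Storage.storage().reference(withPath: "postImages/\(documentID).jpg")
        let metadata = StorageMetadata()
        metadata.contentType = "image/jpeg"

        // putFile streams from disk, so the cached copy itself is the upload source.
        storageRef.putFile(from: fileURL, metadata: metadata) { _, error in
            guard error == nil else { return }
            // Flip the pending flag and drop the local copy once the upload lands.
            Firestore.firestore().collection("posts").document(documentID)
                .updateData(["imageUploaded": true])
            try? FileManager.default.removeItem(at: fileURL)
        }
    }
}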

Let me know what you think about all of this. Firestore and Storage are used together so commonly that we can't help but wonder what patterns other developers have adopted, or what is recommended by Firebasers.

schmidt-sebastian commented 3 years ago

I will take a look at this next week.

schmidt-sebastian commented 3 years ago

@willbattel Sorry for being a bit slow here. The sheer amount of very useful information scared me away a little bit. If I summarize this in my own words: we have now come to the conclusion that, while we have more metrics, the problem likely exists in all versions of the SDK and all versions of iOS.

There seem to be a couple of ways forward:

willbattel commented 3 years ago

No worries about the delay, we've started developing a partial workaround in the meantime. We're going to cache a limited number of recent uploads and attempt to use the cache before downloading from Storage. We're also going to add a pending flag to associated Firestore documents, with a garbage collector that periodically removes documents whose images never made it to the server.
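The read side of that workaround looks roughly like the sketch below. localCacheURL(for:) stands in for our own recent-uploads lookup and is not a Storage API:

import Foundation
import FirebaseStorage

// Sketch of "check our own cache before hitting Storage": serve bytes we just
// uploaded (or queued) straight from disk, and only fall back to a download.
func loadImageData(path: String,
                   localCacheURL: (String) -> URL?,
                   completion: @escaping (Data?) -> Void) {
    if let cached = localCacheURL(path), let data = try? Data(contentsOf: cached) {
        completion(data)
        return
    }
    Storage.storage().reference(withPath: path)
        .getData(maxSize: 5 * 1024 * 1024) { data, _ in
            completion(data)
        }
}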

I agree that offline support, at least to some degree, would be a great addition to the Storage SDK. I also suggest starting a discussion on improving the synergy between products, such as between Firestore and Storage. This is a much taller order but it seems that these products are used together so often that the added value would be worth the effort. This is outside the scope of this issue so I won't press it any further here, but if you're open to continuing the conversation I'd be more than happy to add further information.

Regarding the maxUploadRetryTime setting, we have not yet tried this but may do so just to see what happens. I'll let you know if there is anything to report from this.

Regarding the 10% failure rate, we're not seeing many errors. We record failures with Crashlytics, but we only get the occasional "An unknown error occurred, please check the server response", which only happens maybe 1 in 500 uploads and doesn't really tell us anything. It seems like the app tends to be resigned, either to the background or terminated altogether, before the Storage SDK gives up on the upload attempt, which would explain the Performance metrics above. Maybe maxUploadRetryTime can help force an error rather than letting Storage run with it for hours and hours.
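For reference, the knob we're thinking of trying is set on the Storage instance and would look something like this; the 60-second value is just an arbitrary example we might experiment with:

import FirebaseStorage

// Cap the retry window so a stalled upload surfaces as an error instead of
// retrying for hours. 60 seconds is an arbitrary example value.
let storage = Storage.storage()
storage.maxUploadRetryTime = 60
storage.maxDownloadRetryTime = 60   // the analogous cap for downloads, if we want it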

schmidt-sebastian commented 3 years ago

We also received an internal report: b/172486653

schmidt-sebastian commented 3 years ago

One follow up: Are you using https://developer.apple.com/documentation/foundation/nsurlsessiontaskdelegate/1643148-urlsession directly or pulling in dependencies that might be collecting metrics using the URLSession:task:didFinishCollectingMetrics API? GTMSessionFetcher just added support for this API endpoint (via https://github.com/google/gtm-session-fetcher/issues/187).

willbattel commented 3 years ago

One follow up: Are you using https://developer.apple.com/documentation/foundation/nsurlsessiontaskdelegate/1643148-urlsession directly or pulling in dependencies that might be collecting metrics using the URLSession:task:didFinishCollectingMetrics API? GTMSessionFetcher just added support for this API endpoint (via google/gtm-session-fetcher#187).

No, we aren't using that API- though it might be a good idea for us to do so.
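If we do adopt it, we'd expect it to look something like the sketch below. This would cover requests we issue ourselves; we don't assume it hooks into the sessions the Storage SDK manages internally:

import Foundation

// Sketch of collecting per-task transfer metrics via URLSessionTaskMetrics.
final class MetricsLoggingDelegate: NSObject, URLSessionTaskDelegate {
    func urlSession(_ session: URLSession,
                    task: URLSessionTask,
                    didFinishCollecting metrics: URLSessionTaskMetrics) {
        // taskInterval covers the whole task, including retries and redirects.
        print("Task took \(metrics.taskInterval.duration)s over \(metrics.transactionMetrics.count) transaction(s)")
    }
}

let session = URLSession(configuration: .default,
                         delegate: MetricsLoggingDelegate(),
                         delegateQueue: nil)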

The good news is that, for reasons unbeknown to us, the upload failure rate seems to have reduced. Furthermore, we believe that many of our upload problems are actually caused by the app being resigned to the background or being terminated before the upload completes. The resolution of https://github.com/firebase/firebase-ios-sdk/issues/147 would then likely help alleviate some of this problem. Additionally, when the app is terminated, we would like for the Firebase Storage SDK to automatically resume/restart pending upload requests when the app is launched. The other side to this problem is where we sometimes want to immediately access a file that has been queued for upload- regardless of whether the upload has finished (or even started).

Some of these requests have reasonable workarounds, so the situation isn't dire by any means. However, these workarounds require a harmonious dance between the Storage SDK and our own caching/retry system that, in our opinion, would be better handled entirely within the Storage SDK.

Going forward from here, I'd like to:

  1. Change this issue into a standing feature request for offline / immediate access support (if there is not already an open issue for this) where we can "download" files that have been uploaded from the same device through a cache in the Storage SDK. Optionally, if you would prefer, I could instead close this issue and open a new one for the FR;
  2. Show our interest for the resolution of https://github.com/firebase/firebase-ios-sdk/issues/147;
  3. Request that the Storage SDK documentation be updated to include more specific information about transfer behavior throughout the app session lifecycle (foreground → background → termination → foreground);
  4. Indicate our interest for resumed/restarted uploads on app launch if not already supported (I had a hard time finding a concrete answer, and testing in-app behavior was sometimes inconsistent); and
  5. If our original problem persists, gather more information (such as with the didFinishCollectingMetrics API) and then open a new, more purposeful and specific bug report issue.
schmidt-sebastian commented 3 years ago

Thanks for the detailed update!

I changed this issue to a feature request. Offline support is a pretty popular feature, and we use input like yours to prioritize this feature against work on other products. Unfortunately, this is likely going to be a pretty big undertaking (we would want it to work seamlessly in all feasible scenarios), so any workarounds on your end will probably have to stay in place for a while.

I filed a separate issue (internal) to update our documentation - b/174154547

Doppelklick commented 1 year ago

@schmidt-sebastian can you give us an update about this feature request?

paulb777 commented 1 year ago

@tonyjhuang Any updates?

tonyjhuang commented 1 year ago

Hi, we're tracking this feature request internally but unfortunately I don't have an ETA to share at this time.

Doppelklick commented 1 year ago

Hi, we're tracking this feature request internally but unfortunately I don't have an ETA to share at this time.

@tonyjhuang is there any activity on this request?