braintree / braintree_ios

Braintree SDK for iOS
https://developer.paypal.com/braintree/docs/start/hello-client/ios/v5
MIT License
557 stars 291 forks source link

Remove config caching throttle delay #1287

Closed scannillo closed 2 months ago

scannillo commented 2 months ago

Edited, please re-read:

Summary

GOAL: The main goal of this PR is to remove the 100ms delay added in PR https://github.com/braintree/braintree_ios/pull/807 to our configuration fetching. At the same time, we want to make sure we don't re-introduce the increased "Network connection lost" errors that PR https://github.com/braintree/braintree_ios/pull/807 aimed to solve.

PROBLEM: The problem with this 100ms delay is that it occurs every time we BTAPIClient.fetchOrReturnRemoteConfig(), which is many times throughout the SDK. For example: in the PP Vault flow, this method is called a total of 6 times. This could be adding up to 600ms of delay.

BACKGROUND: We know that the URLCache logic implemented in PR https://github.com/braintree/braintree_ios/pull/789 caused an LE merchant to see an increase in "Network connection lost" errors. PR https://github.com/braintree/braintree_ios/pull/807 reduced the frequency of those errors by delaying all retrievals from the URLCache by 100ms.

This bug was hard to replicate consistently, as described by Jax below.

THEORY: After digging in about how URLCache works, I learned that it is a highly asynchronous queue. From this SO post, "You have to let the main run loop return before it will be available, and possibly even wait a little while".

I was able to write failing unit tests to witness this lag in URLCache, see commit bb7e2f21251d70f83d9e2195fe0c4daa47210c2b. Putting things in the queue and asserting their existence synchronously always failed. Until I added a DispatchQueue with a ~100+ms delay on the assertion statements, then the tests started passing. Uncanny! To me, this is a signal that our fix in #807 was addressing a symptom of the cache being updated too slowly and not the root issue.

PROPOSED CHANGES: In this PR, I move away from URLCache and to a basic NSCache setup via a singleton. NSCache is thread-safe, auto-evicts if app memory overloads, and is fast-access since it is in-memory storage only. I mirrored some of the logic in our Android SDK (see code). The failing unit tests from commit bb7e2f21251d70f83d9e2195fe0c4daa47210c2b are now able to pass, resolving the issue of URLCache working too slowly for our needs.

I feel like this knowledge & fix + the fact that 0.19% of Venmo iOS users are on old app versions (pre-loading screen), we are not at a risk to re-introduce the "Network connection lost" issue.

Changes

Testing

After extensive local testing with Xcode Instruments signposts, we are seeing a range of 15ms - 550ms reduction in elapsed time with this change. Across all trials, we see an average of -247ms reduction in elapsed time.

Checklist

Authors

@scannillo

scannillo commented 2 months ago

@sarahkoop or @jaxdesmarais - do you remember which combination of iOS versions & Venmo app versions that this delay was initially added for? With the current Venmo app I'm not able to replicate any issues. I'm hoping since this PR is 2 years old that the version of the Venmo app in question then has minimal to no active usage now.

sarahkoop commented 2 months ago

@sarahkoop or @jaxdesmarais - do you remember which combination of iOS versions & Venmo app versions that this delay was initially added for? With the current Venmo app I'm not able to replicate any issues. I'm hoping since this PR is 2 years old that the version of the Venmo app in question then has minimal to no active usage now.

It was definitely older versions, but I can't find a list of actual version numbers. I think it's likely very minimal, if any, traffic, but might be worth confirming with the Venmo team

scannillo commented 2 months ago

For some reason the Amex & 3DS UI tests are broken from this PR, but the Demo features are working OK on their own. Looking into it 🤔

scannillo commented 2 months ago

Ok, here's our breakdown of iOS version usage across the BT SDKs from Ian's dashboard:

iOS 17 = 81.95% iOS 16 = 14.89% iOS 15 = 1.50% iOS 14 = 0.04% iPadOS 17 = 1.22% iPadOS 16 = 0.32% iPadOS 15 = 0.07%

scannillo commented 2 months ago

It was definitely older versions, but I can't find a list of actual version numbers. I think it's likely very minimal, if any, traffic, but might be worth confirming with the Venmo team

I asked them in Slack! They said they'll have to do some digging to see which version the loading screen was added in.


I also found this in an old thread from you and Jax. It helps give an idea of what versions you tested on:

iPhone 12 mini + iOS 15.4.1 + Venmo app version 9.18.0: 0 failures in 100 attempts with fix iPhone 6 + iOS 13.2.2 + Venmo app version 7.42.0: 1 in 100 with fix, 50 in 100 on 5.7.0, 1 in 100 on 5.5.0 Phone X, iOS 14.0.1, Venmo app version 8.8.0: 0 in 60 with fix (Venmo app force updated), 8 in 100 on 5.7.0, 1 in 100 on 5.5.0 iPhone X, iOS 14.0.1, Venmo app version 9.18.0: 1 in 50 with fix

Sounds like Venmo versions 7.42.0 and 8.8.0 were the ones y'all tested on and saw the error. I'm looking at Venmo's iOS app usage now, and they have 0 traffic on 7.42.0 and ~0.003% of users on 8.8.0.

jaxdesmarais commented 2 months ago

Sounds like Venmo versions 7.42.0 and 8.8.0 were the ones y'all tested on and saw the error. I'm looking at Venmo's iOS app usage now, and they have 0 traffic on 7.42.0 and ~0.003% of users on 8.8.0.

Yep, at the time Venmo did not have the ability to provide any older versions for us to test with, but we did observe this issue with the versions noted. Like you said since it's 2 years old general updates have likely mitigated this issue. Should we consider letting DoorDash know we are making this update? This was a P0 at the time so I feel like we should consider giving them some sort of heads up to ramp this update on their end.

scannillo commented 2 months ago

Another update - I checked with Venmo and v9.18.0 introduced the initial spinner/loading screen for the app switch checkout flow. I just dug into their app usage stats, and versions 9.17.0 and earlier make up the below % of active Venmo iOS users:

(this number will continue to decline over time)

Should we consider letting DoorDash know we are making this update? This was a P0 at the time so I feel like we should consider giving them some sort of heads up to ramp this update on their end.

This is a good call. Ideally we don't have to loop them in, since if we deem the traffic high enough, we will do our best to make sure this issue doesn't resurface. I am thinking about implementing a serial queue on the URLCache implementation and removing the timeout as a way to solve both problems.

scannillo commented 2 months ago

Just had a 💡 moment. main wasn't only adding 100ms of delay to each network request to v1/configuration, it was adding 100ms of delay every time fetchOrReturnRemoteConfig() is called, wether it was a cache hit or miss. That method is called 6 total times from instantiating a feature client --> ASWebSession.start() in the PP Vault flow. That's ... 600 ms added.

Safe to say, this PR should severely improve latency!

Also shoutout to @KunJeongPark - who helped with timing analysis that made this issue more obvious. 👏 The more folks we can have A/B test the two branches and add their result to our Excel doc, the better.