braintree / braintree_ios

Braintree SDK for iOS
https://developer.paypal.com/braintree/docs/start/hello-client/ios/v5
MIT License
556 stars 290 forks source link

[QL] Investigate Triple v1/config call #1308

Closed KunJeongPark closed 2 weeks ago

KunJeongPark commented 1 month ago

Description

The SDK is sometimes making 3 sequential network calls to v1/configuration. This only happens when you create an APIClient, PayPalClient, and then tokenize() all immediately in sequence of one another (on button click). Technically, our docs snippet shows instantiating the BTAPIClient earlier than button click (ex: viewDidLoad). But ideally our SDK can be more performant if merchants aren’t reading our docs. Also, our docs can better call out this recommendation.

I was able to reproduce this issue by instantiating APIClient immediately prior to checkout flow. If the configuration is not cached from earlier instantiation ofAPIClient, APIClient instantiation, tokenize call and sendAnalyticsEvent try to call v1/configuration endpoint since network calls are made in quick succession asynchronously without configuration having been added to cache yet.

This is what I saw:

I considered using actor for ConfigurationCache but I don't know of a way of handling this in conjunction with BTHTTP get call in the fetchOrGetRemoteConfiguration function which returns via completion handler and is utilized throughout the code base. Also, the issue is that network calls get made in quick succession before result is written into the cache (this only happens if merchants don't adhere to our recommendation of instantiating BTAPIClient early on in their application), so having serialized access in the ConfigurationCache would not solve the issue.

I went the route of adding a loading state for v1/configuration and adding completion handlers if more calls were made during loading state and on network call return, returning the same result to completion handlers in the array of completion handlers.

Commit 2, I considered just putting sync serial queue around just cache writes but realized that the mutable properties like array and isLoading state were not thread safe.

Commit 3 I used concurrent queue for outer block of fetchOrGetRemoteConfiguration function (for concurrent reads since it should only write into cache one time during session) and barrier option for tasks that any write to mutable properties

TODO:

Checklist

Authors

List GitHub usernames for everyone who contributed to this pull request.

scannillo commented 1 month ago

Some high level background:

The long-term goal is to make BTAPIClient less of a "god" object, since it currently does so many things. Android sets a good pattern for this, with a dedicated ConfiguationLoader class.

This PR could be a good excuse to move to that pattern, since now we're adding even more logic specific to getting a config from either the cache or the network.

The long-term goal also is to mirror Android's separation of BTHTTP, as well:

scannillo commented 1 month ago

Another comment re: tests for this change. Can we add a unit test that fails against main? This way we know that this change is properly addressing the issue (the test would then pass with the new changes).

scannillo commented 1 month ago

I think there's several ways we can address this issue:

  1. At the HTTP / URLSession level. If we have an incoming URLSessionDataTask request incoming and detect that a matching one is already pending progress, cancel the incoming request.
    • This one will prevent this issue from happening with other HTTP requests (beyond just the config). Is that something we should consider?
  2. At the BTAPIClient / ConfigurationLoader level like you're doing here.
  3. Our SDK avoids calling fetchOrReturnRemoteConfiguration so many times in sequence, and move to a dependency injection model for the BTConfiguration where possible (for ex: to initialize a BTAnalyticsService).

❓ - does Android have this same issue if a merchant initializes there client & calls tokenize() all upon button click? If so, does the same issue arise? Asking b/c the more we can align the better.

KunJeongPark commented 1 month ago

I think there's several ways we can address this issue:

  1. At the HTTP / URLSession level. If we have an incoming URLSessionDataTask request incoming and detect that a matching one is already pending progress, cancel the incoming request.

    • This one will prevent this issue from happening with other HTTP requests (beyond just the config). Is that something we should consider?
  2. At the BTAPIClient / ConfigurationLoader level like you're doing here.
  3. Our SDK avoids calling fetchOrReturnRemoteConfiguration so many times in sequence, and move to a dependency injection model for the BTConfiguration where possible (for ex: to initialize a BTAnalyticsService).

❓ - does Android have this same issue if a merchant initializes there client & calls tokenize() all upon button click? If so, does the same issue arise? Asking b/c the more we can align the better.

I will look into Android code to see if they have same issue.

KunJeongPark commented 1 month ago

Another comment re: tests for this change. Can we add a unit test that fails against main? This way we know that this change is properly addressing the issue (the test would then pass with the new changes).

I just did a quick test on local branch where I instantiated the BTAPIClient on button click along with payment client and call to tokenize method but I will write a proper unit test.

sshropshire commented 1 month ago

@KunJeongPark @scannillo I could see this being an issue on Android too. We don't have logic to guard against multiple calls for configuration fetching before it's cached, so it's possible both iOS and Android have this issue.

scannillo commented 1 month ago

@KunJeongPark @scannillo I could see this being an issue on Android too. We don't have logic to guard against multiple calls for configuration fetching before it's cached, so it's possible both iOS and Android have this issue.

Definitely agree, I think digging into Android will be useful to do now! This way we can pick a solution that we can align as much as possible. Rich is adding the same network timing analytics to Android in DTMOBILES-625, which will be helpful for us to compare against iOS.

For some background - here's where the iOS numbers are currently at. Note, these analytics were added in a recent version so we only have a few merchants sending them (~11). But that # of events for v1/configuration is way too high, given the # of calls to the paypal_hermes endpoints. This looks like a bug in our SDK, and not just the merchant integration edge case Victoria's PR summary mentions.

Screenshot 2024-06-06 at 12 32 49 PM

(Percentiles are the the p95 value for total time in milliseconds the SDK spends waiting for these API calls to complete).

sshropshire commented 1 month ago

@scannillo oh wow yeah that's a lot. When we fix this bug we'll probably reduce a ton of gateway traffic. That'll be a big win.

KunJeongPark commented 3 weeks ago

@KunJeongPark @scannillo I could see this being an issue on Android too. We don't have logic to guard against multiple calls for configuration fetching before it's cached, so it's possible both iOS and Android have this issue.

Definitely agree, I think digging into Android will be useful to do now! This way we can pick a solution that we can align as much as possible. Rich is adding the same network timing analytics to Android in DTMOBILES-625, which will be helpful for us to compare against iOS.

For some background - here's where the iOS numbers are currently at. Note, these analytics were added in a recent version so we only have a few merchants sending them (~11). But that # of events for v1/configuration is way too high, given the # of calls to the paypal_hermes endpoints. This looks like a bug in our SDK, and not just the merchant integration edge case Victoria's PR summary mentions.

Screenshot 2024-06-06 at 12 32 49 PM

(Percentiles are the the p95 value for total time in milliseconds the SDK spends waiting for these API calls to complete).

We don't know if this is an edge case integration though. And were the above stats were from adding fetchAPITiming function? I think I saw recursion paths from there. Injecting Config object to BTAnalyticsService, removing call to fetchOrGetRemoteConfig in sendAnalyticsEvent takes care of this recursion issue if that's what it is, but we need handling of quick config fetch calls with some locking, waiting mechanisms like on this PR.

I am thinking of three different integration patterns for BTAPIClient: Scenarios #1 BTAPIClient instantiation used for different payment clients, instantiated earlier (maybe somewhere out there) Scenario #2 BTAPIClient instantiated before paymentClient instantiation. these are instantiated before tokenize function. (on main) Scenario #3 BTAPIClient instantiation, paymentClient instantiation, tokenize called all at once (shown on branch on this PR: https://github.com/braintree/braintree_ios/pull/1333)

KunJeongPark commented 3 weeks ago

I'm going to continue on the spike on this other branch (https://github.com/braintree/braintree_ios/pull/1333) because there are issues on main with the demo app after shopper insights feature branch merge. I will move this implementation into ConfigLoader class. But on local testing, I am not seeing those monster numbers. Maybe 3 - 5 calls to v1/config at once if APIClient is instantiated on button press.

Maybe I need to test longer periods for cache expiration. Just thinking about how many checkout sessions people would have in a single app session.

After seeing some results, I can move over stuff to another branch.

KunJeongPark commented 3 weeks ago

Another comment re: tests for this change. Can we add a unit test that fails against main? This way we know that this change is properly addressing the issue (the test would then pass with the new changes).

Addressed here: c743823731d4ee902ae3f9240589265d30632017

jaxdesmarais commented 2 weeks ago

👋 Sorry it's taken some time to review this. I have a couple of potential concerns with this approach:

  1. configCompletionHandlers - while I think we've proven this technically fixes the issue, the solution at least for me to have an array of completions feels a bit odd. It seems like there could be an instance where we have a mix of configs and errors in the array that could lead to some unexpected behavior. => I don't think so. The array should return the same result from the barrier block.
  2. Mixing of queues - this PR introduces a fetchConfigQueue, calls to the main queue, and calls to a barrier flagged queue which to me is a bit of a code smell. We ideally don't need to wrap queues in other varying queues to have things behave as expected or can just have 1 blocking queue in 1 place to stop the execution of the configuration fetch until the first one is returned. => The barrier flag is on the same queue as the queue that wraps the entire function. The barrier block in addition to the serial queue is because the network call is async, calls can still finish, kicking off network call and start another call to the function before result is returned without the barrier. The BTHTTP's get function returns results via main in handleRequestCompletion -> callCompletionAsync function. The dispatchQueue there is defaulted to main and I don't see anywhere we override it. So I thought having results return via different thread in the block outside barrier block would result in something unexpected for the merchant's apps.

I think if we take the approach of using a ConfigurationLoader similar to Android we can resolve the two issues above. I do think the test testFetchOrReturnRemoteConfiguration_with_multiple_calls here will be great for confirming we solve this same problem when we implement the ConfigurationLoader.

=> Yes, I agree with implementing ConfigurationLoader. My thinking was that we should still address possible concurrency issues in ConfigurationLoader as well. And difficulty of implementing prefetch of config with ConfigurationLoader.

=>If we are good with ignoring prefetch and consider possible increased delay in some merchant integrations, I just wanted to call that out. I was concerned about separating out different things we are changing and measuring. I do think ConfigurationLoader without prefetch is the cleanest, simplest approach. Just wanted to note the change other than stopping multiple calls.

KunJeongPark commented 2 weeks ago

sorry @jaxdesmarais didn't mean to edit your comment, I meant to reply to. I prepended my comments with "=>"

scannillo commented 2 weeks ago

=> Yes, I agree with implementing ConfigurationLoader. My thinking was that we should still address possible concurrency issues in ConfigurationLoader as well. And difficulty of implementing prefetch of config with ConfigurationLoader.

I think we should solve this issue in the ConfigurationLoader and not in the BTAPIClient. BTAPIClient already is a "god" object and does too many things. We want to achieve higher separation of concerns.

Here's what I'm thinking for steps forward:

  1. Step 1
    • Refactor existing logic as-is within fetchOrReturnRemoteConfiguration into a ConfigurationLoader class.
    • Model Android (constructor takes BTHTTP & load() method takes Authorization param).
    • Keep public function signature for BTAPIClient.fetchOrReturnRemoteConfiguration() the same so feature clients don’t have to change.
    • BTAPIClient.fetchOrReturnRemoteConfiguration() will call out to the ConfigurationLoader.getConfig(authorization)
  2. Step 2
    • Within ConfigurationLoader, implement blocking mechanism to prevent duplicate outbound v1/config GET requests from happening at the same time.