GoogleCloudPlatform / recaptcha-enterprise-mobile-sdk

Apache License 2.0

Failures during SDK initialization and when requesting assessment token. #21

Closed marosabal closed 1 year ago

marosabal commented 1 year ago

Describe the bug

We are getting the following error description: The operation couldn’t be completed. (com.google.recaptcha error 1.)

Integration Method

SDK Version (18.0.3):

To Reproduce

Steps to reproduce the behavior:

  1. Initialize the SDK or request an assessment token.

Expected behavior

Screenshots

SDK initialization:

Screenshot 2022-12-29 at 11 51 45

Assessment token request:

Screenshot 2022-12-29 at 11 51 54

Xcode version:

Device:

mcorner commented 1 year ago

What percentage of the time does this happen the first time and on retries? If it fails once does it always fail?

Any other patterns you have noticed? (OS, network conditions, devices etc)?

marosabal commented 1 year ago

For approximately 6.6% of users, SDK initialization fails the first time. Half of those fail again on retries. For approximately 1% of users, the assessment token request fails. Approximately 0.2% of users had internet connection problems.

Top devices: 14..., 13..., 12... Top iOS version: 16..., 15...

mcorner commented 1 year ago

Excellent, thanks for the details. We will dig into this.

Internal bug reference: b/264248461

mcorner commented 1 year ago

Also, there were some changes that hit prod around Dec. 20 that would improve init. LMK what dates your data is from.

mcorner commented 1 year ago

@marosabal Also, is it always the same error "The operation couldn’t be completed. (com.google.recaptcha error 1.)"? This corresponds to a network error.
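For client-side logging, it can help to separate this network error from other failures. The sketch below assumes the error bridges to `NSError` with the domain and code shown in the description above (`com.google.recaptcha`, code 1). The constant and function names are hypothetical and are not part of the SDK.

```swift
import Foundation

// Assumption: domain and code are read off the error description quoted in
// this thread, not taken from the SDK's public API.
let recaptchaErrorDomain = "com.google.recaptcha"
let recaptchaNetworkErrorCode = 1

/// Returns true when `error` matches the network error discussed above.
func isRecaptchaNetworkError(_ error: Error) -> Bool {
    let nsError = error as NSError
    return nsError.domain == recaptchaErrorDomain
        && nsError.code == recaptchaNetworkErrorCode
}
```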

marosabal commented 1 year ago

I have seen these two errors only:

Screenshot 2023-01-06 at 10 25 16

But most of them are related to a network error.

mcorner commented 1 year ago

Just to let you know we do see these errors in our logging and metrics at similar rates. We are still tracking down the source of the errors and may require a new SDK to help us dig a bit further. Thanks for your patience.

andrewjmeier commented 1 year ago

Hi @mcorner, do you have an update on this? We're running into it as well.

mcorner commented 1 year ago

Yes, this has been a big focus for us lately. What we are finding is that the vast majority of the issues are related to slow networking (or sometimes unusually slow devices). However, we are reporting those network errors or slowness as "Internal error", which isn't an accurate view of what is going on.

So two things are happening: a) the reporting of exceptions is going to change from "internal error" to networking error when that is the case and b) we are mitigating slow devices using a variety of methods. Both of these will show up as much lower rates of internal error.

Both of these are backend changes, but we put them into production in a slow and controlled way, so they will ramp up over the next two to three weeks. Most of our improvements don't come in the form of new SDK releases; they come from these backend changes.

In the meantime, we suggest wrapping calls to the SDK in retries. This is pretty much a given on mobile devices anyway, where we see an incredibly wide distribution of networking reliability and speeds.
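One way to wrap SDK calls in retries, as a minimal sketch: the helper below retries an async operation with exponential backoff. `withRetries`, its parameters, and the default values are illustrative assumptions. `operation` stands in for whatever init or execute call your app makes, and no SDK API is shown.

```swift
import Foundation

/// Retries `operation` up to `maxAttempts` times, doubling the delay after
/// each failure. All names are illustrative; `operation` stands in for an SDK call.
func withRetries<T>(
    maxAttempts: Int = 3,
    initialDelay: TimeInterval = 0.5,
    shouldRetry: (Error) -> Bool = { _ in true },
    operation: () async throws -> T
) async throws -> T {
    precondition(maxAttempts >= 1, "maxAttempts must be at least 1")
    var delay = initialDelay
    var attempt = 1
    while true {
        do {
            return try await operation()
        } catch {
            // Give up on the last attempt, or on errors the caller treats as permanent.
            if attempt >= maxAttempts || !shouldRetry(error) { throw error }
            // Wait before the next attempt, then double the wait.
            try await Task.sleep(nanoseconds: UInt64(delay * 1_000_000_000))
            delay *= 2
            attempt += 1
        }
    }
}
```

Passing a `shouldRetry` closure lets you retry only errors you consider transient, such as network errors, and fail fast on the rest.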

We are also planning to remove the built-in timeout from the SDK to allow you to make your own judgements about how long to wait.
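If the built-in timeout is removed, the waiting budget moves to the caller. A rough sketch of a caller-side timeout using structured concurrency follows. `withTimeout` and `TimeoutError` are hypothetical names, and the approach only returns early if the wrapped call responds to task cancellation, which is not something this thread confirms for the SDK.

```swift
import Foundation

/// Thrown when the wrapped operation does not finish in time (hypothetical type).
struct TimeoutError: Error {}

/// Runs `operation` and throws `TimeoutError` if it has not finished within `seconds`.
/// Caveat: the task group still waits for the cancelled child to exit, so a call
/// that ignores cancellation will delay the return.
func withTimeout<T: Sendable>(
    seconds: TimeInterval,
    operation: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        group.addTask { try await operation() }
        group.addTask {
            try await Task.sleep(nanoseconds: UInt64(seconds * 1_000_000_000))
            throw TimeoutError()
        }
        // The first child to finish (with a value or an error) decides the outcome.
        guard let result = try await group.next() else { throw TimeoutError() }
        group.cancelAll()
        return result
    }
}
```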

andrewjmeier commented 1 year ago

@mcorner any news on rolling this out more? We've had to turn off our recaptcha check server side because so many requests are failing. A more descriptive error message than "internal error" would be nice (we're logging over 5k of those a day) but even with retrying the request 3 times we still have about 1000 users a day who fail to successfully fetch a token.

mcorner commented 1 year ago

The good news is that there are several fixes rolling out now. If your issue is particular to iOS < 14, then 18.1.1 (released yesterday, but no release notes yet) will address that problem. If it is >= 14, there are two fixes rolling out. One, next week, will return a network error instead of an internal error when the SDK cannot contact our servers. The other is a massive speedup/improvement in execute that is rolling out now and will take 3-4 weeks to fully hit production traffic (we generally do everything slowly to detect regressions). I appreciate your patience while we bash our post-GA bugs.

andrewjmeier commented 1 year ago

Is there a way to get our account access to the rollout before it's 100%? The risk is super low on our end because we're not actually using the recaptcha tokens we generate right now since so many clients are failing to fetch one.

I'll go ahead and bump the version but iOS 14.0 is our minimum version.

mcorner commented 1 year ago

We don't have a way to do that ATM.

BTW, are you having errors mostly on init or execute?

There is another SDK change in the works for execute that addresses problems in low mem, cpu, or apps that have been backgrounded for a while. That should be out in a couple of weeks.

andrewescutia commented 1 year ago

@mcorner, regarding the retry recommendation: is this only to help combat device connectivity issues, or can a subsequent call actually provide a different response? I'm trying to determine how to approach it.

Thanks.

mcorner commented 1 year ago

@andrewescutia The retries are to combat connectivity issues, yes.

A subsequent call can succeed when a previous one failed.

Given the number of changes that have occurred in the SDK since this bug was originally posted, I am going to close it. I expect that it may be reopened but I think we can tackle it in a more specific way.

There was a release last week of v18.1.2 for iOS and we just finished the Android release v18.1.2 as well. Release notes have, or will, appear here: https://cloud.google.com/recaptcha-enterprise/docs/release-notes