firebase / firebase-android-sdk

Firebase Android SDK
https://firebase.google.com
Apache License 2.0
2.28k stars 579 forks source link

Firelog sdk is using JobScheduler which is not recommended now #6147

Closed mrober closed 1 month ago

mrober commented 3 months ago

From @ankit92:

com.google.android.datatransport/transport-runtime library is using JobScheduler which is not recommended now as mentioned in android documentation: https://developer.android.com/about/versions/14/behavior-changes-14

_JobScheduler reinforces callback and network behavior Since its introduction, JobScheduler expects your app to return from onStartJob or onStopJob within a few seconds. Prior to Android 14, if a job runs too long, it stops and fails silently. If your app targets Android 14 (API level 34) or higher and exceeds the granted time on the main thread, the app triggers an ANR with the error message "No response to onStartJob" or "No response to onStopJob". Consider migrating to WorkManager, which provides support for asynchronous processing or migrating any heavy work into a background thread._

All ANR's which spiked in our app are on Android 14 , as mentioned in above chat threads, and mostly related JobScheduler.

Strangely 75% of events are on OnePlus devices and rest 25% are on OPPO devices

Here are the logs:

art::ConditionVariable::WaitHoldingLocks ANR triggered by thread waiting for a binder transaction

main (native) tid=1 systid=30996 Triggered ANR Root blocking 0 libc.so syscall + 28 1 libart.so art::ConditionVariable::WaitHoldingLocks + 140 2 libart.so artJniMethodEnd + 336 3 libart.so art_jni_method_end + 12 android.os.BinderProxy.transactNative (Native method) This Binder call may be taking too long, causing the main thread to wait, and triggering the ANR. android.os.BinderProxy.transact (BinderProxy.java:685) android.app.job.IJobCallback$Stub$Proxy.acknowledgeStartMessage (IJobCallback.java:434) android.app.job.JobServiceEngine$JobHandler.ackStartMessage (JobServiceEngine.java:384) android.app.job.JobServiceEngine$JobHandler.handleMessage (JobServiceEngine.java:196) android.os.Handler.dispatchMessage (Handler.java:106) android.os.Looper.loopOnce (Looper.java:257) android.os.Looper.loop (Looper.java:368) android.app.ActivityThread.main (ActivityThread.java:8839) java.lang.reflect.Method.invoke (Native method) com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run (RuntimeInit.java:572) com.android.internal.os.ZygoteInit.main (ZygoteInit.java:1049)

This is becoming more critical now and shaken all our app's performance metrics in terms of ANR's

Raised this to help you so that you can get more insights and request you to please provide us with the fix as soon as possible . Thanks

Originally posted by @ankit92 in https://github.com/firebase/firebase-android-sdk/issues/4345#issuecomment-2262159437

davidmotson commented 3 months ago

This is being worked on along with other improvements for API level 34 (android 14), and it should be rolled out across all Firebase Android SDKs soon.

stepushchik commented 3 months ago

The same:

         main (native):tid=1 systid=20068 
#00 pc 0x99ccc libc.so (syscall + 28) (BuildId: a87e89fc2c0a5753053f536add4d7ae1)
#01 pc 0x232670 libart.so (art::ConditionVariable::WaitHoldingLocks + 140) (BuildId: 2452917c4ff69cbb6e75e5512260946b)
#02 pc 0x45a5e4 libart.so (artJniMethodEnd + 336) (BuildId: 2452917c4ff69cbb6e75e5512260946b)
#03 pc 0x5be67c libart.so (art_jni_method_end + 12) (BuildId: 2452917c4ff69cbb6e75e5512260946b)
       at android.os.BinderProxy.transactNative(Native method)
       at android.os.BinderProxy.transact(BinderProxy.java:685)
       at android.app.job.IJobCallback$Stub$Proxy.acknowledgeStartMessage(IJobCallback.java:434)
       at android.app.job.JobServiceEngine$JobHandler.ackStartMessage(JobServiceEngine.java:384)
       at android.app.job.JobServiceEngine$JobHandler.handleMessage(JobServiceEngine.java:196)
       at android.os.Handler.dispatchMessage(Handler.java:106)
       at android.os.Looper.loopOnce(Looper.java:257)
       at android.os.Looper.loop(Looper.java:368)
Screenshot 2024-08-20 at 13 50 31
ankit92 commented 3 months ago

This is being worked on along with other improvements for API level 34 (android 14), and it should be rolled out across all Firebase Android SDKs soon.

@davidmotson Is there any schedule or planned date for next Firebase Android BOM release which includes these fixes? These ANR's are becoming critical now for our app.

ankit92 commented 3 months ago

@davidmotson @mrober Does this issue got fixed in BOM version 33.2.0?

mandeep1999 commented 3 months ago

Is there any timeline when the fix for this will be released?

RelappsStudio commented 2 months ago

Same issue here:

art::ConditionVariable::WaitHoldingLocks
ANR triggered by thread waiting for a binder transaction

          main (native):tid=1 systid=17264 
#00 pc 0x99ccc libc.so (syscall + 28) (BuildId: a87e89fc2c0a5753053f536add4d7ae1)
#01 pc 0x23247c libart.so (art::ConditionVariable::WaitHoldingLocks + 140) (BuildId: 4801adadf1dae7e020ba05f3204efc9c)
#02 pc 0x45b218 libart.so (artJniMethodEnd + 336) (BuildId: 4801adadf1dae7e020ba05f3204efc9c)
#03 pc 0x5bef7c libart.so (art_jni_method_end + 12) (BuildId: 4801adadf1dae7e020ba05f3204efc9c)
       at android.os.BinderProxy.transactNative(Native method)
       at android.os.BinderProxy.transact(BinderProxy.java:685)
       at android.app.job.IJobCallback$Stub$Proxy.acknowledgeStartMessage(IJobCallback.java:434)
       at android.app.job.JobServiceEngine$JobHandler.ackStartMessage(JobServiceEngine.java:384)
       at android.app.job.JobServiceEngine$JobHandler.handleMessage(JobServiceEngine.java:196)
       at android.os.Handler.dispatchMessage(Handler.java:106)
       at android.os.Looper.loopOnce(Looper.java:257)
       at android.os.Looper.loop(Looper.java:368)
       at android.app.ActivityThread.main(ActivityThread.java:8815)
       at java.lang.reflect.Method.invoke(Native method)
       at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:572)
       at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1049)

Surprisingly only on android 14, only ColorOS supported devices

image

Any bypass suggestions before fix? Anyone managed to find some workaround?

stepushchik commented 2 months ago

Relates to https://developer.android.com/jetpack/androidx/releases/work#2.10.0-alpha03

  • Fix foreground workers of type ‘short service’ and ‘data sync’ timing out and causing an ANR when WorkManager didn’t call stopSelf(). This fix only applies to devices with API 34 and 35 where foreground service types were introduced. (ca06b2, b/364508145)
Krishnasony commented 2 months ago

Relates to https://developer.android.com/jetpack/androidx/releases/work#2.10.0-alpha03

  • Fix foreground workers of type ‘short service’ and ‘data sync’ timing out and causing an ANR when WorkManager didn’t call stopSelf(). This fix only applies to devices with API 34 and 35 where foreground service types were introduced. (ca06b2, b/364508145)

Is this Fixed The background ANR coming on Android 14?

davidmotson commented 2 months ago

@stepushchik @RelappsStudio @mandeep1999

Can you provide more details on the error you're experiencing? For example:

Are there other, similar ANRs being reported for your app? (same error, binder transaction, but a different stack trace) Have you been able to reproduce these personally? If so, under what conditions? Are you also seeing this mainly/exclusively on oppo/one+ devices?

Gowthamr2130 commented 2 months ago

Hi @davidmotson Any Update on this: https://github.com/firebase/firebase-android-sdk/issues/6264

And let us know if there are any temporary solutions for this.

Deepakkast commented 2 months ago

any update on the above

Screenshot 2024-09-19 at 3 18 41 PM
lehcar09 commented 2 months ago

Hi folks! sharing this update here from the original issue (#4345).

From mrober:

Hey everyone,

Sorry for the delay. We have been working on this, and have landed large changes to the SDK.

Here is a beta version of Crashlytics SDK 19.2.0 which resolves several of the issues that have been causing ANRs. This version is different from 19.1.0 which will be released in the next day or so. 19.1.0 only includes some of these changes, but has gone through more validation.

If you want to try the 19.2.0 beta, follow these steps:

Download the artifact m2repository.zip and extract it to a local repo directory.

Add something like the following to your app's gradle build files:

// settings.gradle.kts
dependencyResolutionManagement {
  repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
  repositories {
    maven(uri("/path/to/local/repo/"))
    google()
    mavenCentral()
  }
}
// app/build.gradle.kts
dependencies {
  implementation("com.google.firebase:firebase-crashlytics:19.2.0-beta01")
  // ...
}

The CHANGELOG is:

* [feature] Added the `isCrashlyticsCollectionEnabled` API to check if Crashlytics collection is enabled.
* [fixed] Ensure that on-demand fatal events are never processed on the main thread.
* [fixed] Improved data consistency for rapid user actions.
* [changed] Internal changes to improve startup time.
* [changed] Internal changes to the way session IDs are generated.
* [changed] Internal changes to the way background tasks are scheduled.
* [changed] Migrated SDK to use standard Firebase executors.

If you don't want to try the beta, you can wait for the stable 19.1.0 release in the next few days, which includes some of these changes. The full 19.2.0 release is undergoing further internal validation.

JonathanKranz commented 2 months ago

WX20240924-101502@2x WX20240924-101522@2x WX20240924-101536@2x

Deepakkast commented 1 month ago

Any update on the above

dhruvkaushal11 commented 1 month ago

Hi Team, we are facing this same issue, ie on Android 14 & numbers are in millions. Can someone share the expected ETA for this?

hakanbagci commented 1 month ago

Hi Team,

This issue has become our top ANR and it is mostly happening for Android 14 Oppo Devices.

Screenshot 2024-10-04 at 09 41 52 Screenshot 2024-10-04 at 09 42 09

@mrober Could you please share an expected ETA about the fix if possible?

stepushchik commented 1 month ago

Relates to https://developer.android.com/jetpack/androidx/releases/work#2.10.0-alpha03

  • Fix foreground workers of type ‘short service’ and ‘data sync’ timing out and causing an ANR when WorkManager didn’t call stopSelf(). This fix only applies to devices with API 34 and 35 where foreground service types were introduced. (ca06b2, b/364508145)

Is this Fixed The background ANR coming on Android 14?

Unfortunately, ANRs still present after updating to androidx.work:work-runtime:2.10.0-alpha04 and com.google.firebase:firebase-crashlytics:19.2.0

Please check in your project if you can.

khaykov commented 1 month ago

@stepushchik Fix for Android 14 specific ANR's is not a part of 19.2.0, according to this, so we have to wait.

ashishkrishnan commented 1 month ago

Folks do we have ETA for the fix? Will this be fixed on the Firebase libraries?

davidmotson commented 1 month ago

After some investigation here are the results:

  1. Previous to Android 14, these app closures still occurred, but where not reported as ANRs, so while it may be worrying to see this ANR spike in Android 14, your user's behavior is not degraded.
  2. This ANR is also not caused by hitting the timeout behavior change in Android 14, Firelog's use of JobScheduler does very little on the main thread, and in a test we ran swapping to WorkManager didn't improve the situation as WorkManager is implemented on top of JobScheduler, however performance was heavier weight, so we decided to not ship that change.
  3. The vast majority (>99%) of these ANRs are background ANRs. It is important to note that background ANRs are not user visible. The user experience of a background ANR is that the app takes slightly longer to turn on the next time they launch it. Also, background ANRs are relatively common, because a backgrounded app will be running with very limited access to memory and CPU time.
khaykov commented 1 month ago

@davidmotson While the user experience might not be visibly affected, Google Play will penalize the apps with excessive crash rates, which this issue significantly contributes to. I believe it should not be closed, but fixed instead.

davidmotson commented 1 month ago

@davidmotson While the user experience might not be visibly affected, Google Play will penalize the apps with excessive crash rates, which this issue significantly contributes to. I believe it should not be closed, but fixed instead.

According to this only user perceived ANRs should penalize your app.

Deepakkast commented 1 month ago

this should be fix otherwise it will create performance issue on play console

hakanbagci commented 1 month ago

@davidmotson While the user experience might not be visibly affected, Google Play will penalize the apps with excessive crash rates, which this issue significantly contributes to. I believe it should not be closed, but fixed instead.

According to this only user perceived ANRs should penalize your app.

Hi @davidmotson,

Unfortunately a significant number of this issue is user perceived ANR. Please check the screenshot from Google Play Console below. This screenshot clearly indicates that there are significant number of user perceived ANRs from this issue.

Having this issue happening before Android 14 but not being reported should not be a reason for closing the issue. I think our take should be the opposite. Since the issue is identified and has clear affect on the users, we should plan to fix it.

There are 14 down votes to the message for the decision of not fixing the issue. This clearly shows that there is a clear frustration from this ANR for several products/teams. For our 9 applications on Google Play with millions of users, this ANR is always in top 3 in Google Play user perceived ANRs.

Could you please reassess the issue and open it again?

Screenshot 2024-10-18 at 23 27 59
ashishkrishnan commented 1 month ago

It is important to note that background ANRs are not user visible. The user experience of a background ANR is that the app takes slightly longer to turn on the next time they launch it.

@davidmotson Finding this rationale very odd - given the experience of the users will get impacted. Cold start/warm starts will be high and is not a good experience. I am not just worried about the metric, penalty and indirect cost in loss of user base - I am worried about the experience it has on end users.

Open to brainstorm ideas for a fix. Could you give more details on the challenges?

JonathanKranz commented 1 month ago

Hi @davidmotson

I carefully read the information provided by @hakanbagci, I also encountered this problem, and it is TOP 2 Firebase's vision is to help developers make better applications, but now it seems that this is not the case; this problem has obviously affected the quality and exposure of the App. The Google Player market will punish these applications, but now it is so careless, which goes against the original intention of the Firebase team; if this is the case, developers have to choose other alternatives. WX20241021-104553@2x

urbeastfriend commented 1 month ago

Same stuff

dhruvkaushal11 commented 1 month ago

9.04K ANR and User Perceived.

image
davidmotson commented 1 month ago

@hakanbagci @JonathanKranz @dhruvkaushal11 @urbeastfriend

This would definitely be new information. Can you provide more details? I notice in your screenshot the commonality of this ANR is in ART, is there a significant presence of datatransport in reporting stacktraces as well?

@JonathanKranz @hakanbagci In your previous screenshots I see some references to JobScheduler, but nothing firebase related, is there more information you can provide?

ankit92 commented 1 month ago

Hi folks! sharing this update here from the original issue (#4345).

From mrober:

Hey everyone, Sorry for the delay. We have been working on this, and have landed large changes to the SDK. Here is a beta version of Crashlytics SDK 19.2.0 which resolves several of the issues that have been causing ANRs. This version is different from 19.1.0 which will be released in the next day or so. 19.1.0 only includes some of these changes, but has gone through more validation. If you want to try the 19.2.0 beta, follow these steps: Download the artifact m2repository.zip and extract it to a local repo directory. Add something like the following to your app's gradle build files:

// settings.gradle.kts
dependencyResolutionManagement {
  repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
  repositories {
    maven(uri("/path/to/local/repo/"))
    google()
    mavenCentral()
  }
}
// app/build.gradle.kts
dependencies {
  implementation("com.google.firebase:firebase-crashlytics:19.2.0-beta01")
  // ...
}

The CHANGELOG is:

* [feature] Added the `isCrashlyticsCollectionEnabled` API to check if Crashlytics collection is enabled.
* [fixed] Ensure that on-demand fatal events are never processed on the main thread.
* [fixed] Improved data consistency for rapid user actions.
* [changed] Internal changes to improve startup time.
* [changed] Internal changes to the way session IDs are generated.
* [changed] Internal changes to the way background tasks are scheduled.
* [changed] Migrated SDK to use standard Firebase executors.

If you don't want to try the beta, you can wait for the stable 19.1.0 release in the next few days, which includes some of these changes. The full 19.2.0 release is undergoing further internal validation.

@davidmotson @lehcar09 @mrober , yesterday Firebase BOM - 33.5.0 is released that also includes crashlytics NDK version 19.2.1. Does this release contains fix for Android 14 ANR's as mentioned in above comment ?

hakanbagci commented 1 month ago

@hakanbagci @JonathanKranz @dhruvkaushal11 @urbeastfriend

This would definitely be new information. Can you provide more details? I notice in your screenshot the commonality of this ANR is in ART, is there a significant presence of datatransport in reporting stacktraces as well?

@JonathanKranz @hakanbagci In your previous screenshots I see some references to JobScheduler, but nothing firebase related, is there more information you can provide?

Hi @davidmotson,

After some more investigation I found out that the main thread is blocked because QueuedWork.processPendingWork is forcing SharedPrefencesImpl to process waiting preferences persistence task to be done in sync. Here is an example from Google Play stack trace for that kind of ANR.

Hypothesis:

Large data operations or complex object serialization within Shared Preferences can lead to significant processing time on the main thread, especially when apply() is used incorrectly or excessively. apply() asynchronously writes to disk, but QueuedWork.processPendingWork might force these pending writes to be completed synchronously (waiting them to complete with a lock) when activity is paused etc, leading to the observed ANRs.

Following this hypothesis maybe we can ask the question: Does Firelog or Firebase Crashlytics rely heavily on Shared Preferences for internal operations, such as caching log data or crash reports? Could their internal usage patterns be contributing to the ANR occurrences?

Screenshot 2024-10-22 at 14 12 21
davidmotson commented 1 month ago

@hakanbagci Thank you for you response. I don't think anything in this strack trace points to something in firebase being an issue. Firelog and Crashlytics don't use SharedPreferences in a heavyweight way. Our code is also open sourced, so if you want to use that to verify assumptions or otherwise assist your debugging process definitely feel free to do so. Regardless, it seems to have nothing to do with JobScheduler, so if you find more info on this please open another issue.

@ankit92 Please see this comment for an update.

hakanbagci commented 1 month ago

@hakanbagci Thank you for you response. I don't think anything in this strack trace points to something in firebase being an issue. Firelog and Crashlytics don't use SharedPreferences in a heavyweight way. Our code is also open sourced, so if you want to use that to verify assumptions or otherwise assist your debugging process definitely feel free to do so. Regardless, it seems to have nothing to do with JobScheduler, so if you find more info on this please open another issue.

@ankit92 Please see this comment for an update.

Opened a new issue for firebase performance. It looks like this issue (at least the issue that I shared the stack trace) is most probably related to trace function from firebase performance where each trace function call is causing shared preference to write to a file. This should work on IO in theory, but QueuedWork forces it to work on main thread in some scenarios like activity is stopped and there are shared preferences waiting to be committed. In our app, firebase performance was excessively calling shared preferences apply function (In 5 seconds almost ~450 times, our app start to interactive flow). Then the pending apply tasks where being executed in serial when activity is stopped. So the trace function itself is not performant and has a side affect on performance.

Please check if this is the case for you as well. https://github.com/firebase/firebase-android-sdk/issues/6407

mikelhm commented 2 weeks ago

this is perceptible ANR in GP,

there are two cases that use JobScheduler:

1)com.google.android.datatransport.runtime.scheduling.jobscheduling.JobInfoScheduler 2)com.google.firebase:firebase-analytics -> com.google.android.gms:play-services-measurement->zzks.class AppMeasurementJobService->JobScheduler"

Firebase should do something instead of close this issue @davidmotson