getsentry / sentry-java

A Sentry SDK for Java, Android and other JVM languages.
https://docs.sentry.io/
MIT License
1.14k stars 432 forks source link

Ability to filter by "user-perceived" unhandled exceptions/fatal errors generated by the Kotlin SDK #3425

Open arifken opened 4 months ago

arifken commented 4 months ago

Problem Statement

Due to the behavior of coroutines exception handling, a failure to catch the exception can sometimes result in a crash, if the request was made from certain contexts, and can be swallowed when made from others.

In both of these cases, Sentry labels the error as level:fatal with exception.handled:false. However, this creates the possibility for a spike in "crashes" that do not actually crash the app for the user.

We recently encountered this issue with a android_getaddrinfo exception, where a user was trying to make a network connection while in airplane mode. What we saw is a huge spike in errors in Sentry, but when we went to Google Play, we were able to filter by user-perceived errors and see that the actual count of crashes for this issue was very very low.

It appears that Sentry's "Crash free user rate" takes into consideration fatal errors that are "user perceived" vs. not (the spike in errors did not impact our crash free user rate).

Solution Brainstorm

We'd love to be able to query by user_perceived:true in Discover, Dashboards, and Alerts (for error data sources) so that we can differentiate between user-perceived vs non-main thread crashes ourselves.

when specifying user_perceived:true, I would expect to only see crashes that resulted in the Android app closing, and contribute to the crash-free user rate shown by Sentry.

when specifying user_perceived:false level:fatal, I would expect to only see errors that are swallowed up by the parent coroutine and do not actually crash the app for the user.

Product Area

Issues

┆Issue is synchronized with this Jira Improvement by Unito

getsantry[bot] commented 4 months ago

Assigning to @getsentry/support for routing ⏲️

getsantry[bot] commented 4 months ago

Routing to @getsentry/product-owners-issues for triage ⏲️

anthonycr commented 4 months ago

I want to provide more context on this, as I was wrong when I initially suggested that it was the coroutines exception handler was the primary cause of this observability hole. I had more time to do debugging this week and fully identify the cause of the crash. Here's the scenario:

Let's say we have Activity A and Activity B. Activity A is the app's main activity, and Activity B is a secondary activity. A user action triggers a network request inside a coroutine in Activity B. That network request fails, and the coroutine crashes. That crash DOES get routed to the global coroutines exception handler. However, where I was wrong, is that the coroutine exception wasn't crashing the app but was getting routed to Sentry.

What is actually happening, is that because no local coroutine exception handler exists, the global coroutines exception handler re-throws the exception, and crashes Activity B. This is how Sentry is receiving the crash. However, the app itself does not terminate, so the Sentry session does not close, but remains open. The Android OS sees that this crash happened on Activity B, and was user triggered, and tries to recover by restarting Activity A. Since Activity A does not have a bug, the app is able to recover. In our case, since the error was sporadic and triggered by flaky network, the user usually did not re-trigger the crash, so the session never got ended, and the session crash free rate never went down.

This leads me to the question: Would it be possible to have two crash free rates? One which calculates the number of sessions that experienced a crash and a second one (the same as the current crash free rate) that calculates the number of sessions that ended with a crash. This would allow Sentry to provide the more nuanced context that the Play Store console shows.

vartec commented 4 months ago

I believe that to get this working in Issues we must first get it as a tag from the SDK, so I'm transferring this issue.

kahest commented 3 months ago

@anthonycr @arifken we report fatally crashed sessions as crashed, and sessions that experience (potentially multiple) non-fatal errors as errored. You should be able to see the "Error free session rate" by looking at the Release Details. Crashing coroutines should be counted towards errored sessions.

Also, can you confirm if this is about the Kotlin Multiplatform SDK?

romtsn commented 3 months ago

However, the app itself does not terminate, so the Sentry session does not close, but remains open.

I'm not sure if this is correct, because we actually finish the session if our exception handler encountered a crash, see here: https://github.com/getsentry/sentry-java/blob/84892a8db59d768e89a04a05439f5198c162301b/sentry/src/main/java/io/sentry/SentryClient.java#L540-L543

I'm still not sure what the original request of this issue is: would you like to distinguish between user-perceived crashes vs not, or would you like us to fix the coroutine exception handling? I'm a bit lost I guess, sorry

romtsn commented 3 months ago

also for the reference, this is what Google Play means by "user-perceived": https://developer.android.com/topic/performance/vitals/crash#android-vitals

arifken commented 3 months ago

The primary goal is to be able to distinguish between user-perceived crashes vs. not (ideally some sort of a query option we could use in discover, dashboards, and alerts)

kahest commented 3 months ago

You can query for fatal crashes happening while the app was in foreground (which is the main factor used to define user perceived) like so: level:fatal app.in_foreground:True. Does this help?

anthonycr commented 3 months ago

I'm not sure if this is correct, because we actually finish the session if our exception handler encountered a crash, see here

we were seeing a scenario where the crash was actually happening while the app was in the foreground, but the application itself was not terminated. that is, Activity 2 was stacked on top of Activity 1, and a user interaction caused a crash in Activity 2, which triggered Activity 1 to restart. For some reason, it seems that is not treated as a "user perceived" crash.

Maybe there's a way for us to create a sample project to demonstrate the issue, would that be helpful? The google docs on user perceived crashes seem clear, but I think they're obfuscating what the Android OS is actually doing.

You can query for fatal crashes happening while the app was in foreground (which is the main factor used to define user perceived) like so: level:fatal app.in_foreground:True. Does this help?

I'll defer to @arifken on this.

romtsn commented 3 months ago

Maybe there's a way for us to create a sample project to demonstrate the issue, would that be helpful? The google docs on user perceived crashes seem clear, but I think they're obfuscating what the Android OS is actually doing.

Yeah, a sample project is always welcome, would definitely help us investigate, thank you!

markushi commented 3 months ago

We definitely can look into the new-ish behavior when multiple activities are on the stack and a crash just closes the topmost one, or in other words the app is not terminated.

anthonycr commented 2 months ago

Attached is a zip file of the test case. The example I created is an app consisting of 2 activities.

MainActivity has 2 buttons, a Crash button, and a Second Screen button. The SecondaryActivity has a Crash button. Both Crash buttons do the same thing, invoke the kotlin function error("oops"), which will throw an exception on the current thread. The Second Screen button will open SecondaryActivity.

When the Crash button is clicked on MainActivity, the app will close and show that it's "crashed." The app process is terminated, and is not restarted (Check logcat for PROCESS ENDED and PROCESS STARTED). This presumably ends the session and is counted as a user visible crash.

When the Crash button is clicked on SecondaryActivity, the app will not close. Instead, the screen goes blank for a second and does not show the app crash dialog. Instead, what happens is that the MainActivity is shown again. Looking in logcat, we can see that the app process is terminated, and immediately recreated without any user intervention. I think this crash is being reported as a background crash. What's interesting is that the behavior is different if I trigger crash 1 before crash 2. If I trigger crash 1 first, then crash 2 seems to terminate the app without restarting it.

Crash.zip

kahest commented 2 months ago

related product issue: https://github.com/getsentry/sentry/issues/73523