getsentry / sentry

Developer-first error tracking and performance monitoring
https://sentry.io

Crashed Session Rate calculation #72420

Open kerenkhatiwada opened 3 weeks ago

kerenkhatiwada commented 3 weeks ago

Environment

SaaS (https://sentry.io/)

Steps to Reproduce

  1. Calculate the number of unhandled errors
  2. Divide the unhandled errors by the total number of sessions and multiply by 100, e.g. 262 / 3054 ≈ 0.0857, × 100 ≈ 8.57%
  3. Observe that the release page shows 5.894% for the Crashed Session Rate. The release is named in the Jira ticket. (Two screenshots from 2024-06-10 were attached.)
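The arithmetic in the steps above can be reproduced with a short sketch. The two counts and the reported rate are taken from this report; the helper name `rate_percent` is made up for illustration and is not part of any Sentry API. Note that the exact quotient rounds to 8.58%; the 8.57 quoted in this report is the same value truncated.

```python
def rate_percent(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


unhandled_errors = 262  # unhandled error events, from this report
total_sessions = 3054   # total sessions, from this report
reported = 5.894        # Crashed Session Rate shown on the release page

calculated = rate_percent(unhandled_errors, total_sessions)
gap = calculated - reported

# Number of sessions the reported rate would correspond to,
# assuming it is computed over the same 3054 sessions.
implied_crashed_sessions = reported / 100.0 * total_sessions

print(f"calculated rate: {calculated:.2f}%")
print(f"reported rate:   {reported:.3f}%")
print(f"gap:             {gap:.2f} percentage points")
print(f"implied crashed sessions: {implied_crashed_sessions:.0f}")
```

If the reported rate is computed over the same session count (an assumption), it corresponds to roughly 180 crashed sessions, compared with 262 error events.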

Expected Result

The Crashed Session Rate would be 8.57%

Actual Result

The Crashed Session Rate is 5.894%

Product Area

Releases

Link

No response

DSN

No response

Version

No response

┆Issue is synchronized with this Jira Improvement by Unito

getsantry[bot] commented 3 weeks ago

Assigning to @getsentry/support for routing ⏲️

getsantry[bot] commented 3 weeks ago

Routing to @getsentry/product-owners-releases for triage ⏲️

jonbooth-serato commented 3 weeks ago

Thanks @kerenkhatiwada for raising this issue on my behalf.

One thing I did notice is that the docs suggest that using Crashpad on macOS can make the release health stats look odd. We are using Crashpad on macOS (and on Windows too), but unless the way it affects the stats is somewhat non-deterministic, it doesn't account for the anomaly, as over 90% of crashes in this release happen to be on macOS.

schew2381 commented 3 weeks ago

Hi there, errors and crashed sessions do not correlate 1-1 so it's expected that the Crashed Session Rate does not equal that calculation.

This doc goes over some of the nuance, and for more information you can see the page specifically for Release Health which goes over crash-free session rates.

If you have any more questions, feel free to msg again in this issue!

jonbooth-serato commented 3 weeks ago

Thanks @schew2381, I've read all that documentation, but none of it really answers my question, as it's vague about what those numbers mean.

We have a native application that only sends fatal errors to Sentry and does so using the Crashpad backend. No other events are sent right now.

There is no rate limiting and there are no dropped crashes. Even if we were somehow missing events, I would expect the reported crash rate to be greater than the one calculated from the number of fatal events (or, in my case, events) reported.

There is the possibility that, after the crash-reporting backend shuts down, there is a crash before the Sentry session is closed. However, I'm assuming this is all done during sentry_close(), so again it is unlikely, and, as above, it would make the reported crash rate greater than the calculated one.

Since the reported crash rate is less than the calculated one, I conclude that neither of these things is happening.

As I mentioned before (and now can't find in your documentation), there is a note about macOS with Crashpad not always reporting correctly that a session has crashed, with the session appearing as abnormal instead. For the case above, we're then seeing a total crashed rate of 6.5 + 5.9 = 12.4, and the calculated rate is indeed lower than this at 8.6. That means either we are dropping a heap of crash events, or something else is causing abnormal session exits. As this is predominantly a laptop environment, would, for instance, a native app on a laptop going to sleep cause an abnormal session?
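To make the comparison above concrete, here is a rough sketch of the same arithmetic. It assumes that 6.5% is the abnormal session rate and 5.9% the crashed session rate, and that both are computed over the same 3054 sessions as the 262 crash events; none of this is confirmed by the Sentry docs, and the variable names are made up for illustration.

```python
total_sessions = 3054  # total sessions, from this report
crash_events = 262     # fatal events received, from this report
crashed_pct = 5.9      # reported crashed session rate (rounded)
abnormal_pct = 6.5     # assumed to be the reported abnormal session rate

# Combined rate if every abnormal session were really a crash.
combined_pct = crashed_pct + abnormal_pct

# Sessions that rate would correspond to, versus events actually received.
implied_sessions = total_sessions * combined_pct / 100.0
unexplained = implied_sessions - crash_events

print(f"combined rate:          {combined_pct:.1f}%")
print(f"implied sessions:       {implied_sessions:.0f}")
print(f"crash events received:  {crash_events}")
print(f"sessions without event: {unexplained:.0f}")
```

Under these assumptions, roughly 117 crashed-or-abnormal sessions have no matching crash event, which is the gap the two explanations above (dropped events, or abnormal exits that are not crashes) would have to cover.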

Do you know whether a crash on macOS with Crashpad is always reported as abnormal? Given the stats above, I don't think that's possible.

The reason I'm asking is that, without understanding how these stats are actually calculated and what contributes to them, I don't have much confidence that I know what I'm tracking.

Alternatively, is there a possibility of adding to the user interface the crash rate as calculated by crashes submitted / session count, so we can see trends in that statistic?

leedongwei commented 3 weeks ago

Hi! I'm reaching out internally to the original engineers who built the feature for more context.

schew2381 commented 2 weeks ago

@jonbooth-serato Hi there, sorry for the delay. I reached out to some people and they directed me to the docs you mentioned before.

It appears that the Crashpad backend on macOS cannot reliably determine the status of a crashed session. I tried to find the specific technical reason behind this, but the most I could find was this explanation here in the SDK.

Some more digging suggests that the crashed session rate on macOS uses a heuristic described in https://github.com/getsentry/sentry-native/pull/344 and merged in https://github.com/getsentry/sentry-native/pull/335