Finish and capture transaction/span bound to the Scope in case of a fatal crash

kahest commented 2 years ago

Description

For now the transaction gets lost in case of a fatal crash. We should finish it with an appropriate status before capturing the exception, if possible.

See https://github.com/getsentry/team-mobile/issues/62

philipphofmann commented 2 years ago

At the moment, I can think of two ways to achieve this.

Option A

In case of a fatal crash, we could call into ObjC to finish the ongoing transactions after SentryCrash has written the crash report to disk. We already use this approach for screenshots and view hierarchy, but it wouldn't work for all types of crashes. This would be the best effort approach. https://github.com/getsentry/sentry-cocoa/blob/58f558dc58731af7d6522987fe5e95b0440025a0/Sources/SentryCrash/Recording/SentryCrashC.c#L111-L117

Option B

To make this work for all types of crashes, we would need to sync the transaction from ObjC to C memory and store it alongside a crash report. This approach is similar to syncing the scope data to SentryCrash. https://github.com/getsentry/sentry-cocoa/blob/58f558dc58731af7d6522987fe5e95b0440025a0/Sources/Sentry/include/SentryCrashScopeObserver.h#L6-L19

This approach would add some overhead, as every change to a span or transaction would trigger a sync to c memory. To keep this performant a fine-grained sync is a must, which only syncs the current change of a span or transaction to SentryCrash, instead of the whole transaction.

brustolin commented 2 years ago

I believe this is an edge case that don't worth doing option B.

github-actions[bot] commented 1 year ago

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Backlog or Status: In Progress, I will leave it alone ... forever!

"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

aellett commented 3 weeks ago

Has anyone revisited this lately? Was surprised this morning when I dug into this functionality and saw that it was different between our iOS and Android (sentry-java) apps.

philipphofmann commented 3 weeks ago

@aellett, we haven't revisited this lately. It's still on our backlog.

This is different because you can still call Java code when your application terminates due to a crash on Java. On Cocoa, strictly speaking, we must only call C code once the app crashes, but we sometimes have to partially bend the rules of what is acceptable. For more information check our develop docs for signal handlers. We already do that to add screenshots and view the hierarchy to crashes. Therefore, I vote for option A. We will revisit the priority for this, but I can't give you an ETA.

aellett commented 2 weeks ago

@aellett, we haven't revisited this lately. It's still on our backlog.

This is different because you can still call Java code when your application terminates due to a crash on Java. On Cocoa, strictly speaking, we must only call C code once the app crashes, but we sometimes have to partially bend the rules of what is acceptable. For more information check our develop docs for signal handlers. We already do that to add screenshots and view the hierarchy to crashes. Therefore, I vote for option A. We will revisit the priority for this, but I can't give you an ETA.

Yeah, option A sounds reasonable to me. I saw mention above that it wouldn't work for some types of crashes. Can you tell me at a high level what types of crashes it wouldn't cover?

philipphofmann commented 2 weeks ago

@aellett, can you please describe your use case for this feature a bit? Could it be that fixing https://github.com/getsentry/sentry-cocoa/issues/4375 solves your underlying problem? As this feature would bend the rules of signal handlers a bit, we're reluctant to add it even behind a feature flag.

I saw mention above that it wouldn't work for some types of crashes. Can you tell me at a high level what types of crashes it wouldn't cover?

I can't recall precisely which types of crashes. You aren't allowed to call the ObjC code when you're crashing and receiving a signal from your signal handler. In some cases, it might work, and in others not. When we tested this for view hierarchy and screenshots, it usually worked. The worst thing that can happen is losing the transaction, and signal handlers registered after ours won't run correctly. We ensure that we only call the sync safe code until the SDK has written the crash report to disk so we don't lose that important information.

aellett commented 2 weeks ago

@aellett, can you please describe your use case for this feature a bit? Could it be that fixing #4375 solves your underlying problem? As this feature would bend the rules of signal handlers a bit, we're reluctant to add it even behind a feature flag.

What we're looking for is that we want to measure the availability (for lack of a better word) of a given transaction. How many times did the flow within the transaction fail relative to how many times it ran. For our purposes, I think we'd count a failure as either a fatal crash or a call to SentrySDK.capture(error). We can handle the second case in a couple different ways, but I don't see a way (currently) to include fatal crashes in that calculation.

Regarding #4375, I don't think it would help us to calculate this, because I think the proposed solution was to not set the transaction field for fatal crashes. If that field were set to the name of the transaction that was running during the crash, we could try to count those errors, but then I think we'd have to do some extra math to get the availability of that transaction and I don't think we could do that within a dashboard or something. Even if we did have that association, I might still be a little concerned because we sample our transactions and errors at different rates, so I'd be worried that our availability would be off.

I can't recall precisely which types of crashes. You aren't allowed to call the ObjC code when you're crashing and receiving a signal from your signal handler. In some cases, it might work, and in others not. When we tested this for view hierarchy and screenshots, it usually worked. The worst thing that can happen is losing the transaction, and signal handlers registered after ours won't run correctly. We ensure that we only call the sync safe code until the SDK has written the crash report to disk so we don't lose that important information.

Thanks for the extra detail.

philipphofmann commented 2 weeks ago

Ah, now I think I get it. So you want the failure_rate in Sentry to include crashes? Or do you measure the availability of your transactions?

That's a valid use case, and it's worth the effort. I will discuss this with the team and get back to you.

You could achieve something like this by always storing the current transaction data in the scope. The SDK stores the scope with the crash report. Then, on the next app launch, you could check in before if the event is a crash and then manually recreate parts of the transaction and mark it as errored by grabbing that information from the crash event. But you also might want to remove this extra scope data for all the other events. That's quite hacky, but it could work until we get to this. I can explain this in more detail if that sounds like something you want to do.

philipphofmann commented 2 weeks ago

Another idea: we could go with option B but only serialize parts of the transaction. Maybe only the root span, including some metadata, could be enough.

aellett commented 1 week ago

Or do you measure the availability of your transactions?

This. We want to wrap a journey through a feature in a transaction, and then measure the availability of that transaction. Whenever the journey has an error (doesn't matter if it's handled or unhandled), that should count against the availability. Sort of tangential to this: as part of this effort, we're trying to figure out the best way to set the status on a root transaction when a handled error happens. I was thinking that it (the root transaction) would have a status of something other than .ok after I called SentrySDK.capture(error), but that didn't seem to be the case while I was testing it.

Maybe only the root span [...], could be enough.

I think that this would be enough for what we're trying to do. When there is a crash I mostly care that 1) I can see the error in Sentry and 2) There was a transaction with the correct name that has a status of something besides .ok. The first one is obviously already happening, so any movement towards the second one would be brilliant.

philipphofmann commented 1 week ago

We plan to start with option A behind a feature flag. If it doesn't work correctly, we can reconsider option B.

getsentry / sentry-cocoa