Investigating Bugsnag OOM Crashes

Augustyniak commented 3 years ago

Description

It's hard to tell what the reason for a given Bugsnag OOM crash was - whether a crash was happened by a memory leak, a retain cycle or just a high memory usage caused but the lack of optimizations.

Describe the solution you'd like A way to tell whether a given OOM crash was a result of normal app usage (where the applications just happens to consume too much memory because of the lack of optimizations) or whether there is a memory leak / retain cycle somewhere in the app.

Describe alternatives you've considered Using Xcode instruments to profile the app - one can profile only a small subsets of all of the possible applications' configurations that production users experience. Looking at a Bugsnag report - even with a lot of breadcrumbs in it - it's hard to replicate the state a user was in and be able to tell whether their OOM crash was a result of a retain cycle / memory leak.

Additional context

I do not have a clear idea for how this could be implemented but I wonder whether Bugsnag team has any suggestions / tips / plans for features that could make it easier to detect whether a given OOM crash was a result of a memory leak or a retain cycle.

mattdyoung commented 3 years ago

Hi @Augustyniak

Thanks for your thoughts. The challenge with OOM detection on iOS is that Apple doesn't provide an event hook for out of memory events. Bugsnag relies on our own heuristic to identify during app re-launch whether the previous termination of the app was an unexpected termination by the OS watchdog. We can identify several other known reasons the OS may kill the app such as device reboot, app upgrade, app hang on main thread, and we send an Out Of Memory error report that the app was likely terminated by the operating system while in the foreground if all other detectable causes have been ruled out.

For an OOM crash, we can't run any code at crash time and it's hard to predict in advance when a termination is likely to happen. So in terms of capturing breadcrumbs leading up to an OOM it's a balance between what would be useful to diagnose and what would be resource heavy to capture, as we don't want to have any significant impact app performance where the app may not terminate.

In theory leaks and retain cycles could be identified by scanning the heap memory, but we'd need to suspend threads when scanning which would cause a noticeable hang, and we'd need to determine when to perform a scan (possibly relying on memory warning notifications). And even if we detect a leak, identifying the root cause might need the stack trace at time of allocation and memory graph which can't reasonably be tracked in production apps.

We'd suggest continuing to consider which breadcrumbs are most useful to capture the application state, such as those to track the view controller lifecycle, and trying to replicate that state when profiling in Xcode.

We're discussing this internally to consider whether there is any other information we could feasibly capture in OOM reports to help with diagnosing these issues.

Augustyniak commented 3 years ago

Thank you for your response @mattdyoung.

Out Of Memory error report that the app was likely terminated by the operating system while in the foreground if all other detectable causes have been ruled out

Do you have any estimate for the reliability of detection of "Out Of Memory" crashes? Do false positives happen and how often do they happen (percentage-wise)?

When it comes to scanning heap to detect cycles / leaks - thank you for the explanation. I agree that it's probably not worth it if it impacts the performance of an app.

Some of the ideas for how to increase the visibility into OOM crashes:

Add information about how much memory the application uses to Bugsnag crashes. This would allow us to compare the amount of memory used at the time of the OOM crash with the expected value of memory used by our app.
Add information about how much memory application uses to breadcrumbs for "app received memory warning". This could be helpful especially in cases when we can see multiple memory warnings in crash's breadcrumbs I think to see whether the application continues to consume more and more of memory.

I realize that both of these could be implemented by a customer of Bugsnag with the use of your public API but it may be worth adding to the SDK itself if it improves the experience of working with OOM crashes.

mattdyoung commented 3 years ago

Do you have any estimate for the reliability of detection of "Out Of Memory" crashes? Do false positives happen and how often do they happen (percentage-wise)?

No, we're not able to capture data on the different cases ourselves in real-world apps. We suspect there will be some terminations captured as "Out Of Memory" which aren't related to low memory e.g. the OS terminating the app due to the device overheating would have the same signature.

Thanks for the ideas! We are already considering what other diagnostics we can add to make OOM crashes more actionable and intend to add these to the default behavior of the SDK itself. This is likely to include additional breadcrumbs since for OOMs we can't snapshot diagnostics at crash time, so capturing memory usage or other state information in breadcrumbs leading up to a crash will prove most useful.

sethfri commented 3 years ago

I'm curious to hear what the status is on improving OOM diagnostics. My team receives quite a lot of them but has been having a difficult time root causing many.

Part of the problem is OOMs seem to be grouped together despite having wildly dissimilar stack traces. Have you considered modifying the grouping on these (app hangs have the same issue)?

luke-belton commented 3 years ago

Hey @sethfri

We've just released a new version of bugsnag-cocoa which detects Thermal Kill errors (where the OS terminates an app due to a device overheating). This was released in v6.12.0. These Thermal Kill errors would previously have been grouped together with OOMs, so you can now detect when devices have crashed as a result of a thermal critical condition.

When OOMs occur, we can't capture a stacktrace so generally we advise looking at breadcrumbs etc. to understand what was happening in the app in the lead up to the event.

For app hangs we do capture stacktraces and the events should be grouped accordingly. If you're seeing app hang events that you believe are not grouped correctly please could you write into support@bugsnag.com with links to some examples and we'd be happy to take a look for you?

firatagdas commented 3 years ago

I think most of the OOM crashes on Bugsnag are not accurate. I know that because we used Firebase Crashlytics in parallel to Bugsnag. Crashlytics caught the exact crash, but Bugsnag only referred to it as OOM. I love Bugsnag, but we can't rely on it.

You may need to revisit OOM reports IMO.

mattdyoung commented 3 years ago

@firatagdas That sounds strange. Could you email support@bugsnag.com with details of this crash as captured by Crashlytics so we can investigate and try to reproduce the issue?

hovox commented 3 years ago

hi @firatagdas . We are also using Crashlytics and Bugsnag, would be good to know in which cases Crashlytics handles crashes and Bugsnag not. We have tested in different scenarios and seems they work same way in case of non-oom crashes.

firatagdas commented 3 years ago

Hello, @hovox and @mattdyoung. I’ll prepare a case when i am available. But pretty busy this time around.

I know one of the case is accessing [[VungleSDK sharedSDK] currentSuperToken] in another thread before waiting VungleSDK initialization.

Vungle SDK is an Ad SDK. I’ll try to reproduce the issue.

hovox commented 2 years ago

Hey Bugsnag team, maybe it is reasonable to increase breadcrumbs max count (e.g. to 200) in case of OOM? Since we do not have stacktraces, we may need more info and hence longer time breadcrumbs for OOMs.

mattdyoung commented 2 years ago

Hey Bugsnag team, maybe it is reasonable to increase breadcrumbs max count (e.g. to 200) in case of OOM? Since we do not have stacktraces, we may need more info and hence longer time breadcrumbs for OOMs.

Hi @hovox - we are considering improvements to allow more breadcrumbs in general in the future, so I've flagged this OOM use case to consider as part of that analysis.

nickdowell commented 2 years ago

Hey Bugsnag team, maybe it is reasonable to increase breadcrumbs max count (e.g. to 200) in case of OOM? Since we do not have stacktraces, we may need more info and hence longer time breadcrumbs for OOMs.

In v6.22.0 we increased the default and maximum values for maxBreadcrumbs to 100 and 500, respectively.

bugsnag / bugsnag-cocoa

Investigating Bugsnag OOM Crashes #1145

Description