getsentry / sentry

Developer-first error tracking and performance monitoring
https://sentry.io
Other
38.86k stars 4.17k forks source link

Support native crash dumps uploaded in minidump_stackwalk format #20298

Closed milianw closed 4 years ago

milianw commented 4 years ago

Summary

In addition to uploading raw minidump files, it would be ideal if pre-processed raw stack traces could be uploaded instead.

Motivation

The main problem with the current support for native crash dumps is that it requires uploading of each and every DSO that's encountered in any frame of the stack trace. As soon as this isn't the case, the really unreliable stack scanning approach is used, which is usually leading to totally bogus stack traces.

If the minidump files would be unwound locally (cf. minidump_stackwalk) and then the raw stack trace uploaded to sentry, the stack trace should be correct. Sentry would then only be responsible for the symbolication.

This allows another potential improvement: One doesn't need to upload the full (large) debug DSO files to sentry. Instead, it would be enough to upload the (compressed) output of dump_syms. That would solve a second problem we are seeing - namely the immense amount of disk space required to host all of the debug symbols to support multiple versions of our large native application.

Additional Context

Take a typical Linux application that's provided as an AppImage: One would not bundle the libc.so within that AppImage. If we hit a crash in libc.so (e.g. invalid free or similar), then sentry would fail to create a proper backtrace, unless one uploads the libc.so from all supported linux distributions.

If instead the unwinding is done locally, all DSOs are available and the .eh_frame section can be used to create a fully unwound stack. This would only contain instruction pointer addresses (IPs). Together with the memory map which is included in the output of minidump_stackwalk, sentry can then map the absolute IPs into relative DSO offsets and lookup the correct dump_syms file with the matching build-id to find the symbol name, and potentially add inline frames too.

If you have any questions to this, I'm willing to help you out. As the author of heaptrack and hotspot, I've learned a few things about how unwinding and symbolication works. Sadly, the current status of sentry's native crash support makes it often impossible to investigate crash reports, as the stacks are completely bogus. The suggestion above should alleviate the problem.

Otherwise: Thanks a lot for creating Sentry - it has a lot of potential!

jan-auer commented 4 years ago

Hi @milianw. Thanks for this suggestion and the detailed context you're providing. I believe that what you're looking for can already be achieved today -- although in a slightly altered form.

Immediate Solution

Besides Minidump files, Sentry supports uploading errors and crashes in our JSON event payload format. This schema has the highest stability guarantees at Sentry, and we retain backwards compatibility. This is used by all our SDKs, including native SDKs that perform on-device stack walking, such as the Cocoa and Android (NDK) SDKs.

You can read the output of minidump_stackwalk and then directly convert it into the schema expected by our Servers without additional processing steps. The mapping is roughly:

This allows you to send a prewalked stacktrace to Sentry for symbolication, and attach additional information such as breadcrumbs, tags, or the release version. Please let us know if you have questions about the schema or if we could help with the mapping.

Click to expand example payload ```json { "timestamp": 1521713273, "contexts": { "os": { "name": "Windows", "version": "10.0.14393" }, "device": { "arch": "x86" } }, "exception": { "type": "EXCEPTION_ACCESS_VIOLATION_WRITE", "value": "Assertion Error: foo" }, "threads": [ { "crashed": true, "id": 1636, "registers": { "eax": "0x0", "ebp": "0x10ff670", "ebx": "0xfe5000", "ecx": "0x10ff670", "edi": "0x13bfd78", "edx": "0x7", "eflags": "0x10246", "eip": "0x2a2a3d", "esi": "0x759c6314", "esp": "0x10ff644" }, "stacktrace": { "frames": [ {"instruction_addr": "0x2a2a3d"}, {"instruction_addr": "0x2a28d0"}, {"instruction_addr": "0x7584e9bf"} ] } }, { "id": 3580, "registers": { "eax": "0x0", "ebp": "0x159faa4", "ebx": "0x13b0990", "ecx": "0x0", "edi": "0x13b4af0", "edx": "0x0", "eflags": "0x216", "eip": "0x771e016c", "esi": "0x13b4930", "esp": "0x159f900" }, "stacktrace": { "frames": [ {"instruction_addr": "0x771e0160"} ] } } ], "debug_meta": { "images": [ { "type": "pe", "code_id": "5ab380779000", "code_file": "C:\\projects\\crash.exe", "debug_id": "3249d99d-0c40-4931-8610-f4e4fb0b6936-1", "debug_file": "C:\\projects\\crash.pdb", "image_addr": "0x2a0000", "image_size": 36864 }, { "type": "pe", "code_id": "5a49bb75c1000", "code_file": "C:\\Windows\\System32\\rpcrt4.dll", "debug_id": "ae131c67-27a7-4fa1-9916-b5a4aef41190-1", "debug_file": "wrpcrt4.pdb", "image_addr": "0x75810000", "image_size": 790528 }, { "type": "pe", "code_id": "59b0d8f3183000", "code_file": "C:\\Windows\\System32\\ntdll.dll", "debug_id": "971f98e5-ce60-41ff-b2d7-235bbeb34578-1", "debug_file": "wntdll.pdb", "image_addr": "0x77170000", "image_size": 1585152 } ] } } ```

Supporting minidump_stackwalk

There is no technical reason against adding support for the output format of minidump_stackwalk. However, we believe in portable and reusable APIs for error and crash ingestion. Our event schema serves a multitude of platforms outside of native crashes, like Python or JavaScript. For this reason, in most cases we gravitate to evolve the existing protocol rather than adding new formats.

We do adopt widely spread formats that are otherwise hard to convert. As part of that, our ingestion pipeline supports Minidumps, UE4 crash archives, and Apple crash reports. I think that we can evaluate adding the minidump_stackwalk output format in the future, but at the moment it looks like there are viable alteratives to that.

System Symbols

I agree, on-device stackwalking has many advantages particularly when third-party symbols and system symbols are not available. For macOS, iOS and Windows, we maintain a comprehensive repository of system libraries particularly for stackwalking and basic symbolication (using symbol tables only), and we integrate with third parties such as the Microsoft Symbol Server.

Unfortunately, this falls short on Linux and Android. There have been recent advancements to introduce symbol servers for Linux, such as debuginfod, but we still have a long way to go to. On Android, fragmentation makes matters much worse and on-device stackwalking remains the only reliable strategy to achieve proper stack traces.

Stackwalking and Symbolication

I can currently see a few major challenges with running minidump_stackwalk on-device:

Status of Crash Support

Sadly, the current status of sentry's native crash support makes it often impossible to investigate crash reports, as the stacks are completely bogus.

This is a very valid point, and thank you for bringing this up. While we continue to integrate more symbol servers and supply unwind information for Linux and Android, there will always be missing symbols. With FPO and heavy optimization, there is no guarantee we can create sane stack traces in those cases.

There are a few things on our mind on how we are going to address this:

Thank you so much for offering your help. What do you think about the above, and what can we do to improve the service?

milianw commented 4 years ago

Hey @jan-auer, thanks a lot for this in-depth reply. I agree that your proposed immediate solution sounds like a valid approach! My searching through the documentation didn't bring this up, many of the other pages seem to indicate that "minidump" is the only supported format. Check out the following link for example:

https://docs.sentry.io/platforms/native/advanced-config/

On Windows, Linux, and macOS, the default backends send binary Minidump files. Sentry processes them to extract stack traces and exception information into a readable crash report.

Similarly, there's documentation on minidump (e.g. https://docs.sentry.io/platforms/native/guides/minidumps/) and other platforms, but the information you have added into this bug report here isn't as readily available on the documentation. From my side, I would suggest you just take what you wrote above and add it as a "manual" guide to native crash reporting - it's super valuable!

We will be trying this out on our side and then report back with separate issues as we encounter them. I totally believe this answers my request. minidump_stackwalk was mostly a ready-made tool that shows what one could do, I didn't indent to indicate that it must be the final solution.

Some more remarks to what you wrote:

Stackwalking and Symbolication

In our case, we integrate the symbol extraction into our CI, which means we can run whatever platform specific tool we need to.

Regarding space savings for split debug info: It would be great if sentry could support compressed files, as the size of the debug information can usually be dramatically reduced by applying gzip or similar. Though maybe the better solution is to rely on gcc's -gz, MSVC's /PDBCompress or similar...

We are not targeting server applications, so it's not feasible for us to ship the symbols to the client. We do have to keep the symbolication step on the server.

Status of Crash Support

I very much appreciate that you will be adding support for reprocessing crashes - that would be (have been) extremely useful for us.

Regarding the prewalked stack trace attached to a minidump - that sounds pretty interesting too. What I don't understand though is:

Sentry could use such a trace as a fallback when symbols are missing.

When symbols are missing, the stack frames would still be there - one isn't directly related to the other, except when it comes to inline frames? Do you maybe mean "when CFI are missing" or similar, which would lead to broken backtraces?

Finally, the approach you describe to use on-device stack walking (but not symbolication!) in your SDK sounds perfect to me. I'll try to find some time to put someone from my team on the task to essentially clone minidump_stackwalk into a tool that outputs a sentry JSON event payload instead. If we get to that, I'll try to share it somewhere - maybe it's going to be useful for you too then.

Thanks again