WebAssembly / tool-conventions

Conventions supporting interoperability between tools working with WebAssembly.
Artistic License 2.0

Original Code Section Offset in Split DWARF Files #155

Open mitsuhiko opened 3 years ago

mitsuhiko commented 3 years ago

Browsers report the absolute offset within a WASM file and do not have access to the code section offset, so debugging tools that operate only on such offsets (such as Sentry or other crash reporting services) need to calculate the original offset themselves. Right now, when split DWARF information is used, we work around this by forcing the separate debug file to retain all original sections, including the Code section at the right offset. (We use the proposed build_id #133 section to match the files, but the same issue arises when external_debug_info is used.)

There are three potential options here I see:

  1. Provide an original_code_offset section with the offset of the original code section and embed that in split debug info files.
  2. Change browser rendering (and suggest the same for runtimes) to not only report the absolute offset but also the code relative offset in stack traces.
  3. Add an API to access the code offset to the browser runtime so that the correct relative offset can be sent to crash reporting services.
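To illustrate what a crash reporting tool has to compute today, here is a minimal sketch that walks the section headers of a wasm binary, finds the Code section, and converts a module-absolute offset into a code-relative one. The function names are hypothetical, and the sketch assumes code-relative offsets are measured from the start of the Code section payload (after the section id and size); check that against the toolchain that produced the DWARF.

```python
from typing import Optional, Tuple

WASM_MAGIC = b"\0asm"
CODE_SECTION_ID = 10  # section id of the Code section in the binary format


def read_uleb128(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned LEB128 integer; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def code_section_offset(wasm: bytes) -> Optional[int]:
    """Return the file offset of the Code section payload, or None if absent."""
    if wasm[:4] != WASM_MAGIC:
        raise ValueError("not a WebAssembly binary")
    pos = 8  # skip the 4-byte magic and the 4-byte version
    while pos < len(wasm):
        section_id = wasm[pos]
        size, payload_start = read_uleb128(wasm, pos + 1)
        if section_id == CODE_SECTION_ID:
            return payload_start
        pos = payload_start + size  # jump to the next section header
    return None


def to_code_relative(absolute_offset: int, code_offset: int) -> int:
    """Convert a module-absolute offset (as in stack traces) to a code-relative one."""
    return absolute_offset - code_offset
```

This only works when the tool still has a file in which the Code section sits at its original position, which is exactly the constraint the options above try to remove.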

Aside: In general I think there are some open questions about how this is supposed to work in practice. We're also running into issues matching stack traces to the correct wasm files because of the limitations of the stack trace format. If one uses WebAssembly.instantiate with a buffer instead of WebAssembly.instantiateStreaming to load WebAssembly, the stack trace format in browsers is completely inadequate for figuring out which WebAssembly module a frame belongs to. As an alternative (if build_ids become an accepted format) it might be preferable to add build IDs and offsets relative to the build ID to the stack trace (e.g. http://localhost:8088/lib1.wasm:wasm-function[1]:0x86 @ 483a64fa956ad1c848328c52f15dcc0bce1ca232+0x2 or something similar).
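As a rough sketch of how a consumer might read such a frame, the snippet below parses the location string from the example above, with the build ID suffix treated as optional. The suffix format is only the one floated in this comment, not something browsers emit, and the names `WasmFrame` and `parse_wasm_frame` are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Matches "<url>:wasm-function[<index>]:0x<offset>", optionally followed by
# the proposed " @ <build_id>+0x<relative offset>" suffix.
FRAME_RE = re.compile(
    r"(?P<url>\S+):wasm-function\[(?P<index>\d+)\]:0x(?P<offset>[0-9a-fA-F]+)"
    r"(?:\s+@\s+(?P<build_id>[0-9a-fA-F]+)\+0x(?P<rel>[0-9a-fA-F]+))?"
)


class WasmFrame(NamedTuple):
    url: str
    function_index: int
    module_offset: int               # absolute offset within the wasm file
    build_id: Optional[str]          # only present with the proposed suffix
    relative_offset: Optional[int]   # offset relative to the build ID anchor


def parse_wasm_frame(location: str) -> Optional[WasmFrame]:
    """Parse a wasm frame location; return None if it does not match."""
    match = FRAME_RE.fullmatch(location.strip())
    if match is None:
        return None
    rel = match.group("rel")
    return WasmFrame(
        url=match.group("url"),
        function_index=int(match.group("index")),
        module_offset=int(match.group("offset"), 16),
        build_id=match.group("build_id"),
        relative_offset=int(rel, 16) if rel is not None else None,
    )
```

With the suffix present, a symbolication service could look up the debug file by `build_id` alone, without relying on the URL, which is missing or meaningless for buffer-based instantiation.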

tlively commented 3 years ago

As you mentioned, the standard display format uses module offsets rather than code offsets. I'm probably missing something obvious, but why do Sentry and other tools use code offsets instead?

dschuff commented 3 years ago

My experience (mostly with LLVM-ecosystem tools, Binaryen, Wabt, and DWARF) has also been that code section offsets are more generally useful than module offsets; one reason is that most of those use cases work (or at least start) with object files, and object files always use section offsets which are relocatable anyway (this is also true even for ELF/MachO, where relocatable section offsets are used instead of code addresses).

But also IIRC on wasm, DWARF uses section offsets even for linked binaries (whereas ELF uses virtual addresses). My guess is that this is because this makes linking much simpler (since the linker doesn't need to worry about how large the other known sections are when it does the code section layout and relocation); perhaps @sbc100 remembers? This independence from other sections also could make post-link processing by other tools easier. (ELF doesn't have this problem because the virtual address space is independent of the binary).

I'm not sure I quite understand the problem you mention with external debug info though. Are you referring to "split" debug info (i.e. -gsplit-dwarf, where the debug info is split into N pieces, N being the number of object files, and not linked at all)? Or do you just mean emscripten's -gseparate-dwarf flag, where the debug sections are stripped out and replaced with an external_debug_info section? I'm guessing the latter. In that case though, IIUC the code section offset in the final wasm binary should be the same as it was before the debug info was stripped out. I guess I assumed that if you are keeping debug info, you'd also want to keep all of the original sections anyway?

Having said that though, I definitely agree that this mismatch is annoying and that we can improve things. Your suggestion 1 would be pretty easy. Currently (due to a limitation in LLVM's strip/objcopy functionality) the original code section actually remains in the .debug.wasm file rather than being stripped out when using -gseparate-dwarf. I intend to eventually fix that, and in that case it would make sense and be straightforward to replace the code with some metadata including the original code offset. (As an aside, I imagined we'd also strip out all the other known sections, such as exports. It's plausible that a debugger would want that info too; as I said, I had imagined that anyone archiving debug info would also want to archive the rest of the binary too).
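For a sense of how small that metadata could be, here is a sketch that encodes a custom section named original_code_offset (the name from suggestion 1). The payload layout, a single unsigned LEB128 value holding the offset, is purely an assumption for illustration; no such section has been specified, and the helper names are hypothetical.

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def custom_section(name: str, payload: bytes) -> bytes:
    """Build a wasm custom section: id 0, size, length-prefixed name, payload."""
    name_bytes = name.encode("utf-8")
    body = encode_uleb128(len(name_bytes)) + name_bytes + payload
    return b"\x00" + encode_uleb128(len(body)) + body


def original_code_offset_section(code_offset: int) -> bytes:
    """Hypothetical section: the payload is the original code offset as ULEB128."""
    return custom_section("original_code_offset", encode_uleb128(code_offset))
```

A real design would probably want room for more metadata than one integer, as noted above, so treat the single-value payload as a placeholder.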

Suggestion 2 would be harder since it could be a breaking change for tools that parse the stack trace output. But maybe appending to it could work.

Suggestion 3 could also work. Currently the Module object has an accessor for the custom section data in addition to the imports and exports; we could presumably add a way to get the code section offset, or perhaps even some more metadata about the binary or the sections.

And finally: yes, it is a bit unfortunate that arraybuffer-based instantiation is basically equivalent to eval and lacks a good way to identify the module. In principle, the wasm binary could have a name section (with just a module name, if size is an issue) whose module name would be added to the function name (even to the "generic" function name), e.g. modulename.functionname or modulename.wasm-function[1]. I'm not sure if that happens in practice; if not, we should fix it. But I also agree that a build ID has uses too; it probably makes sense to push more on that proposal as well.