WebAssembly / debugging

Design documents and discussions about debug support in WebAssembly

DWARF for WebAssembly Target #1

Open yurydelendik opened 6 years ago

yurydelendik commented 6 years ago

I started documenting the findings from my work on saving LLVM debug information as custom sections (see D44184 and D45118). LLVM prefers the DWARF format, and it is feasible to package the entire DWARF data into wasm custom sections and convert it into source maps for wasm binaries later. These findings can be found at https://github.com/yurydelendik/webassembly-dwarf/. I also attempted to match the findings against @fitzgen's WebAssembly Debugging Capabilities, for information purposes only.

This issue is an attempt to open a discussion about whether it is valuable to continue packaging debug data as custom sections in DWARF format.

yurydelendik commented 5 years ago

D52634 is now open. Let's outline the short-term goal here. The long-term goal is documented in the findings document, which I recently updated.

The debug information produced by LLVM cannot expose WebAssembly-specific items in DWARF expressions. I considered using register operands to express WebAssembly locals, globals, the operand stack, etc. That complicates the logic around encoding and decoding such items, and it still does not provide any value: the WebAssembly-specific DWARF expressions need to be transformed or special-cased during evaluation.

So the short-term goal is to extend DWARF expressions with WebAssembly-specific items such as locations for locals, globals and stack operands. The initial idea was to add a DW_OP_WASM_location operation, with code 0xED and two operands, to the DWARF expression language.

Instead, the DW_OP_breg operations are used to express a WebAssembly location. The first operand (the register number) defines the type of the WebAssembly location and is encoded as a ULEB128 constant. The second operand (the offset) is a location item index and is encoded as an SLEB128 constant.

| Type | Description | Second Parameter |
|------|-------------|------------------|
| 1 | Local or Argument | The index of a WebAssembly function local/parameter |
| 2 | Global | The index of a WebAssembly module global |
| 3 | Operand Stack Item | The depth of the item (0 points to the bottom of the stack) |

The dump tools, such as llvm-dwarfdump, will be extended to display such expressions in a format like wasm-location(type, index). Existing debuggers (e.g. lldb or gdb) and other tools will not be able to decode such expressions, and that is okay since they cannot evaluate them in a WebAssembly context anyway. The idea for the latter is to transform WebAssembly-specific locations into native locations (registers, memory, etc.) during AOT or JIT compilation.
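
A minimal sketch of the encoding described above, assuming the two-operand form maps onto the standard DW_OP_bregx opcode (0x92), whose operands are a ULEB128 register number and an SLEB128 offset. The helper and constant names are illustrative only and are not taken from the LLVM patches; the output format follows the wasm-location(type, index) convention mentioned above.

```python
DW_OP_BREGX = 0x92  # standard DWARF opcode (assumed to be the one meant here)

# Location types from the table above.
WASM_LOCAL, WASM_GLOBAL, WASM_OPERAND_STACK = 1, 2, 3


def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("ULEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_sleb128(value: int) -> bytes:
    """Encode a signed integer as signed LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # Python's shift is arithmetic for negative ints
        sign_bit = byte & 0x40
        if (value == 0 and not sign_bit) or (value == -1 and sign_bit):
            out.append(byte)
            return bytes(out)
        out.append(byte | 0x80)


def decode_leb128(data: bytes, pos: int, signed: bool):
    """Decode one LEB128 value at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            if signed and byte & 0x40:
                result -= 1 << shift  # sign-extend
            return result, pos


def encode_wasm_location(kind: int, index: int) -> bytes:
    """Build the expression bytes: opcode, ULEB128 type, SLEB128 index."""
    return bytes([DW_OP_BREGX]) + encode_uleb128(kind) + encode_sleb128(index)


def format_wasm_location(expr: bytes) -> str:
    """Render an expression the way a dump tool might display it."""
    if expr[0] != DW_OP_BREGX:
        raise ValueError("not a DW_OP_bregx expression")
    kind, pos = decode_leb128(expr, 1, signed=False)
    index, _ = decode_leb128(expr, pos, signed=True)
    return f"wasm-location({kind}, {index})"


print(format_wasm_location(encode_wasm_location(WASM_GLOBAL, 5)))
```

Note that an index of 64 or more needs two SLEB128 bytes, because bit 6 of the final byte is the sign bit.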

Current TODOs:

yurydelendik commented 5 years ago

Per the reviewers' request, I am replacing DW_OP_WASM_location with DW_OP_breg operators. See the changes above.

justinclift commented 5 years ago

With that doc, should the date on the sub heading be updated from "21 May 2018"? It sounds like it hasn't been changed in ~6 months. 😄

yurydelendik commented 5 years ago

With that doc, should the date on the sub heading be updated from "21 May 2018"? It sounds like it hasn't been changed in ~6 months. 😄

The "WebAssembly Debugging Capabilities" doc is 6 months old, yes. The "DWARF for WebAssembly Target" doc at https://yurydelendik.github.io/webassembly-dwarf/ was last modified on 13 December 2018.

justinclift commented 5 years ago

Thanks @yurydelendik. 😄

AndrewScheidecker commented 5 years ago

In this proposal, I'm using a "function label index space" to refer to labels from the binary name section.

I wrote a rationale for using function label indices over binary offsets that I'll paste here:

This proposal defines a "function label index" space in addition to the scoped label index space already defined in the spec. This function label index space assigns indices to the labels in a function in the order their corresponding structured control instruction occurs in the function body.
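
As a rough illustration of that definition, the sketch below numbers the labels of one function in the order their structured control instructions appear. It assumes a function body flattened to a list of opcode names and assumes that block, loop and if are the label-introducing instructions; the function name is hypothetical.

```python
# Structured control instructions that introduce a label (assumed set).
LABEL_INTRODUCING = {"block", "loop", "if"}


def function_label_indices(body):
    """Map each label-introducing instruction's position to its function label index."""
    indices = {}
    for position, opcode in enumerate(body):
        if opcode in LABEL_INTRODUCING:
            # Indices are handed out densely, in order of occurrence.
            indices[position] = len(indices)
    return indices


body = ["block", "local.get", "loop", "br_if", "end", "if", "nop", "end", "end"]
labels = function_label_indices(body)
print(labels)
```

Inserting or removing a non-control instruction such as nop shifts positions in the list but leaves the label indices themselves unchanged.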

Other proposals that deal with references to instructions seem to use binary format offsets:

  1. DWARF for WebAssembly
  2. The display conventions for WebAssembly locations
  3. Source maps applied to WebAssembly binaries

The reasons for using the function label index space over a binary offset in this proposal are:

  1. Any in-memory representation of the WebAssembly abstract syntax must already be able to encode the order of the structured control instructions, and that order implicitly defines the function label index space. Using binary offsets would require an implementation to keep track of the binary offset a structured control instruction was read from.
  2. The function label index space also densely maps integers to names, while binary offsets would sparsely map integers to names. That makes it more practical to use a simple data structure to store the names in memory.
  3. Finally, a binary offset would mean that a far greater proportion of WebAssembly module transforms would need to also transform the binary offsets in the name section. The function label index space is not changed by any transform that doesn't change the structured control instructions.

I believe parts of this rationale also apply to the code addresses in DWARF for WebAssembly; you reference instructions through a binary offset relative to the code section. Have you considered using an index into a function's instruction sequence instead?

yurydelendik commented 5 years ago

Have you considered using an index into a function's instruction sequence instead?

Yes, I did. A function index plus an instruction index creates a compound key, which is harder to maintain, as is a module-wide instruction index. Since most WebAssembly tooling has adopted bytecode offsets, there is no reason to invent something that will not be useful: optimization tools change the instructions and their order, so instruction indices change as well, giving no advantage over bytecode offsets for processing debug information.

AndrewScheidecker commented 5 years ago

It's true that it will be harder for a transform to avoid the need to transform the DWARF sections, but there are some useful transforms that don't change the instruction sequence of a function:

  1. Translating a binary module to a text module renders the binary offsets meaningless.
  2. A "layer 1" WASM binary format would have a hard time preserving DWARF code offsets.
  3. Linking WASM object files needs some special relocation types for binary offsets that AFAICT are only needed to relocate DWARF binary offsets. (see R_WASM_FUNCTION_OFFSET_I32 and R_WASM_SECTION_OFFSET_I32 in https://github.com/WebAssembly/tool-conventions/blob/master/Linking.md).

yurydelendik commented 5 years ago

Translating a binary module to a text module renders the binary offsets meaningless.

Annotations need to be introduced into the text format, similar to LLVM's .ll approach. I don't think it is reasonable to preserve the DWARF section data as-is during a round trip through the text format without annotated instructions and data segments.

Linking WASM object files needs some special relocation types for binary offsets that AFAICT are only needed to relocate DWARF binary offsets. (see R_WASM_FUNCTION_OFFSET_I32 and R_WASM_SECTION_OFFSET_I32 in https://github.com/WebAssembly/tool-conventions/blob/master/Linking.md).

The R_WASM_SECTION_OFFSET_I32 relocation will be needed for DWARF sections regardless of whether code is keyed by bytecode offset or by index. R_WASM_FUNCTION_OFFSET_I32 would be replaced with something different to relocate a compound "(function index, instruction index)" key reference.

AndrewScheidecker commented 5 years ago

Annotations need to be introduced into the text format, similar to LLVM's .ll approach. I don't think it is reasonable to preserve the DWARF section data as-is during a round trip through the text format without annotated instructions and data segments.

The reason why it's easier to round trip DWARF through the text format applies the same whether you're translating the binary DWARF data to some annotated text syntax or not.

The abstract syntax that ties your binary and text format together will need some way to represent code addresses that's independent of their binary serialization. If the serialized format uses binary offsets, then serializing those code addresses will be tightly coupled to serializing the code section. For example, when you serialize the code section, you'll need to produce a map between instruction indices and binary offsets that you can use when you serialize the DWARF sections.
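
To make that coupling concrete, here is a small sketch of a serializer that records, for each instruction index, the binary offset at which the instruction was written. It assumes the instructions are already encoded as byte strings; the function name and the start_offset parameter are hypothetical.

```python
def serialize_with_offsets(encoded_instructions, start_offset=0):
    """Concatenate pre-encoded instructions and record where each one starts.

    Returns (code_bytes, offsets), where offsets[i] is the binary offset of
    instruction i, counted from start_offset.
    """
    out = bytearray()
    offsets = []
    for encoded in encoded_instructions:
        offsets.append(start_offset + len(out))
        out += encoded
    return bytes(out), offsets


# local.get 0; i32.const 1000; i32.add; end
instructions = [b"\x20\x00", b"\x41\xe8\x07", b"\x6a", b"\x0b"]
code, offsets = serialize_with_offsets(instructions)
print(offsets)

# A DWARF writer that stores binary offsets would then look up
# offsets[instruction_index] for every code address it emits.
```

The map only exists after the code has been serialized, which is the ordering dependency described above.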

compound "(function index, instruction index)" key reference

I was thinking that the function index could be implied by the context, but I can see the DWARF format occasionally uses "code addresses" outside of the context of a function. Maybe it's possible to interpret code addresses to be function indices in some contexts (e.g. to define the code in a compilation unit) and instruction indices in other contexts.

yurydelendik commented 5 years ago

The reason why it's easier to round trip DWARF through the text format applies the same whether you're translating the binary DWARF data to some annotated text syntax or not.

A wasm->text->wasm round trip without changing the text is a very narrow, insignificant use case and should not drive the decision on how instructions are identified in the DWARF format, IMHO. In most cases (the superset), the text will be changed, which requires changes in the DWARF data.

function index could be implied by the context,

The .debug_line (and .debug_frame) section has no "contexts" and requires function instructions to be uniquely identified.
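
One way to picture this: a single offset relative to the code section identifies both the function and the position within it, with no surrounding context needed. The sketch below assumes a sorted list of half-open (start, end) byte ranges for the function bodies; the ranges and the function name are made up for illustration.

```python
from bisect import bisect_right


def find_function(body_ranges, offset):
    """Return the position in body_ranges of the function containing offset.

    body_ranges is a sorted list of half-open (start, end) byte ranges, one
    per function body, relative to the start of the code section.
    Returns None when the offset falls outside every body.
    """
    starts = [start for start, _ in body_ranges]
    i = bisect_right(starts, offset) - 1
    if i >= 0 and offset < body_ranges[i][1]:
        return i
    return None


# Hypothetical layout: the gaps are the size prefixes between bodies.
ranges = [(3, 10), (11, 40), (41, 45)]
print(find_function(ranges, 11))
```

With a compound (function index, instruction index) key, the same lookup would need two values for every row of the line table.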

dschuff commented 4 years ago

There seems to be general consensus that some flavor of DWARF is what we want for at least the LLVM-based family of languages, and there are now several interoperable implementations of this spec. Can we check something into this repo that describes the current prototype? Then it would be easier to open specific issues or PRs than to add more comments here. E.g. I want to talk some more about the topic above (section vs binary offsets vs some abstract index space), but it would be better to have separate topics. Maybe an md doc would be best for now, for easy editing and since we don't know whether we eventually want Bikeshed or something else for the doc in the future.

codefromthecrypt commented 2 years ago

https://github.com/yurydelendik/webassembly-dwarf/ is a valuable go-to for folks increasingly asking about DWARF in WebAssembly. However, it is a personal project, and it seems odd for a large ecosystem to rely on it as a primary source. It also puts an undue burden on the owner's personal time for things like answering issues, as I don't think it was meant to replace W3C work, but rather to stand in until something happens here. Since this issue was opened, I'm pretty sure several large projects have started using this information in how they do DWARF in wasm.

Is there any way this can become canonicalized here or in the spec repo? If not now, how many implementations need to use another site ad-hoc until it becomes relevant? If there's some sort of bar to get over I can help hunt as I suspect we've already crossed it by now.

cc @rossberg

rossberg commented 2 years ago

@codefromthecrypt, the spec repo only contains documents that have gone through the process and that the WG has officially adopted as standards.

I think you mean whether dwarf support could be a repo under the WebAssembly organisation. For that, the champion would have to bring it to the CG as a proposal and ask for a vote.

codefromthecrypt commented 2 years ago

thanks for the response @rossberg! Anyone who knows can answer below if possible.

dtig commented 2 years ago

thanks for the response @rossberg! Anyone who knows can answer below if possible.

  • Does "champion" mean something besides a motivated individual? IIRC w3c membership requires sponsorship.

The champion is indeed a motivated individual (or a group of individuals) interested in pushing a feature forward. The Wasm Community group is free to join.

  • Can you give an example of a proposal that passed (e.g. was the proposal started as a GitHub issue)?

The phases process of standardizing a feature is described here. Most features do start with a design issue, and then get moved into the WebAssembly organization as a proposal. Here is the list of finished proposals that have been merged into the spec after progressing through the process linked.

  • What does CG mean?

CG is the WebAssembly community group.

  • Do we know of any existing WebAssembly members who have a high stake in DWARF but have just never gotten around to proposing it?

I'll defer this question to @dschuff or @yurydelendik.

dschuff commented 2 years ago

I actually think it would probably make sense to just start by adding the DWARF description to a doc in https://github.com/WebAssembly/tool-conventions/ which is where we also document related things such as the wasm object file format, LLVM C ABI/calling conventions, etc. That's just a matter of putting the information in a convenient format and making sure it still matches the reality of what e.g. LLVM is generating. Currently I don't know of any other toolchain besides LLVM that generates DWARF like this. If that stays the case, then it may not make sense to go for standardization. But I would be very interested in hearing of other producers or consumers.

codefromthecrypt commented 2 years ago

Thanks for the advice, @dtig and @dschuff. I understand the process and also what seems to be a shortcut to get started.

I actually think it would probably make sense to just start by adding the DWARF description to a doc in https://github.com/WebAssembly/tool-conventions/ which is where we also document related things such as the wasm object file format, LLVM C ABI/calling conventions, etc. That's just a matter of putting the information in a convenient format and making sure it still matches the reality of what e.g. LLVM is generating.

Some start like this makes sense. @7ombie @tromey @Jiboo @rianhunter @pfaffe @ggreif I know you contributed to https://github.com/yurydelendik/webassembly-dwarf, but I'm not sure if you are still active and have a stake. If so, do you have anything to add to what should be in the first iteration of that doc? Future iterations can follow. If DWARF+Wasm is no longer relevant to you, please unsubscribe and forgive my spamming you.

Currently I don't know of any other toolchain besides LLVM that generates DWARF like this. If that stays the case, then it may not make sense to go for standardization. But I would be very interested in hearing of other producers or consumers.

wazero is the project I work on which has no dependencies, so doesn't rely on LLVM. My stake here is to help @r8d8 implement this with the best guidance possible https://github.com/tetratelabs/wazero/issues/58

If anyone else spammed here is working in a way that doesn't end up using LLVM anyway, please respond if you can. Let's get the first proposed PR with the best context!

RReverser commented 2 years ago

I actually think it would probably make sense to just start by adding the DWARF description to a doc in WebAssembly/tool-conventions

We sort of did that by linking to yurydelendik/webassembly-dwarf back in https://github.com/WebAssembly/tool-conventions/issues/148, but moving the DWARF integration doc itself into the repo also makes sense to me, assuming there are no objections from @yurydelendik.

codefromthecrypt commented 2 years ago

FWIW for me, moving the content is best as that widens the net of folks that can help maintain it, and formalizes an understanding beyond a personal repo. If we can't move it, we should recreate something similar.

RReverser commented 2 years ago

Currently I don't know of any other toolchain besides LLVM that generates DWARF like this.

FWIW (not that it contradicts your statement) Golang tried that too, but ran into some issues where their DWARF was not valid, and I think the author abandoned the PR afterwards. See https://go-review.googlesource.com/c/go/+/283012/ and https://github.com/golang/go/issues/33503.

yurydelendik commented 2 years ago

moving the DWARF integration doc itself into the repo also makes sense to me, assuming there are no objections

I have no objections. Let me know if something needs to be done on my end.

codefromthecrypt commented 2 years ago

@mbovel do you happen to know if graal's Wasm implements DWARF and/or if that is implicitly done via LLVM? (ps nice job on your WASI tests! https://github.com/oracle/graal/tree/89e4cfc7aeea69970b60c64cd075ceb2a104e864/wasm/src/org.graalvm.wasm.test/src/test/wasi )

mbovel commented 2 years ago

@codefromthecrypt no, GraalWasm does not support DWARF yet. (Thanks! I hope that there will be an official WASI test suite at some point 😄)

rianhunter commented 2 years ago

From a functional perspective the existing WebAssembly DWARF spec + LLVM implementation works well on my end. Only issue is that re-generating the DWARF tables in-memory for JITted code is relatively slow and those in-memory tables can be unexpectedly large (e.g. gigabytes). I've seen other run-times add options to ignore the DWARF tables but it would be nice if options like that weren't necessary. Not sure what e.g. Chrome does, it's entirely possible this is an implementation issue on my end. I haven't had time to further investigate since the core part of it just works but eventually I will get around to it.

I wonder if there is a way that we could structure the DWARF info in the WebAssembly binary such that it could be memcpy()'d in-place in memory and would only need some minor edits to get working with GDB + JITted code. E.g. leaving sufficient NULL bytes or No-ops in certain location descriptors so that they could be filled in later.
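
One possible shape for such reserved space, sketched under assumptions: if a value is LEB128-encoded, it can be written at a fixed width by padding with continuation bytes, so that it can later be overwritten in place without shifting the rest of the table. Whether GDB and other consumers accept these non-minimal encodings, and whether the fields that need patching are LEB128-encoded at all, is not established here. The function names and the 5-byte width are illustrative.

```python
def padded_uleb128(value: int, width: int = 5) -> bytes:
    """Encode value as ULEB128 using exactly `width` bytes."""
    if value < 0 or value >= 1 << (7 * width):
        raise ValueError("value does not fit in the reserved slot")
    out = bytearray()
    for i in range(width):
        byte = (value >> (7 * i)) & 0x7F
        if i < width - 1:
            byte |= 0x80  # continuation bit keeps the slot at a fixed width
        out.append(byte)
    return bytes(out)


def patch_slot(buffer: bytearray, offset: int, value: int, width: int = 5) -> None:
    """Overwrite a reserved slot in place; the buffer length does not change."""
    buffer[offset:offset + width] = padded_uleb128(value, width)


# A table with one reserved slot (initially zero) between two other bytes.
table = bytearray(b"\xaa" + padded_uleb128(0) + b"\xbb")
patch_slot(table, 1, 300)
print(table.hex())
```

Because the slot width is fixed, everything after it keeps its offset, which is what would make a plain copy followed by a few in-place edits workable.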

7ombie commented 2 years ago

The issue I'm having relates to using DWARF to debug WAT (I'm actually working on my own assembler, but the issue affects WAT and anything like WAT, so I may as well just focus on WAT, and my project will benefit indirectly).

The Extended Constant Expressions Proposal adds a few more instructions to constant expressions (basically, add, sub and mul for i32). This turns constant expressions into little routines that are executed/evaluated at runtime. While simple, these expressions can still contain errors, so they need to be debuggable.

You cannot map DWARF to the instructions inside constant expressions, as the offsets are relative to the Code Section, and constant expressions are not stored in the Code Section.

codefromthecrypt commented 2 years ago

@yurydelendik thanks for clearing the path for formalization of your work! So the next step seems to be adding your content as DWARF.md in https://github.com/WebAssembly/tool-conventions/, which implies converting it to markdown and also creating an images directory.

I could also offer to raise the PR converting the existing content to markdown. I think getting started in markdown, while a step back in formatting, can help harvest the conversations here into a repo where action can be taken (e.g. an issue to hash out the valid point about constant expressions).

@RReverser will Yuri or I have access to raise a PR to https://github.com/WebAssembly/tool-conventions/ without a W3C account? If so, that seems like the best start right? If there's paperwork around that, could one of you do this on his behalf? While you are at it, can you use Github's "move issue" feature to move this issue to the other repo as that's where the change will occur? cc @dtig

codefromthecrypt commented 2 years ago

Copying in @sbc100: as per https://github.com/WebAssembly/spec/issues/1428, I'm gathering there's a sense that compatibility is a solved or nearly solved topic here.

I am not trying to be problematic, but I think there's too much comfort in the status quo, where useful things tend not to be defined by the W3C standard and are often punted to third-party repos or left in issue cul-de-sacs like this.

When I first started in WebAssembly, conference promotions and such made it feel like there was some sort of staffing to maintain the spec, so that compatibility came by virtue of implementing it, as opposed to by virtue of looking at many non-standard repos, phases, or subordinate repositories.

I don't think people mean to create a very high barrier to entering this ecosystem, or are actively hoping there's only one viable impl. However, if specs are left abandoned or moved around to READMEs, things don't get easier.

This is the last unsolicited comment I'll make on spec repos about some root issues of abandonment or otherwise. If leadership desires more feedback about how compat or entrance into Wasm could be made easier, feel free to ask.

dschuff commented 2 years ago

Just to make sure I'm understanding you... You're saying that the fact that there's no real spec (and instead just a de facto standard based on one dominant implementation, along with some poorly-maintained almost-spec documents) is a problem, and increases the barrier to entry?

If so, I totally agree with you. (actually, even if that's not what you were trying to say, I agree with that statement 😅) From the Chrome side, we (primarily the Chrome devtools debugging team, but also the wasm toolchain team) have been working for quite a while on tuning the debugger implementation, in particular getting it to scale to large programs. This has involved some experimenting with different debug info options from the tools, and from my perspective I wasn't keen on trying to officially bless a standard until we were sure it was going to work well. But yeah I realize that has a cost for the broader ecosystem :(

The good news is that we've gotten it working pretty well, and I hope to put out some better developer documentation for debugging with emscripten and Chrome soon. And I do think we can go forward with basically what we have here as a spec. I'm still pretty behind, so I don't know exactly when I'd be able to work on it. But I'd be happy to help review if someone gets to it before I or someone else on our team does.

codefromthecrypt commented 2 years ago

@dschuff thanks for the consideration. Indeed, the dominant issue I've found is that sharing an implementation becomes the workaround for a gap in a spec, or even a substitute for accepting that a gap exists at all.

I'm not even really fussed about whether a "spec" is governed by W3C at this point; I just want some way to achieve portability without sharing one implementation. I will definitely look forward to reusing whatever you produce, even if limited to notes only.