Build ID Section for WASM

mitsuhiko commented 4 years ago

I originally brought this up in the design repo (https://github.com/WebAssembly/design/issues/1306) but I believe this fits here better.

For deferred symbolication on services like Sentry it would be nice to be able to match up DWARF debug information to the main WASM file by build ID. In ELF this is typically accomplished with the GNU build ID note, on Windows with the PDB signature and age, and on Darwin the Mach-O UUID fulfills that purpose.

I would love to see a build_id custom section that contains a 16 or 20 byte ID which tools would ensure remains in both WASM files (CODE, debug companion containing DWARF info) if they get split. Capping it at 16 bytes makes it possible to roundtrip this through breakpad, which uses a 16+4 byte char array for the debug id: 16 bytes for the PDB UUID + 4 bytes for the PDB age.

Motivation: Sentry and other systems like to be able to look up files by build ID because then they can access an external symbol server for that information. That way one just provides some sources where debug information can be found and then symbolicators just reach out to that service to find the debug information files.

kripken commented 4 years ago

Sounds good to me.

Would the build ID be modified by tools that process the wasm, say by Binaryen when it optimizes the binary? Or would it stay fixed after it's emitted from the original compiler? (The former seems to make sense, as optimizations change the binary, but then we'd need to describe those changes here I think.)

mitsuhiko commented 4 years ago

Definitely should change as the file changes.

Of note is that in the Microsoft ecosystem the age on the PDB signature (those extra 4 bytes) gets incremented with every transformation. In my experience this has made things more complicated in practice because it was not consistently changed everywhere. For instance the age is stored more than once in the PE format and actually comes out desynced from Microsoft's own tools.

I think it would be wiser to explicitly tell tools to always completely override the embedded ID if it goes through a transformation. This does mean you can't track back to the original ID of the originally created WASM file but I'm not sure if that is necessary in general.

Would be curious to hear though if there are some advantages of the pdb+age system on the Microsoft side.

sunfishcode commented 4 years ago

How important is it to have an explicit field for this, as opposed to just having tools compute a hash of a wasm binary to use as an effective build ID?

mitsuhiko commented 4 years ago

@sunfishcode since the ID needs to survive a stripping of the file, it's very important. With DWARF in place you normally want to separate out the object file into two: one that contains CODE and other sections necessary to run the code, a second one with the DWARF sections (.debug_frame etc.). Since stripping/splitting the files changes the file you cannot reproduce the ID after the split.

sunfishcode commented 4 years ago

Can tools just hash the contents of the main wasm sections then, and ignore debug info sections?

I don't have a strong opinion either way yet; I just want to understand the space.

mitsuhiko commented 4 years ago

To generate the build ID they could take the hash of the main wasm sections and store it in the file. They can alternatively just generate a random UUID and embed it. I do think though that a build ID should ideally always be embedded.

(This here describes the workflow where this information is particularly useful)
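
As a rough illustration of the "hash the main wasm sections" idea discussed above, here is a minimal sketch that walks the top-level sections of a module and hashes everything except custom sections (id 0). The function names are hypothetical, and FNV-1a is only a small stand-in for whatever hash a real tool would choose; nothing here is taken from an existing tool.

```rust
/// Reads an unsigned LEB128 value, returning (value, bytes consumed).
fn read_uleb(bytes: &[u8]) -> Option<(usize, usize)> {
    let mut value = 0u64;
    // Section sizes are u32 in wasm, so at most 5 LEB bytes are needed.
    for (i, &b) in bytes.iter().enumerate().take(5) {
        value |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value as usize, i + 1));
        }
    }
    None
}

/// FNV-1a step function (assumption: stand-in for a stronger hash).
fn fnv1a(hash: u64, data: &[u8]) -> u64 {
    data.iter()
        .fold(hash, |h, &b| (h ^ u64::from(b)).wrapping_mul(0x0000_0100_0000_01b3))
}

const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;

/// Hashes the id and body of every non-custom section of a wasm module.
/// Returns None if the input does not parse as a sequence of sections.
fn hash_non_custom_sections(module: &[u8]) -> Option<u64> {
    // 4-byte magic "\0asm" followed by a 4-byte version.
    if module.len() < 8 || &module[..4] != b"\0asm" {
        return None;
    }
    let mut hash = FNV_OFFSET;
    let mut pos = 8;
    while pos < module.len() {
        let id = module[pos];
        let (size, n) = read_uleb(&module[pos + 1..])?;
        let start = pos + 1 + n;
        let end = start.checked_add(size)?;
        let body = module.get(start..end)?;
        if id != 0 {
            // Custom sections (debug info, names, build_id) are skipped.
            hash = fnv1a(hash, &[id]);
            hash = fnv1a(hash, body);
        }
        pos = end;
    }
    Some(hash)
}
```

With this approach the value stays the same when custom sections are stripped or appended, and changes when any other section changes.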

sunfishcode commented 4 years ago

I'm curious about what situations storing a Build ID in the file is better than computing a hash on demand whenever it's needed.

Naively, computing it on demand would seem to have several advantages:

tools can pick an appropriately strong hash function for their own use case
tools can decide which sections to hash and which to ignore for their own use case
tools don't have to worry about other tools stripping or omitting the Build ID
tools can trust that no other tools have modified the binary without updating the Build ID
tools can trust that the Build ID is not maliciously crafted to cause collisions, if that's important
tools don't have to worry about the length of the Build ID varying

Jake-Shadle commented 4 years ago

FWIW Breakpad supports the embedded, format specific identifiers, that @mitsuhiko mentioned, but if they aren't available for any reason, it falls back to computing an md5 of the first 1024 bytes of the TEXT section (or equivalent).

You bring up some good points about the advantages of computing the build ID, but to me the point of the Build ID is to precisely identify a particular build so that different tools can always pair the code with the debug information. Allowing tools to choose their own hash function or which sections to hash brings up problems when tools need to communicate with each other, e.g. between a debugger and a symbol store that use different hash functions.

So storing the Build ID does have some disadvantages, particularly when a tool does a transformation that doesn't also change the Build ID, but I'm much more concerned with tools having a consistent source of truth.

sunfishcode commented 4 years ago

I've now found this stackoverflow post which I found helpful. The Build ID isn't just a hash of the contents; it's something like a hash of the contents and the debug info together, which is then recorded and preserved, even if debug info is stripped. As such, it can't always be recomputed.

There are a lot of use cases other than debug info that would seem to want something like a Build ID, but what they need is something subtly different from what the Build ID actually is. So, brainstorming here, what if we do have a Build ID section, but call it the "Debug Info ID", and say:

Compilers producing wasm without debug info don't need to include a Debug Info ID.
Post-processing tools could decide whether to strip the Debug Info ID based on whether their transformation would invalidate any associated debug info.

Would that make sense?

mitsuhiko commented 4 years ago

I think it's fair to specifically call this a debug_id and record that it's useful for that purpose.

The hashing fallback path of breakpad has caused more issues than it solved so I would prefer we don't spec out something like this.

sbc100 commented 4 years ago

Would this build ID be generated at the point when the debug info is split out (either by the linker, or some kind of post link debug-splitting tool)? Or would it be present even in binaries that still have their debug info embedded?

mitsuhiko commented 4 years ago

@sbc100 definitely already in binaries that have the debug info embedded. We for instance have lots of cases where we want to symbolicate stacktraces where the client just submitted instruction addresses and then people upload the entire binary with debug information included.

This is especially important normally when doing stack unwinding out of memory dumps. This obviously is less useful for wasm right now, but in terms of existing work flows having the debug ID even in unstripped binaries has been very valuable.

RReverser commented 4 years ago

Just as a counter-point, one downside of an embedded id seems to be precisely that it would usually survive destructive operations on the code.

That is, if code is post-processed by a tool similar to wasm-opt or wasm-bindgen, and if that tool can't correctly update DWARF information, then the build id would remain the same even though the code has changed and no longer matches the debug info. In this case you as a consumer (Sentry or otherwise) explicitly don't want such debug info to be matched and used.

Arguably, every such tool should either support DWARF or be able to at least change build ID to some new unique value, but it seems that hashing of code section would alleviate this concern even more naturally.

mitsuhiko commented 3 years ago

Since we're adding WASM DWARF support at Sentry at the moment we might be going ahead and require customers to embed a build_id custom section into their files for now.

RReverser commented 3 years ago

@mitsuhiko Does the "hash of the code section" idea not work for you?

mitsuhiko commented 3 years ago

@RReverser So far I have not defined how the build_id section is to be computed. However, since the code section is inaccessible from within JavaScript but custom sections are available, I cannot compute it on demand. So a user of ours can either compute the build_id by hashing the code section or alternatively just embed a random UUID; either way the result from our perspective is the same.

For what it's worth embedding a random build_id is easier to accomplish with the existing rust toolchain as it can be accomplished with #[link_section] on a static byte literal whereas making it a hash requires injecting the custom build section after the fact. I was attempting to do this with walrus but unfortunately that appears to do something nasty with the DWARF data in the WASM file currently.

RReverser commented 3 years ago

> I was attempting to do this with walrus but unfortunately that appears to do something nasty with the DWARF data in the WASM file currently.

Yeah, walrus is a high-level IR and, as such, rewrites even the code you didn't touch, which, in turn, affects debug offsets. You need a lower-level representation instead, e.g. [shameless plug] you can try my wasmbin library which was created with similar use-cases in mind. https://github.com/GoogleChromeLabs/wasmbin

RReverser commented 3 years ago

I've pushed an example for random build_id (based on UUID v4) here: https://github.com/GoogleChromeLabs/wasmbin/blob/build_id/examples/build_id.rs

You'll probably want to extend it to be more robust (e.g. add detection of existing build_id section), but it works and attaches a section successfully.

mitsuhiko commented 3 years ago

Oh this is neat. Going to use this.

RReverser commented 3 years ago

Come to think of it, due to the nature of Wasm binary format, if you didn't want to check for presence of existing build_id, you could even literally append bytes representing the custom section to the end of the file:

use std::fs::OpenOptions;
use std::io::Write;

// Requires the `uuid` crate with its `v4` feature enabled.
fn main() -> std::io::Result<()> {
    let filename = std::env::args()
        .nth(1)
        .expect("Provide a filename as an argument");
    let mut f = OpenOptions::new().append(true).open(filename)?;
    f.write_all(&[
        // Custom section (id=0)
        0x00,
        // Length of payload (length of length of name + length of name + length of UUID)
        1 + 8 + 16,
        // Length of name
        8,
    ])?;
    f.write_all("build_id".as_bytes())?;
    f.write_all(uuid::Uuid::new_v4().as_bytes())?;
    Ok(())
}

Won't save too much in terms of perf and the code won't be as clean, but hey, it's possible in case you want to avoid any dependencies altogether and make a tiny util :)
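
For the opposite case, where a tool does want to detect an existing build_id before appending one, a minimal sketch of the lookup could look like the following. It uses only the standard library; `find_custom_section` is a hypothetical helper, not part of wasmbin or any other library mentioned here, and it assumes the section payload is whatever follows the section name.

```rust
/// Reads an unsigned LEB128 value, returning (value, bytes consumed).
fn read_uleb(bytes: &[u8]) -> Option<(usize, usize)> {
    let mut value = 0u64;
    for (i, &b) in bytes.iter().enumerate().take(5) {
        value |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value as usize, i + 1));
        }
    }
    None
}

/// Returns the payload (the bytes after the name) of the first custom
/// section called `wanted`, or None if there is no such section or the
/// module cannot be parsed.
fn find_custom_section<'a>(module: &'a [u8], wanted: &str) -> Option<&'a [u8]> {
    // 4-byte magic "\0asm" followed by a 4-byte version.
    if module.len() < 8 || &module[..4] != b"\0asm" {
        return None;
    }
    let mut pos = 8;
    while pos < module.len() {
        let id = module[pos];
        let (size, n) = read_uleb(&module[pos + 1..])?;
        let start = pos + 1 + n;
        let end = start.checked_add(size)?;
        let body = module.get(start..end)?;
        if id == 0 {
            // Custom section: a length-prefixed name, then the payload.
            let (name_len, m) = read_uleb(body)?;
            let name_end = m.checked_add(name_len)?;
            let name = body.get(m..name_end)?;
            if name == wanted.as_bytes() {
                return Some(&body[name_end..]);
            }
        }
        pos = end;
    }
    None
}
```

A tool could call `find_custom_section(&bytes, "build_id")` and only append a new section when the result is `None`.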

mitsuhiko commented 3 years ago

I extended your tool into one that does not override existing build IDs and also splits the file into two: https://github.com/getsentry/symbolicator/pull/303

dschuff commented 3 years ago

I think we should pick this up and add support to LLVM/emscripten to make this easier. Is this a correct summary of people's current thoughts/current usage?

1) We are thinking of build_id similar to ELF, in that it reflects the semantics/origin/sources of the program and therefore:
a) it conceptually includes the debug info
b) It should be changed (dropped?) by any tool that modifies only the code (since that would invalidate the debug info). @sunfishcode says above that any transformation that wouldn't invalidate debug info wouldn't need to rewrite the ID. That makes sense to me, although the set of such possible transformations seems rather small.
c) It should be changed by any tool that modifies the code and updates the debug info (this is kind of a funny thing to say for an optimizer that isn't supposed to change the semantics of the program, but I think it's correct because of course changing the code will change function indexes, section/module offsets, etc)
2) Current tools (other than emscripten) just add a section called build_id with a random UUID

@sunfishcode also suggests above that tools not write a build ID if they don't generate debug info. I don't really see the harm either way; a wasm file that never had debug info will be indistinguishable from one that had debug info stripped out. Thinking about this some more: there is no practical way to tell whether a file has been modified incorrectly (i.e. rewriting the code section but failing to change the build id), or modified at all. In other words, if it is known that the build ID is e.g. a hash of all the known sections plus specified debug info sections, a tool could verify that a wasm file with debug info (or one that never had debug info) hasn't been modified, but won't be able to infer anything from a file with no debug info and an "incorrect" hash, since it can't tell whether a file previously had debug info or not. (Unless we also embed some kind of indication in the build id or otherwise in the wasm, that there was previously debug info, and ask tools not to strip that out. Not sure if that's worth it or not).

If we specify that (or even just implement the linker such that) the build id is a hash of some file contents, that would slow down linking, so we'd want to get some benefit in return for it.

dschuff commented 3 years ago

/cc @walkingeyerobot @trybka

trybka commented 3 years ago

I don't think we want a random UUID for build_id in (2), do we? Ideally the same inputs should generate the same outputs, including build_id -- remote builds care a lot about this kind of reproducibility.

dschuff commented 3 years ago

yeah build determinism is a good point, LLVM and emscripten should definitely have that, even if other tools might not care. GNU ld and ELF lld actually have all three options (hashing sections, picking a random UUID, and using a value specified on the command line). I guess that probably means we need to hash all of the sections that LLVM produces by default, including:
1) all of the known sections
2) all of the debuginfo sections
3) name section, on the same grounds that the debuginfo sections are included

... Actually, Looking at ELF lld's implementation, maybe we just want to hash the entire output file.

RReverser commented 3 years ago

> all of the known sections

Is that really necessary? E.g. if some bundler-like tool changed import paths in the Wasm module or mangled export names, this doesn't affect debug offsets in any way so it seems useful to allow the resulting Wasm to still work with the original debug info.

RReverser commented 3 years ago

Although, I suppose we can always start with hashing the entire file and loosen it up later down the road if deemed useful.

dschuff commented 3 years ago

I think any tool that modifies the binary after link would probably have to make a case-by-case decision on whether to modify the build id, no? Some of those changes (e.g. export names, IIRC?) could affect how the engine presents stack traces to the developer, and some wouldn't. And the user might want to store the pre-mangled version or the post-mangled version. But I think the tool would have the option to leave the build ID in place. The hashes would then no longer match, but that's only a problem if you want to add an extra verification step.

RReverser commented 3 years ago

> But I think the tool would have the option to leave the build ID in place. The hashes would then no longer match, but that's only a problem if you want to add an extra verification step.

Ah, that's true that it can be left up to the tools. In that case full-file hash seems perfectly fine.

dschuff commented 3 years ago

https://reviews.llvm.org/D107662 is a prototype patch against LLD for adding a build ID. It implements the same features that the ELF version does; namely, that the build ID can be one of:
a) a hash of the entire object file, including debug info (one of sha1, md5 or xxhash)
b) a generated UUID
c) an arbitrary hex value specified on the command line

It's pretty straightforward, I guess the interesting thing is what downstream tools do, as discussed above.

Currently it's nothing more than a custom section named build_id that contains a wasm-string (length-prefixed) with the value. I wonder if it would be worthwhile to encode the hash type too?

Actually it just occurred to me that wasm strings have to be valid UTF-8, right? So we can't just call it a wasm string, maybe we'll just have to make it the equivalent length + arbitrary bytes.
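
To make the "length + arbitrary bytes" layout concrete, here is a minimal sketch of an encoder. It assumes the payload is the section name followed by a ULEB128 length and the raw ID bytes; that layout is an assumption drawn from this comment, and the function names are hypothetical.

```rust
/// Appends `value` to `out` as unsigned LEB128.
fn write_uleb(out: &mut Vec<u8>, mut value: u32) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        // More bytes follow, so set the continuation bit.
        out.push(byte | 0x80);
    }
}

/// Encodes a custom section named "build_id" whose contents are a
/// length-prefixed run of arbitrary bytes (no UTF-8 requirement).
fn encode_build_id_section(id: &[u8]) -> Vec<u8> {
    let name = b"build_id";

    // Payload: name length, name, ID length, ID bytes.
    let mut payload = Vec::new();
    write_uleb(&mut payload, name.len() as u32);
    payload.extend_from_slice(name);
    write_uleb(&mut payload, id.len() as u32);
    payload.extend_from_slice(id);

    // Section: id 0 (custom), payload size, payload.
    let mut section = vec![0x00];
    write_uleb(&mut section, payload.len() as u32);
    section.extend_from_slice(&payload);
    section
}
```

For a 16-byte ID this produces a 28-byte section: one byte of section id, one byte of size, and a 26-byte payload.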

sunfishcode commented 3 years ago

There was earlier discussion considering calling this a debug_id; does that still make sense?

dschuff commented 3 years ago

I don't necessarily have a problem referring to it as a debug ID. However, since build systems and/or users may already know about the --build-id flag, it seems like it would be unfortunate if that flag didn't work and they had to discover the existence of and use a differently-named flag that did the same thing for the same purpose as the one they already knew about.

mitsuhiko commented 3 years ago

Personally I think given a choice between considering this to be a property of a build vs a property of the debug info, I would prefer the former. While its primary use obviously is to identify the debug info, it's also generally used to target other information. As an example from the ELF world we don't just use this information to find the ELF debug files, but also the binaries to access the unwinding information.

dschuff commented 3 years ago

We talked about this a little more, and given that we will probably just support the same functionality as ELF, it makes sense to call it the same name. But we should maybe have documentation here in tool-conventions about what it is and what it's for, perhaps with guidance or use case examples for tool authors who might modify the binary after link.

bkotsopoulossc commented 2 years ago

What are the next steps here?

dschuff commented 2 years ago

I think we have agreement in principle, someone just needs to finish the implementation. I actually started that a while back (https://reviews.llvm.org/D107662) but didn't finish. I can hopefully find time to get back to it again; or, if someone else wants it ASAP and is interested in picking it up and sanding off the rough edges, I'm sure Sam and/or I would be happy to review it and commit the result when it's done.

bkotsopoulossc commented 2 years ago

@dschuff awesome thanks. Do you think it would make sense to start with a spec readme first, like https://github.com/WebAssembly/tool-conventions/blob/08bacbed7d0daff49808370cd93b6a6f0c962d76/Debugging.md? Just so there's agreement on how this is laid out?

Also do you have any pointers on what specifically needs to be done with that llvm change? I don't see any comments or TODOs

dschuff commented 2 years ago

Ah, looking back at that code, there are a couple of details about the format of the section itself: In particular, ELF build IDs support several different kinds of hashes: "fast", MD5, random UUID, SHA-1, and arbitrary user-supplied hex string. I can see use cases for several of those, and it would be very straightforward to support all of them in lld. Is there any reason not to?

Then there's the format of the section itself. The most straightforward encoding would be

a ULEB field designating the hash type (assuming we ever support more than 1)
a length-prefixed wasm string field containing the hash itself. Although IIRC the "standard" wasm binary-format strings need to be UTF-8 and come to think of it I'm not sure if build-id hashes are strings, or just arbitrary binary data. So we'd have to figure that out.

dschuff commented 2 years ago

(sorry we raced). Yes, a tool-conventions doc like that one would be perfect, to specify the section's format. As for the LLVM change, IIRC I tried it out on a simple case and it seemed to work; the main thing it needs is tests (e.g. for different hash types, and maybe use of the feature in conjunction with other linker features such as synthetic sections and relocatable output). Also, the way it works (by writing a placeholder during the normal synthetic-section generation phase, and then writing the real hash in a special phase at the very end) seems slightly ugly to me, but I don't know of a better way to do it; maybe @sbc100 would have an opinion on that.
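
The placeholder-then-patch flow described above can be sketched roughly as follows. This is not the lld implementation: the function names are hypothetical, FNV-1a stands in for the real hash, the ID is fixed at 8 bytes, and the section layout (name, then a length-prefixed ID) is an assumption.

```rust
const ID_LEN: usize = 8;

/// FNV-1a over a byte slice (assumption: stand-in for the real hash).
fn fnv1a(data: &[u8]) -> u64 {
    data.iter().fold(0xcbf2_9ce4_8422_2325u64, |h, &b| {
        (h ^ u64::from(b)).wrapping_mul(0x0000_0100_0000_01b3)
    })
}

/// Phase 1: emit the build_id section with a zeroed placeholder and
/// return the offset at which the real ID must later be written.
fn append_placeholder(out: &mut Vec<u8>) -> usize {
    let name = b"build_id";
    out.push(0x00); // custom section id
    // Payload size: name length byte + name + ID length byte + ID.
    // All values here are below 128, so each fits in one LEB byte.
    out.push((1 + name.len() + 1 + ID_LEN) as u8);
    out.push(name.len() as u8);
    out.extend_from_slice(name);
    out.push(ID_LEN as u8);
    let offset = out.len();
    out.extend_from_slice(&[0u8; ID_LEN]);
    offset
}

/// Phase 2: once every other section is final, hash the whole output
/// (placeholder still zeroed) and patch the ID in place.
fn finalize_build_id(out: &mut [u8], offset: usize) {
    let hash = fnv1a(out);
    out[offset..offset + ID_LEN].copy_from_slice(&hash.to_le_bytes());
}
```

Because the hash is taken while the placeholder is still zeroed, the same input bytes always produce the same ID, and patching does not change the output length.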

sbc100 commented 2 years ago

That approach seems reasonable to me. I guess this is not unlike relocation entries which get written with placeholders and then updated. The difference here is that we would obviously need to wait until all other sections have been written since we would be hashing their final content.

bkotsopoulossc commented 2 years ago

Thanks for the extra details. Some thoughts (with the caveat that I am not familiar with other conventions here from ELF or other formats):

If we do encode the hash, maybe just a uint8 that is essentially an enum, like the tag type here
I'm in favour of having arbitrary binary data and not having to think of this as a string, as it feels more generic
Maybe having the hash type could be useful to be able to parse or interpret the build ID data in some way. But maybe it's sufficient for all consumers to just treat it as opaque binary data, and not have to care how it was hashed.
I do wonder if it's worth fleshing out all of these different types of hashes now, or just start with a strawman that is extensible to different types in the future. For example, it sounds like starting with just a random ID avoids some of the ambiguity in llvm around placeholders and such. To me, getting this into the spec and the binary format is a big win - adding more options around how the build ID is generated could be done later

dschuff commented 2 years ago

Yes, an enum is what I had in mind. If there are < 128 values, a uint8 is the same as an LEB so it doesn't really matter what we call it.
Yes, it does look like binary data. Actually I should have reread this thread because it's the same conclusion we came to already above 🤣
Regarding the hash type and what tools do with it, there's also discussion of that above. Per that discussion I think the primary/default build ID type at least for LLVM needs to be deterministic (so, a hash rather than a random ID). As discussed above, tools that modify binaries will probably have to make a case-by-case decision about whether to modify the build ID too. But it might be useful for them to know that hash type in that case? Maybe also any tool that wants to print the build ID might want to know the type (so it could e.g. format UUIDs differently from hashes)? Maybe it's not useful to distinguish different hash types though?

bkotsopoulossc commented 2 years ago

Ahh yeah I guess random is problematic when it comes to reproducible builds. Maybe the user-supplied string is an easy one to start with then? The idea of supporting the various different types sounds great but I'm just seeing if we can scope this down a bit so it's easier to make progress on.

mitsuhiko commented 2 years ago

As for hash format there is probably quite some flexibility here, but traditionally the limitations often came from the intention to support some form of breakpad compatibility. The default debug id field has space for a UUID/GUID + 4 bytes as u32 (the age field). Since Mach-O selects a UUID for the hash and PDB uses this UUID + 32-bit age, it's probably not a bad idea to encourage tools to emit a reproducible UUID (v3 or v5) as build ID. That has the highest form of compatibility.

Knowing which exact type of a build ID something is has not been useful in our experience.

(For additional context this is the abstraction we use for what we call breakpad-compatible debug ids: https://docs.rs/debugid/0.7.2/debugid/struct.DebugId.html. Any GNU build ID longer than 16 bytes is chopped off and an age of 0 is always used. We then use the original GNU build ID as secondary information for debug file lookup. Our symbol server lookup strategies are documented here: https://getsentry.github.io/symbolicator/advanced/symbol-lookup/)
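
The truncation rule described above can be sketched in a few lines. This does not reproduce the exact behavior of the debugid crate (for example, any byte reordering it may apply); `breakpad_compatible` is a hypothetical helper, and zero-padding IDs shorter than 16 bytes is an assumption made only for this sketch.

```rust
/// Reduces an arbitrary-length build ID to a 16-byte identifier plus a
/// 32-bit age, following the rule "chop off after 16 bytes, age is 0".
fn breakpad_compatible(build_id: &[u8]) -> ([u8; 16], u32) {
    let mut uuid = [0u8; 16];
    // Keep at most the first 16 bytes; shorter IDs are zero-padded
    // (assumption, not stated in the thread).
    let n = build_id.len().min(16);
    uuid[..n].copy_from_slice(&build_id[..n]);
    (uuid, 0)
}
```

A 20-byte SHA-1 style build ID therefore loses its last 4 bytes, which is why keeping the original ID around as secondary lookup information is still useful.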

dschuff commented 2 years ago

Sorry I've sat on this so long. Let's finally get it done. I uploaded #183 which I think captures what we've discussed here. After hearing @mitsuhiko's experience that knowing the exact type of ID isn't useful (and not being able to think of any use myself) I decided to just leave it out of the encoding.

dschuff commented 2 years ago

Also I just realized that I didn't take @mitsuhiko's advice and encourage a reproducible UUID as the output (or implement one in lld in https://reviews.llvm.org/D107662); instead I went with the same default lld uses for ELF (which is actually just an 8-byte "fast" hash). Do you think that's compatible "enough" or should we invent something new in lld?

dschuff commented 2 years ago

@mitsuhiko I guess a followup question: if I were to make lld generate a v5 UUID (based on a hash of the contents), what would I use as the "namespace" UUID to go with it?

bkotsopoulossc commented 2 years ago

Would it be reasonable to just generate a random UUID once and bake it into the llvm code, as an "llvm namespace"?

mitsuhiko commented 1 year ago

@dschuff about the namespace it probably doesn't matter. You can probably hardcode a random ID and just use that consistently and document it. I don't have any expectations that there is a tool independent way of generating the same reproducible IDs. It's more important that the tool itself has some stability.

dschuff commented 1 year ago

I updated the prototype in https://reviews.llvm.org/D107662. It supports several different styles for compatibility (mostly the same ones as ELF). The default style ("fast" aka "tree") hashes the contents of the output and (unlike ELF) generates a v5 UUID based on the hash (using a random namespace). It also supports generating a random v4 UUID, a sha1 hash, and a user-specified string (as ELF does).
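
For readers unfamiliar with how a hash becomes a v5-style UUID, the final step can be sketched as below. In RFC 4122 version 5 the digest is the SHA-1 of the namespace UUID followed by the name; this sketch assumes that 20-byte digest has already been computed elsewhere and only shows the bit-stamping. It is not taken from the lld patch, and `uuid_v5_from_digest` is a hypothetical name.

```rust
/// Turns a 20-byte digest into a 16-byte UUID with the version field
/// set to 5 and the RFC 4122 variant bits set.
fn uuid_v5_from_digest(digest: &[u8; 20]) -> [u8; 16] {
    let mut uuid = [0u8; 16];
    // Only the first 16 bytes of the digest are used.
    uuid.copy_from_slice(&digest[..16]);
    // High nibble of byte 6 carries the version (5).
    uuid[6] = (uuid[6] & 0x0f) | 0x50;
    // Top two bits of byte 8 carry the variant (binary 10).
    uuid[8] = (uuid[8] & 0x3f) | 0x80;
    uuid
}
```

Since every step is deterministic, the same namespace and contents always yield the same ID, which is what reproducible builds need.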

WebAssembly / tool-conventions

Build ID Section for WASM #133