Coredumps and multiple wasm modules

fitzgen commented 1 year ago

How are we supposed to represent coredumps when there are multiple modules involved? For example, when these two modules are linked together, B.g calls A.f, which traps:

(module $A
  (memory ...)
  (func (export "f")
    unreachable))

(module $B
  (import "" "f" (func))
  (memory ...)
  (func (export "g")
    call 0))

It is easy to create such examples via JS doing the linking on the Web or with (for example) wasmtime::Linker elsewhere.

It seems like we need a module name or something in each stack frame. But actually module name isn't precise enough because the same module could be instantiated multiple times. We need a way to identify instances in the store.

If each module has its own memory, we can encode that in the current format, assuming that the coredump Wasm module relying on multi-memory is acceptable. However there is still the question of how to map the coredump's memories back to the original instances. This also applies to globals.[^tables]

[^tables]: I guess coredumps capturing tables is a non-goal since they aren't mentioned in the coredumps document? Definitely trickier. There are things we could do, but I won't discuss them here. If anyone is interested, I can open a new issue.

To address these issues, I think that there should be additional coredump sections to establish module and instance index spaces:

coredump-modules ::= customsec(vec(coredump-module))
coredump-module ::= 0x0 module-name:name
                  | 0x1 module-bytes:module

coredump-instances ::= customsec(vec(coredump-instance))
coredump-instance ::= 0x0 module:u32 memories:vec(u32) globals:vec(u32)

Here, modules are defined either by name/URL/path or by bundling the whole Wasm module inline. I included the second option because it could be useful to avoid wrangling paths/URLs/etc. in some scenarios, but if we don't want that variant, we could just have names. Just throwing it out there.
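To make the name-only variant concrete, here is a rough sketch of how the `vec(coredump-module)` payload could be serialized. It assumes the usual Wasm binary conventions (unsigned LEB128 for `u32`, length-prefixed UTF-8 for `name`). The function names are hypothetical, the custom section header is left out, and the inline `0x1` variant is omitted because the grammar above does not say how the bundled module bytes would be delimited.

```rust
/// Append `value` as unsigned LEB128, the encoding Wasm uses for `u32`.
fn write_u32_leb128(out: &mut Vec<u8>, mut value: u32) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            // Last byte: continuation bit stays clear.
            out.push(byte);
            return;
        }
        // More bytes follow: set the continuation bit.
        out.push(byte | 0x80);
    }
}

/// Encode a `vec(coredump-module)` payload where every entry uses the
/// 0x0 (module-name) variant. Hypothetical helper, not part of any spec.
fn encode_module_names(names: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    // vec(...) is a u32 element count followed by the elements.
    write_u32_leb128(&mut out, names.len() as u32);
    for name in names {
        out.push(0x00); // tag for the module-name variant
        // name ::= vec(byte): byte length, then the UTF-8 bytes.
        write_u32_leb128(&mut out, name.len() as u32);
        out.extend_from_slice(name.as_bytes());
    }
    out
}
```

For the two-module example, `encode_module_names(&["a.wasm", "b.wasm"])` would give module index 0 to A and module index 1 to B; the file names are placeholders.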

Each coredump-instance describes what module it is an instance of (via index into the coredump modules index space), as well as which memories in the coredump Wasm file are its memories and which globals in the coredump Wasm file are its globals. That is, memory i in a coredump-instance is memories[coredump-instance.memories[i]] from the coredump wasm file's memory section, and similar for globals.
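As a sketch of that indirection, a consumer could resolve an instance-local memory index to an index into the coredump file's memory section like this. The struct and function names are hypothetical, and the example data assumes module 0 is A, module 1 is B, and B is instantiated twice.

```rust
/// Hypothetical in-memory form of one `coredump-instance` entry.
struct CoredumpInstance {
    /// Index into the coredump modules index space.
    module: u32,
    /// Instance-local memory index -> memory index in the coredump file.
    memories: Vec<u32>,
    /// Instance-local global index -> global index in the coredump file.
    globals: Vec<u32>,
}

/// Map memory `local_idx` of instance `instance_idx` to its index in the
/// coredump file's memory section. Returns `None` if either index is out
/// of range.
fn resolve_memory(
    instances: &[CoredumpInstance],
    instance_idx: u32,
    local_idx: u32,
) -> Option<u32> {
    let instance = instances.get(instance_idx as usize)?;
    instance.memories.get(local_idx as usize).copied()
}

/// Example data: one instance of A and two instances of B, each with a
/// single memory and no globals.
fn example_instances() -> Vec<CoredumpInstance> {
    vec![
        CoredumpInstance { module: 0, memories: vec![0], globals: vec![] },
        CoredumpInstance { module: 1, memories: vec![1], globals: vec![] },
        CoredumpInstance { module: 1, memories: vec![2], globals: vec![] },
    ]
}
```

With this data, memory 0 of the second B instance (instance index 2) resolves to memory 2 in the coredump file, which is exactly the distinction a module name alone cannot express. Globals would be resolved the same way through `globals`.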

And finally, frame productions inside the "corestack" section would grow an instanceidx:u32 identifier:

frame       ::= 0x0 instanceidx:u32 funcidx:u32 codeoffset:u32
                locals:vec(value) stack:vec(value)
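For illustration, a reader for the fixed-size part of such a frame might look roughly like the following. It only decodes the tag and the three `u32` fields, assuming unsigned LEB128 as elsewhere in the Wasm binary format; `locals` and `stack` are left undecoded since the `value` encoding is defined in the coredump document, not here. All names are hypothetical.

```rust
/// The leading fields of a proposed `frame` production.
#[derive(Debug, PartialEq)]
struct FrameHeader {
    instance_idx: u32,
    func_idx: u32,
    code_offset: u32,
}

/// Read one unsigned LEB128 `u32` starting at `*pos`, advancing `*pos`.
/// Returns `None` on truncated input or on a value that overflows 32 bits.
fn read_u32_leb128(bytes: &[u8], pos: &mut usize) -> Option<u32> {
    let mut result: u32 = 0;
    let mut shift = 0;
    loop {
        let byte = *bytes.get(*pos)?;
        *pos += 1;
        // The fifth byte may only contribute the top 4 bits and must
        // not have its continuation bit set.
        if shift == 28 && byte > 0x0f {
            return None;
        }
        result |= ((byte & 0x7f) as u32) << shift;
        if byte & 0x80 == 0 {
            return Some(result);
        }
        shift += 7;
    }
}

/// Decode the 0x0 tag plus instanceidx, funcidx, and codeoffset.
/// On success, also returns the offset at which `locals` begins.
fn decode_frame_header(bytes: &[u8]) -> Option<(FrameHeader, usize)> {
    let mut pos = 0;
    if *bytes.get(pos)? != 0x00 {
        return None; // unknown frame variant
    }
    pos += 1;
    let instance_idx = read_u32_leb128(bytes, &mut pos)?;
    let func_idx = read_u32_leb128(bytes, &mut pos)?;
    let code_offset = read_u32_leb128(bytes, &mut pos)?;
    Some((FrameHeader { instance_idx, func_idx, code_offset }, pos))
}
```

A debugger would then use `instance_idx` to look up the matching coredump-instance, and from there the module that `func_idx` and `code_offset` refer to.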

Thoughts?

cc @xtuc @itsrainy @dschuff @alexcrichton

fitzgen commented 1 year ago

If no one has objections, I can open a PR updating the spec to this effect.

xtuc commented 1 year ago

> coredump Wasm module relying on multi-memory is acceptable

I think that's acceptable. Wasm coredumps aren't meant to be valid Wasm modules, but reusing the Wasm module format makes them easier to decode.

> bundling the whole Wasm module inline

I'm not so sure we should do that. I'm concerned the coredumps are going to be big, maybe containing duplicated parts of Wasm modules.

Do you know what happens in a native environment with shared libraries? Otherwise, I think your proposal makes sense to me.

fitzgen commented 1 year ago

> Do you know what happens in a native environment with shared libraries?

I'm not familiar with the innards of core files on Linux, much less other platforms, but it seems like there is an entry for each VMA (very roughly equivalent to each linear memory in our situation), and then the PT_NOTE segment associates file information with each VMA (very roughly equivalent to Wasm instances in our situation), among other things.

At least, this is my understanding based on reading https://www.gabriel.urdhr.fr/2015/05/29/core-file/.

WebAssembly / tool-conventions

Coredumps and multiple wasm modules #204