Support for Wasm Coredump

xtuc commented 1 year ago

Feature

When a Wasm instance traps, it is sometimes difficult to understand what happened. Post-mortem debugging using coredumps (which is used extensively in native environments) would help with investigating and fixing crashes.

Wasm coredumps are especially useful in serverless environments, where production binaries are stripped and/or have limited access to logging.

Implementation

Implement Wasm coredumps as specified by https://github.com/WebAssembly/tool-conventions/blob/main/Coredump.md. Note that the spec is early and subject to changes. Feedback very welcome!

cc @fitzgen

bjorn3 commented 1 year ago

Reading the linear memory after a crash is already possible. As for getting the locals and stack values, this is much more complicated. Wasmtime uses the Cranelift optimizing compiler, which can eliminate locals and stack values entirely and leaves those that remain at whichever location it likes. It would be necessary to somehow prevent optimizing locals away, at least at points where a trap could happen. There is debugger support for getting the location of locals and stack values which aren't optimized away in order to generate debuginfo, but I'm not sure if it is 100% accurate. By the way, https://github.com/bytecodealliance/wasmtime/issues/5537 is somewhat relevant to this.

xtuc commented 1 year ago

I don't think Wasm coredump should prevent optimizations, given that ideally it's enabled by default.

It's not uncommon to see coredumps in native environments with missing values because they were optimized away. They are usually not very helpful for debugging.

bjorn3 commented 1 year ago

The wasm coredump format doesn't seem to allow omitting values that are optimized away, but if it is allowed, then it should be possible to implement without too many changes to Cranelift. I think it would need some changes to the unwind table generation code to store the location of callee-saved registers, but that will need to be done anyway for handling exceptions. After that I guess it would be a matter of telling Cranelift to generate debuginfo and then, during a crash, unwinding the stack and recording all preserved locals and stack values for every frame from Wasmtime.

xtuc commented 1 year ago

> The wasm coredump format doesn't seem to allow omitting values that are optimized away

Correct, at the moment it doesn't. I'm going to add it, thanks for your input!

jameysharp commented 1 year ago

This is an area I haven't dug into much, but doesn't Cranelift's support for GC already support tracking the information we need for this? I think we would need to mark potentially-trapping instructions as "safe points" and then request stack maps from Cranelift. And my impression was that calls are already considered safe points. But this is all conjecture based on a CVE that I was peripherally paying attention to last year, so I could have it all wrong.

fitzgen commented 1 year ago

> This is an area I haven't dug into much, but doesn't Cranelift's support for GC already support tracking the information we need for this? I think we would need to mark potentially-trapping instructions as "safe points" and then request stack maps from Cranelift. And my impression was that calls are already considered safe points. But this is all conjecture based on a CVE that I was peripherally paying attention to last year, so I could have it all wrong.

Stack maps only track reference values (r32/r64), and only say which stack slots have live references in them. They do not supply any kind of info to help tie that back to wasm locals or even clif SSA variables.

I don't think we would want to use stack maps for this stuff.

cfallin commented 1 year ago

On the flip-side, if you're proposing altering the generated code to assist debugging observability @jameysharp, there is a large design space that we haven't really explored. A relatively simple change would be to define a pseudoinstruction that takes all locals as inputs, with "any" constraints to regalloc (stack slot or register), and insert these wherever a crash could happen. This "state snapshot" instruction would then guarantee observability of all values, at the cost of hindering optimization.

This goes somewhat against the "don't alter what you're observing" principle that is common in debug infrastructure but I'll note that we do already have some hacks to keep important values alive (in this case, the vmctx, which makes all other wasm state reachable) for the whole function body.

There's also the "recovery instruction" approach, used in IonMonkey at least: whenever a value is optimized out, generate a side-sequence of instructions that can recompute it. That's a much larger compiler-infrastructure undertaking but in principle we could do it, if perfect debug observability were a goal.
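To make the trade-off of the "state snapshot" pseudoinstruction concrete, here is a minimal sketch on a toy straight-line IR. Nothing here is Cranelift's actual IR or API: `Inst`, `live_before`, and `with_snapshots` are hypothetical names, and the liveness scan ignores control flow. It only shows that a pseudo-op which uses every local keeps all of them live at a trapping instruction, which is exactly what blocks the optimizer from discarding them.

```python
from __future__ import annotations

from typing import NamedTuple, Optional


class Inst(NamedTuple):
    dest: Optional[str]      # value defined by this instruction, if any
    op: str                  # opcode name (toy IR, not Cranelift's)
    uses: tuple[str, ...]    # values read by this instruction


def live_before(insts: list[Inst], index: int) -> set[str]:
    """Values live immediately before insts[index], via a backward scan."""
    live: set[str] = set()
    for inst in reversed(insts[index:]):
        live.discard(inst.dest)   # a definition ends the live range
        live.update(inst.uses)    # a use keeps the value alive
    return live


def with_snapshots(insts: list[Inst], all_locals: tuple[str, ...],
                   trapping_ops: set[str]) -> list[Inst]:
    """Insert a 'snapshot' pseudo-op using every local before each trapping op."""
    out: list[Inst] = []
    for inst in insts:
        if inst.op in trapping_ops:
            out.append(Inst(None, "snapshot", all_locals))
        out.append(inst)
    return out


program = [
    Inst("a", "const", ()),
    Inst("b", "const", ()),
    Inst("c", "add", ("a", "b")),
    Inst("d", "load", ("c",)),     # may trap
    Inst(None, "return", ("d",)),
]

# Without the pseudo-op, only `c` is live at the load; `a` and `b` may be gone.
print(live_before(program, 3))

# With it, all three locals are live at the potential trap.
instrumented = with_snapshots(program, ("a", "b", "c"), {"load"})
print(live_before(instrumented, 3))
```

In a real compiler the extra uses would lengthen live ranges and increase register pressure, which is the optimization cost mentioned above.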

xtuc commented 1 year ago

https://github.com/WebAssembly/tool-conventions/issues/198 has been closed. The coredump format now allows marking local/stack values as missing.
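As an illustration of what "missing" means for a consumer, here is a small sketch that encodes and decodes a list of i32 locals where some were optimized away. The tag bytes (`0x01` for a missing value, `0x7F` followed by a signed LEB128 payload for an i32) are my recollection of the spec and should be treated as assumptions; check Coredump.md before relying on them. The function names are hypothetical.

```python
from __future__ import annotations

from typing import Optional

MISSING_TAG = 0x01  # assumed tag for "value was optimized away"
I32_TAG = 0x7F      # assumed tag for an i32 value


def encode_sleb128(value: int) -> bytes:
    """Encode a signed integer as LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        sign_bit = byte & 0x40
        done = (value == 0 and not sign_bit) or (value == -1 and sign_bit)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)


def decode_sleb128(data: bytes, pos: int) -> tuple[int, int]:
    """Decode a signed LEB128 integer; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80 == 0:
            if byte & 0x40:
                result -= 1 << shift  # sign-extend
            return result, pos


def encode_locals(values: list[Optional[int]]) -> bytes:
    """Encode i32 locals; None stands for a value that is missing."""
    out = bytearray()
    for value in values:
        if value is None:
            out.append(MISSING_TAG)
        else:
            out.append(I32_TAG)
            out += encode_sleb128(value)
    return bytes(out)


def decode_locals(data: bytes) -> list[Optional[int]]:
    """Inverse of encode_locals."""
    values: list[Optional[int]] = []
    pos = 0
    while pos < len(data):
        tag = data[pos]
        pos += 1
        if tag == MISSING_TAG:
            values.append(None)
        elif tag == I32_TAG:
            value, pos = decode_sleb128(data, pos)
            values.append(value)
        else:
            raise ValueError(f"unsupported value tag {tag:#x}")
    return values
```

A debugger reading such a frame would then show the `None` entries as "optimized out", much like a native debugger does.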

xtuc commented 1 year ago

I made a change to add initial/basic coredump generation: https://github.com/bytecodealliance/wasmtime/pull/5868. Could you please have a look and let me know if this is the right direction? It uses WasmBacktrace for information about frames.

xtuc commented 1 year ago

Basic coredump generation has been merged (thanks!).

Now, to have the complete debugger experience, we need to collect the following information:

- Wasm locals of each stack frame
- A snapshot of the Wasm linear memory (this sounds relatively easy, although it's not clear to me where the coredump code should live).

RyanTorok commented 1 year ago

Is there a chance we could revive this thread? I'm working on cloud infrastructure research, and being able to take a stack snapshot in wasmtime would allow us to get some sophisticated cold-start optimizations for Function-as-a-Service (FaaS) functions.

There has been a plethora of academic papers published about using execution snapshots to speed up the cold-start (startup) time in FaaS, especially when heavyweight VMs are involved. Starting up a Module in wasmtime tends to be faster than VMs by 2-3 orders of magnitude, but recent papers have also explored how to snapshot the state of the function after some initialization runs, which has a lot in common with what Wizer does.

I am trying to extend this idea with a construction called Nondeterministic Generators, which will allow FaaS functions to be snapshotted at any point in the execution. Generators rely on the observation that functions whose execution has not performed any invocation-specific computation (i.e. anything using the function arguments or any nondeterministic functions imported from the host) can be unconditionally snapshotted and used to fast-forward future invocations of the same function.

In addition, we can create conditional snapshots that let application developers optimize for common patterns, such as functions that want to check that their arguments are valid before they perform their expensive initialization, which traditional "init function"-based cold-start speedup techniques cannot optimize without breaking the function semantics if the invocation-specific invariant is violated (e.g. our argument validation fails).

I was looking into Wizer quite a bit and the design decisions it makes, and I was hoping to get some insight about the requirements Wizer lists on its docs.rs page, under "Caveats":

> The initialization function may not call any imported functions. Doing so will trigger a trap and wizer will exit.

Is this just a lint against the produced module being potentially non-portable (the snapshot would rely on the outcome of a particular host's implementation of the imported function), or is there a more fundamental reason this is not possible? I imagine my generator design having the potential to snapshot any time just before a generator is polled (polling calls an import function, so the host can record the outcome of the generator function), which would necessitate snapshotting after code that has already called into the host at least once if we have multiple generators.

> The Wasm module may not import globals, tables, or memories.

I don't anticipate the application code running on my system to need any of these, but I'd like some clarification about why this applies to the entire module and not just the init function, like for host functions.

> Reference types are not supported yet. This is tricky because it would allow the Wasm module to mutate tables, and we would need to be able to snapshot the new table state, but funcrefs and externrefs don’t have identity and aren’t comparable in the Wasm spec, which makes snapshotting difficult.

This makes sense. Application code in my system should not need to use these.

More fundamentally, the major roadblock to my design working with WebAssembly modules is wasmtime's current inability to snapshot the WebAssembly stack. Since my design allows the execution to snapshot at any point, not just after some initialization function runs (as Wizer supports), my design would require all the application's local state to be moved to a Memory before we snapshot, which would slow down function execution and be a very awkward paradigm to program in.

My main question (and I apologize for taking a page to get there) is: what roadblocks would need to be overcome in order to make stack snapshots possible in wasmtime? Since it will be relevant below, I should point out that the requirements for my use case are actually a bit looser than Wizer's in two ways:

1. I don't necessarily care that a snapshot is actually in the form of a new WebAssembly Module that can be instantiated and run on its own. I just want my host to be able to store something that lets it fast-forward a module to the point where a snapshot occurred, possibly by instantiating the original module and overwriting the globals, memory, and stack. Likewise, I'm not concerned about portability of the snapshot. We can assume that the snapshot will be loaded on the same Engine (and therefore the same version of wasmtime) it was produced on.
2. I don't require the host to necessarily do all the snapshotting work on its own. If we can invoke a callback that lets the application (through a library it links to) copy the stack to a Memory object, say, so it can be snapshotted, that should suffice.

I had the intuition that the application library could just run some WebAssembly code that copies the locals on the stack into a Memory object, but I was concerned about how wasmtime would behave when we restored such a stack. Unlike the core-dumping use case, I'm less concerned about the actual contents of the stack in relation to cranelift's dead-code elimination (DCE); however, I am concerned that if during the run that produced the snapshot, cranelift decides by DCE to eliminate an unnecessary value from the stack, is it possible that when we restore that stack in a new instantiation of the module that skips to the snapshot, cranelift won't perform the same optimization and it will try to pop a value off the stack that isn't there? If I had one reason for writing this comment, it's that I would really appreciate some clarification on how this compilation process works and what guarantees are in place, and how that might affect our endeavor to produce restorable stack snapshots.

Thanks everyone for reading. You all do great work, and I'd love to contribute going forward.

bjorn3 commented 1 year ago

Something that may work: if you reuse the exact same compiled machine code, then you could take a snapshot of the part of the native stack that contains the wasm frames and restore it later. You would have to fix up pointers (which probably requires emitting extra metadata and maybe some changes to avoid keeping pointers alive across function calls) and make sure that no native frames are on the stack, as those can't safely be snapshotted. By keeping the same compiled machine code you know that the stack layout is identical. Wasmtime already allows emitting compiled wasm modules (.cwasm extension) and loading them again. You would only need to implement the stack snapshotting and pointer fixups. This is still not exactly trivial, but likely much easier than perfectly reconstructing the wasm vm state.

> The initialization function may not call any imported functions. Doing so will trigger a trap and wizer will exit.

I would guess this is a combination of there being no way to hook up any imported functions from the host to wizer and this limitation ensuring that there is no native state that wizer can't snapshot. But I'm not a contributor to it, so it is nothing but a guess.

cfallin commented 1 year ago

@RyanTorok there are a lot of interesting ideas in your comment (I have to admit that I skimmed it in parts; I'd encourage a "tl;dr" of points for comments this long!). A few thoughts:

The fundamental issue that would have to be solved to snapshot and restore an active stack is relocation of mappings in the address space. In principle one could copy an image of the whole stack and all data segments, current PC and all registers, map them into a new execution at a later time and restart as if nothing changed... except that the heap and the stack will be at different locations than before.

In order to make that work, one has to prevent "host" addresses from escaping, or else precisely track where they escape to, or some combination. An example of the latter is the frame-pointer chain: one has addresses that point to the stack on the stack itself, but that's OK because one can precisely traverse the linked list and rewrite saved FPs if the stack moves. Likewise for return addresses. An example of the former is handling Wasm heap accesses. If we somehow ensure that only Wasm-level addresses (offsets to the heap) are "live" at snapshot points, and the only live address is the vmctx, except ephemerally when addresses for all other accessed memory are derived from it, then that could work. But that requires some compiler support, I think.
Restoring a native-level snapshot after optimizing the code a different way is a complete non-starter, I think. (I believe this is what you're referring to when speaking of Cranelift DCE working differently in a different run.) Many incidental details of the compiled code can change if the input changes: the layout of blocks, the registers and stackslots that the register allocator assigns for particular values, existence of some values in the function causing optimization of different values to go differently, etc.
Another option that I think you refer to is a Wasm-level snapshot. This is interesting, but requires mapping Wasm-level state to machine state precisely at possible snapshot points. We have a little bit of plumbing for that kind of thing with our debug support, but it's incomplete. The other side of the coin -- restoring the snapshot -- then requires "multi-entry functions" (something like "on-stack replacement" when a JIT tiers up) to enter into the middle of the IR with known values.

So I think some form of this is possible but it's a deep research project and requires a bunch of intimate knowledge of the compiler and runtime. We likely don't have the resources to help you design this in detail, but I'm personally curious to see what you come up with...
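The frame-pointer chain case mentioned above can be sketched with a toy model. This is only an illustration under simplifying assumptions: the stack is a Python list of 8-byte words at a known base address, the word at each frame pointer holds the caller's saved FP, and a saved FP of 0 terminates the chain. `relocate_fp_chain` is a hypothetical helper, not anything in Wasmtime.

```python
from __future__ import annotations

WORD = 8  # bytes per stack word in this toy model


def relocate_fp_chain(words: list[int], old_base: int, new_base: int,
                      fp: int) -> tuple[list[int], int]:
    """Copy a stack image to a new base and rewrite every saved FP in the chain.

    Only the slots reached by walking the chain are touched, so ordinary data
    that happens to look like an address is left alone.
    """
    delta = new_base - old_base
    moved = list(words)
    current = fp
    while current != 0:
        index = (current - old_base) // WORD
        saved_fp = words[index]
        if saved_fp != 0:
            moved[index] = saved_fp + delta  # point at the caller's new location
        current = saved_fp
    return moved, fp + delta


old_stack = [
    0,        # 0x1000: outermost frame, saved FP = 0 (end of chain)
    111,      # 0x1008: plain data
    0x1000,   # 0x1010: saved FP pointing at the outermost frame
    222,      # 0x1018: plain data
    0x1010,   # 0x1020: saved FP pointing at the middle frame
]

new_stack, new_fp = relocate_fp_chain(old_stack, 0x1000, 0x5000, fp=0x1020)
print([hex(w) for w in new_stack], hex(new_fp))
```

The walk works because the location of every escaped address is known precisely; the hard part described above is getting the same guarantee for every other host address that may be live.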

fitzgen commented 1 year ago

@RyanTorok,

The Wasm stack doesn't really exist anymore by the time Cranelift is done emitting machine code (it is erased very early in the pipeline, basically the first thing to go). Instead you would need to capture the actual native stack. This has issues that @bjorn3 mentioned around native frames in between Wasm frames, but even if it is just Wasm there will be pointers on the stack to things malloced by the host, namely the vm context and associated data structures. Each new process will have new ASLR and new malloc allocations, and new FaaS requests/invocations will have new stores (and their associated vm contexts). These structures will ultimately end up at different addresses in memory. So either (a) restoring a snapshot requires a list of places to go and update pointers, not dissimilar to relocs or a moving GC, or (b) we take extreme care that codegen only emits indirect references to these structures (somehow? we would need an actual handle to be the "root" at some point, or else a host call or something). Option (a) is a ton of work for Wasmtime/Cranelift to keep track of these things, and option (b) is also a ton of work but also makes Wasm execution speed much slower. In both cases, if we get anything wrong (miss a stack slot or register that has a native pointer when saving a snapshot, or accidentally emit a direct pointer reference rather than an indirection) then we have security vulnerabilities. Supporting all this would be a large refactoring of much of Wasmtime and Cranelift, and I'm pessimistic that it would ever happen. This is the kind of thing that you ideally need to build in from the very start, and Wasmtime and Cranelift have not been built with this in mind.

Backing up a bit: this topic would be better discussed in a dedicated issue or on zulip, since this issue is specifically about implementing the proposed standard Wasm coredump format, which won't help with this feature since it is strictly about the Wasm-level. I suggest filing a new issue or starting a thread on zulip if you have further questions.
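For readers unfamiliar with option (a), here is a toy sketch of a relocation-style side table. It assumes, purely for illustration, that every host pointer on the stack points into a single region (called `vmctx_base` here) and that the runtime knows exactly which slots hold such pointers. `Snapshot`, `take_snapshot`, and `restore_snapshot` are hypothetical names, not Wasmtime APIs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Snapshot:
    slots: list[int]          # slot values, host pointers stored as offsets
    pointer_slots: list[int]  # indices of slots that held host pointers


def take_snapshot(slots: list[int], pointer_slots: list[int],
                  vmctx_base: int) -> Snapshot:
    """Save the slots, turning each listed host pointer into a base-relative offset."""
    saved = list(slots)
    for i in pointer_slots:
        saved[i] = slots[i] - vmctx_base
    return Snapshot(saved, list(pointer_slots))


def restore_snapshot(snap: Snapshot, new_vmctx_base: int) -> list[int]:
    """Rebuild the slots for a new process where the region lives elsewhere."""
    restored = list(snap.slots)
    for i in snap.pointer_slots:
        restored[i] = snap.slots[i] + new_vmctx_base
    return restored


OLD_BASE = 0x7000_0000
NEW_BASE = 0x9000_0000
stack_slots = [7, OLD_BASE + 0x10, 42, OLD_BASE]

# Complete metadata: both pointers are rewritten for the new address space.
good = restore_snapshot(take_snapshot(stack_slots, [1, 3], OLD_BASE), NEW_BASE)

# Incomplete metadata: slot 3 still holds an address from the old process.
bad = restore_snapshot(take_snapshot(stack_slots, [1], OLD_BASE), NEW_BASE)
```

The `bad` case is the failure mode described above: a single missed slot leaves a stale native pointer in the restored state.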

RyanTorok commented 1 year ago

Thank you to everyone for the quick responses and insightful comments!

TL;DR: Issues with ASLR and the level of introspection into the runtime that would be required make stack snapshots pretty much a non-starter, and in fact they alerted me to limitations in the existing work on cold-starts I wasn't aware of.

Based on @fitzgen's comments about ASLR, I took another look back at the existing literature on cold-starts, and it turns out that the traditional method of snapshotting the entire state of the VM or language runtime is not compatible with ASLR at all, and for the exact reason @fitzgen pointed out.

A summary of the problem is that language runtimes (e.g. JVM, Python, Node.js, wasmtime, ...) inherently need to compile code using native addresses, thereby making the VM state not portable to different addresses. Traditionally, the way to deal with this portability issue would be to introduce another level of indirection (i.e. position-independent addresses), but @fitzgen, @cfallin, and @bjorn3 all pointed out that any such scheme would require very deep introspection into the language runtime to convert the indirect addresses to direct addresses, which would be an enormous endeavor, to the point that you'd be better off redesigning the entire runtime to support this indirection. Otherwise, you're really walking a tightrope on both performance and security (mess up the indirection once, and the tenant can read memory their program doesn't own).

The existing literature on cold-starts essentially punts on this issue; it requires all memory owned by the VM or runtime to be loaded at the same address every time. While I don't see any major reasons wasmtime couldn't support this from an implementation standpoint, I don't recommend this as a direction for multiple reasons:

Disabling ASLR is potentially bad for security. While I'm not aware of any features of language runtimes that fundamentally depend on ASLR to ensure security, disabling it would make any memory bugs much easier for the tenant to exploit, because the attacker could just hard-code addresses in their code, or, short of that, memorize them from a previous run using the same snapshot.
Security aside, in the cloud space, requiring code to always occupy the same address ranges every time would add unwanted contention to multi-tenant systems (i.e. cloud infrastructure). If two functions each had even a single (native) memory page that required the same fixed address, the host could not run both functions in parallel. One possible mitigation to this would be to spawn multiple processes, so the functions would not compete for the same virtual addresses, but not only does this introduce overhead of interprocess communication (IPC), in wasmtime's case, this would force us to choose between reverting back to OS-based lazy loading of pages (with mmap), rather than preallocating pages using userfaultfd, or becoming a serious memory hog by preallocating a userspace page cache for all N processes, neither of which would be worth the performance wins of more flexible snapshots.

To summarize (in research paper speak), there are several open problems that have to be addressed with language runtimes in general, not just wasmtime, in order for generalized snapshots to be a practical solution for the cloud. I'm going to continue looking into how we might provide a subset of this feature set via library abstractions that work with the designs of existing language runtimes.

Thanks for all your help everyone!

RyanTorok commented 1 year ago

As an aside, I think this question from my original comment:

> is it possible that when we restore that stack in a new instantiation of the module that skips to the snapshot, cranelift won't perform the same optimization and it will try to pop a value off the stack that isn't there?

was a simple misunderstanding on my part about the mechanics of cranelift. Clearly everything has to be compiled in order to run; it's just a matter of when that happens (AOT or JIT). My last project was in browser security, and in JavaScript engines we actually have to worry about code running at multiple optimization levels, which is where my confusion stemmed from. This doesn't change anything about the issues with ASLR or introspection, however.

whitequark commented 5 months ago

What tools can I use to inspect the coredumps?

fitzgen commented 5 months ago

@whitequark unfortunately there isn't much off-the-shelf at the moment.

There was https://github.com/xtuc/wasm-coredump/tree/main/bin/wasmgdb but as far as I know it only works with an old version of the format.

There are plans to build support for inspecting them via the debug adapter protocol in Wasmtime itself, as a stepping stone towards fuller debugging capabilities. See https://github.com/bytecodealliance/rfcs/pull/34 for more details. Unfortunately, that doesn't exist yet.

In the meantime, Wasm's core dumps are just wasm modules themselves, so you can use any tool that you might inspect a wasm module with to get at the information inside a core dump, e.g. wasm-tools print or wasm-objdump.
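Since a core dump is just a Wasm binary, even a few lines of script can show what is in one. The sketch below walks the section headers of any Wasm module and reports each section's id, its name if it is a custom section, and its payload size. It relies only on the standard Wasm binary layout (magic, version, then `id` + LEB128 size per section); `list_sections` is a hypothetical helper, and it does not interpret the coredump-specific payloads. From memory, the coredump spec stores its data in custom sections with names such as `core` and `corestack`, but please verify the names against the spec.

```python
from __future__ import annotations


def read_uleb128(data: bytes, pos: int) -> tuple[int, int]:
    """Decode an unsigned LEB128 integer; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def list_sections(data: bytes) -> list[tuple[int, str, int]]:
    """Return (section_id, custom_section_name_or_empty, payload_size) per section."""
    if data[:4] != b"\0asm":
        raise ValueError("not a Wasm binary: bad magic")
    pos = 8  # skip the 4-byte magic and 4-byte version
    sections = []
    while pos < len(data):
        section_id = data[pos]
        size, payload_start = read_uleb128(data, pos + 1)
        name = ""
        if section_id == 0:  # custom section: payload begins with its name
            name_len, name_start = read_uleb128(data, payload_start)
            name = data[name_start:name_start + name_len].decode("utf-8")
        sections.append((section_id, name, size))
        pos = payload_start + size
    return sections


# A hand-built module: one custom section named "core" and one memory section.
example = (
    b"\0asm\x01\x00\x00\x00"
    + b"\x00\x07\x04core\x00\x00"   # id 0, size 7, name "core", 2 payload bytes
    + b"\x05\x03\x01\x00\x01"       # id 5 (memory), size 3
)
print(list_sections(example))
```

To try it on a real dump, read the file with `open(path, "rb").read()` and pass the bytes to `list_sections`; the path is whatever file your runtime wrote the core dump to.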

I know this isn't a great answer. I wish I had a better one. But we are planning on getting there!

whitequark commented 5 months ago

Thanks! I'll keep it in mind--I have to use wasm-objdump a lot already so, cursed as it is, this does fit into my workflow...

xtuc commented 5 months ago

> There was https://github.com/xtuc/wasm-coredump/tree/main/bin/wasmgdb but as far as I know it only works with an old version of the format.

Sorry about that. I'm planning to update wasmgdb to the latest spec but haven't had the time yet.

bytecodealliance / wasmtime

Support for Wasm Coredump #5732