Save WASM state and resume

IamTheCarl commented 3 years ago

Feature

To be able to interrupt and save the state of executing WASM code and then resume it later, possibly on a different host machine.

Benefit

I am writing an add-on for the Open Computers mod in Minecraft. The mod adds tiny computers that can be programmed by the player. One of their features is that if you quit the game, when you resume later, each machine's state is restored just as the player left it.

The default Lua interpreter uses a custom implementation of Lua that can save and restore its state like this. There's a small handful of emulators that just save their memory and registers and restore them later. I don't see a clear way to do this with wasmtime.

I do see potential in this being useful for a tool similar to Jupyter. You could pause a computationally intensive task and resume it later, or a cloud could serialize a job, and send it to another machine to resume in the case of a scale up/down.

Implementation

I can see that modules can be serialized already. What really needs to be saved is the state of memory and execution. Just dumping the memory into a byte array will be enough to save the memory. Saving the execution state is what looks hard to me. Dumping the registers to be restored later could work for a non-portable solution, but something that can be restored on another machine, possibly of a different architecture, would be preferred.

Unfortunately I don't know enough about wasmtime's implementation to give a fantastic recommendation here.

Alternatives

This is actually my fallback plan if the feature is rejected.

If you can just serialize the memory of the WASM environment (actually this may already be possible, I just haven't tried yet) you could push the job of tracking state off onto a Rust module's async state machine. You would need a function that is called regularly to resume this state machine and the module would need to frequently interrupt itself so that this function can return to the host.

The advantage here is that you don't need to worry about any kind of register. Saving the state is handled by the async state machine and you can just call that one function to run the state machine to resume the guest application.

The disadvantage is that this is really only practical if the WebAssembly module is written in Rust, and the person writing this Rust code must correctly write an async application.
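
A minimal sketch of this fallback, assuming a guest that keeps all of its progress in an explicit state value and does a bounded amount of work per call. It uses a hand-written state machine instead of async so that it stays dependency-free. Task, tick, and budget are hypothetical names for illustration, not part of wasmtime.

```rust
/// Hypothetical guest task: sums the integers 0..limit a few steps at a time.
/// All progress lives in plain fields (linear memory in a real guest), so no
/// registers or stack frames need saving between calls.
#[derive(Debug, Clone, PartialEq)]
pub enum Task {
    Running { next: u64, limit: u64, sum: u64 },
    Done { sum: u64 },
}

impl Task {
    pub fn new(limit: u64) -> Task {
        Task::Running { next: 0, limit, sum: 0 }
    }

    /// Do at most `budget` steps of work, then return to the host.
    /// Returns true once the task has finished.
    pub fn tick(&mut self, budget: u64) -> bool {
        if let Task::Running { next, limit, sum } = self {
            let mut steps = 0;
            while *next < *limit && steps < budget {
                *sum += *next;
                *next += 1;
                steps += 1;
            }
            if *next >= *limit {
                let total = *sum;
                *self = Task::Done { sum: total };
            }
        }
        matches!(self, Task::Done { .. })
    }
}
```

The host would call tick repeatedly. Between calls nothing is executing, so cloning (or serializing) the Task value is enough to resume later.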

bjorn3 commented 3 years ago

> To be able to interrupt and save the state of executing WASM code and then resume it later, possibly on a different host machine.

Restoring on a different host machine would require pessimizing optimizations so that all locals and the full stack are known between every instruction, since different backends may optimize instructions in different ways and thus need different information to continue running from a given point. This will probably be a significant perf hit. Alternatively, interrupting could be limited to certain safepoints, allowing a much smaller perf hit.

> The default Lua interpreter uses a custom implementation of Lua that can save and restore its state like this.

It is much easier to implement this in an interpreter than a compiler as an interpreter needs to keep all state, while a compiler will "forget" about state as soon as possible to reduce register pressure and will try to fold instructions together (and thus allow forgetting about certain locals) or mangle them in different ways.

IamTheCarl commented 3 years ago

> The on a different host machine will require pessimizing optimizations such that between every instruction all locals and the full stack is known as different backends may optimize instructions in different ways and thus require different information to at a given point continue running. This will probably be a significant perf hit.

I was worried that would be the problem.

> Alternatively interrupting could be limited to certain safepoints, allowing a much smaller perf hit.

I could easily trick the user into having regular safe points by requiring them to regularly check in with something like a watchdog timer. How would you suggest saving the state at these safe points?

bjorn3 commented 3 years ago

> How would you suggest saving the state at these safe points?

That will still require changes to Cranelift to allow for saving the state and restoring at such a safe point. The changes are just easier, and carry less of a perf hit, than allowing it at arbitrary points.

cfallin commented 3 years ago

@IamTheCarl thanks for starting this conversation -- it's a really interesting one!

I'm curious what sort of state-saves are actually necessary. As @bjorn3 says, very general snapshotting at arbitrary points is expensive; if we needed to do something like that, probably the better way would be to just save the registers, heap and stack, ensure we restore into the same generated code (this has limits wrt CPU features too, as mentioned above), and take care to make the stack and register file "relocatable", in the sense that they can be fixed up for other heap-base addresses on restore. (Some thoughts on that: to do so, probably the easiest way would be to tweak codegen so that the heap base always lives in one pinned register, and any address computation uses that as part of its address expression, so we never have intermediate pointers in registers or spilled to stack. Then we just have to fix up frame pointers and return-area pointers on the stack.)

But there's a potentially much easier way: could we get away with only snapshotting when no Wasm frame is on the stack? In other words, when a call into the Wasm has returned all the way out? If so, then we don't have to worry about register or stack state at all; we can just snapshot the heap and globals. We could even restore with different generated code, on a different architecture or just with different CPU features.
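
A minimal sketch of what such a frame-free snapshot could contain, assuming the mutable state is just linear memory plus globals. InstanceState and its byte layout are hypothetical and do not use any Wasmtime API. Fixed little-endian encoding is chosen so the bytes mean the same thing on any host architecture.

```rust
/// Hypothetical stand-in for the mutable state of an idle instance.
#[derive(Debug, Clone, PartialEq)]
pub struct InstanceState {
    pub memory: Vec<u8>,
    pub globals: Vec<i64>,
}

impl InstanceState {
    /// Layout: global count (u32, little-endian), each global (i64,
    /// little-endian), then the raw memory bytes.
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(4 + self.globals.len() * 8 + self.memory.len());
        out.extend_from_slice(&(self.globals.len() as u32).to_le_bytes());
        for g in &self.globals {
            out.extend_from_slice(&g.to_le_bytes());
        }
        out.extend_from_slice(&self.memory);
        out
    }

    /// Returns None if the buffer is too short for the declared global count.
    pub fn from_bytes(bytes: &[u8]) -> Option<InstanceState> {
        let mut count_buf = [0u8; 4];
        count_buf.copy_from_slice(bytes.get(..4)?);
        let count = u32::from_le_bytes(count_buf) as usize;
        let globals_end = count.checked_mul(8)?.checked_add(4)?;
        let raw_globals = bytes.get(4..globals_end)?;
        let mut globals = Vec::new();
        for chunk in raw_globals.chunks_exact(8) {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(chunk);
            globals.push(i64::from_le_bytes(buf));
        }
        Some(InstanceState {
            memory: bytes[globals_end..].to_vec(),
            globals,
        })
    }
}
```

Because no registers or native stack frames are captured, nothing in this format depends on the generated code or the CPU that produced it.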

This is sort of like what Wizer does, so that's probably the place to look for more ideas. cc @fitzgen for more thoughts!

IamTheCarl commented 3 years ago

That does sound a bit like my alternative solution.

The idea was to regularly call a function that would let the guest run for a short bit and then return shortly after. It would be up to the developer of the guest code to keep track of the state between calls. This is inconvenient for the guest developer, since they need to write an application that can run in such an environment (some global mutable variables will be needed). In Rust, using async can make this pretty intuitive, but other languages like C would require they manually build their own state machine.

Of course if we're taking it that far I could then even push off the job of serialization to the guest code. Just send it some kind of event "shutdown now or you'll be force terminated" and that'll give them just a moment to save their state before shutdown. This one is so simple to implement that wasmtime wouldn't even need any modifications, but it is also the most inconvenient for the developer of guest code.

fitzgen commented 3 years ago

> This is sort of like what Wizer does, so that's probably the place to look for more ideas. cc @fitzgen for more thoughts!

Yeah, saving state without frames on the stack is pretty easy, and as Wizer shows, you don't even need to build that functionality into the engine itself, you can do it with the standard reflection APIs that engines expose and use Wasm itself as your snapshot format. You could even factor out the code from the data (snapshot) and make it so that you don't need to recompile the code for each snapshot, just reuse the already-compiled module from before and pass in new memory/globals for its imports.

Binaryen's asyncify pass can help turn synchronous code into asynchronous code. If my understanding is correct, it lifts frame activations and their local values into linear memory and does some kind of CPS-esque transformation on the code. It might make assumptions about the Web or that you're also using Emscripten; not sure, I haven't actually used it. This might be one way to transform the Wasm so that code can be written synchronously but still make snapshotting with zero activations on the stack viable.

One final thought: snapshotting at ~gc safe points could be made possible relatively easily if we spill all registers at safe points. This would be a config option, because we wouldn't want to do this unconditionally. But if we only have to capture and restore stack values, that seems like a much easier problem. Hard part is addressing native pointers. I don't have any good ideas here, but I think you may have some, @cfallin.

cfallin commented 3 years ago

> I think you may have some, @cfallin.

Indeed! This is the bit about "no intermediate pointers in registers or the stack" alluded to above. To add a bit more detail, just to get this written down:

- We need to be a little careful in legalization: where we have loads/stores that decompose into either heap_addr or stack_addr and a native-pointer load/store, we need to at least keep the ops close in the CLIF so that they can be pattern-matched, or ideally do codegen directly on the heap/stack-level ops without decomposing them.
- The goal is to turn every Wasm heap access into e.g. ld rA, [rBase + rB] where rB is a 32-bit Wasm pointer, and rBase is a fixed register that we decide and make non-allocatable. We make this part of an internal Wasmtime ABI, and set it before calling into Wasm. (This implies a little care taken w.r.t. hostcalls too but that's manageable I think.)
- Similarly, every stack-slot access turns into ld rA, [sp + ...]; we never copy sp to another register and compute an address based on it.
- We need to tweak the internal ABI to deal with return-area pointers (for multiple return values) slightly differently. Probably we just use a fixed offset from frame at entry, rather than passing an implicit arg with the pointer.
- Finally, either codegen without a frame pointer, or be prepared to fix up the frame-pointer chain on relocation.

Given those codegen changes, the code is completely "data-relocatable": the only native-address-space pointers in the register file or on the stack at any time are in sp and rBase. If we set those to different values on Wasm re-start after moving data, we should be just fine.

(Caveat: return addresses; if code is relocated too, probably best to retain frame-pointer chain, rewrite return addresses as relative to module base on snapshot, and re-add the new module base on restore.)
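
The return-address fix-up in that caveat can be sketched as plain arithmetic. This is an illustration only: to_relative and to_absolute are hypothetical helpers, and the addresses are modeled as u64 values rather than read from a real stack.

```rust
/// On snapshot: rewrite absolute return addresses as offsets from the
/// module's code base. Returns None if an address lies below the base.
pub fn to_relative(return_addrs: &[u64], old_base: u64) -> Option<Vec<u64>> {
    return_addrs
        .iter()
        .map(|addr| addr.checked_sub(old_base))
        .collect()
}

/// On restore: re-add the base at which the module's code now lives.
/// Returns None if an address would overflow.
pub fn to_absolute(offsets: &[u64], new_base: u64) -> Option<Vec<u64>> {
    offsets
        .iter()
        .map(|off| new_base.checked_add(*off))
        .collect()
}
```

This only works if the code at each offset is identical after restore, which matches the "same generated code" restriction discussed earlier in the thread.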

Probably a few weeks' work but would be very useful even beyond this use-case!

bjorn3 commented 3 years ago

> This is the bit about "no intermediate pointers in registers or the stack" alluded to above.

Intermediate pointers are fine if they aren't used across a safe point or if it is the vmctx. In the former case they simply wouldn't need to be stored, and in the latter case the vmctx value can (and should) trivially be replaced with the newly allocated vmctx when loading.

> We need to tweak the internal ABI to deal with return-area pointers (for multiple return values) slightly differently. Probably we just use a fixed offset from frame at entry, rather than passing an implicit arg with the pointer.

If this parameter is properly annotated it can be fixed up when loading too by looking at the caller.

> Finally, either codegen without a frame pointer, or be prepared to fix up the frame-pointer chain on relocation.

The stack doesn't need to be copied verbatim. Only the known locations of spilled values, explicit stackslots (unused by wasm) and the return value need to be copied. The latter can be stored as function index + callsite index.

bjorn3 commented 3 years ago

> Probably a few weeks' work but would be very useful even beyond this use-case!

Yeah, might be useful for on-stack replacement, e.g. for speculative optimization or going from an interpreter to jitted code.

IamTheCarl commented 3 years ago

I am impressed with both the quick reply and eagerness to find solutions in this team. I'm grateful.

There's a lot of terminology here I'm unfamiliar with, so I'd like to check and make sure I'm following this conversation correctly. It sounds like your plan at the moment is to take a non-async WASM module and transform it into an async form before it even hits the JIT. I assume this will be implemented somewhere in the IR layer. That's pretty cool.

It also sounds like you're having difficulty knowing how to serialize objects on the heap? I can understand why that would be an issue since you can't depend on getting those same addresses back.

So what comes next? I'd like to help but I feel I have a lot to catch up on to make a meaningful contribution.

cfallin commented 3 years ago

@IamTheCarl it spawned an interesting discussion and one that's timely for other potential uses too, so thank you for starting it!

> non-async Wasm module and transform it

Yes, this is probably the "easiest" solution (for some definition of easy) -- as @fitzgen suggested above, the Binaryen toolchain has an asyncify transform that might help turn arbitrary Wasm into something that would work with your "call into and execute a bit of Wasm at a time" approach; between those calls, all you need to save is the Wasm heap and globals, not any in-progress execution state such as registers or stack. I don't know much about this tool but it looks like there's a pretty comprehensive intro blog post here: https://kripken.github.io/blog/wasm/2019/07/16/asyncify.html

The next step after that is to modify Wasmtime itself so that it generates code that allows for snapshotting and restoring, including the in-progress execution state. There are various tradeoffs here involving whether we want to allow restoring on a different machine architecture (hardest), or just with identical JIT'd code (easier); and whether we want to allow arbitrary interruption and checkpoint at any point (hardest, possibly pessimizes codegen), or checkpoint only at certain points, which we're calling "safepoints" above, in reference to some garbage-collector terminology (a bit easier). Anything along these lines though would not be a quick short-term solution for you, so I personally would suggest looking into the asyncify option, or perhaps just defining an execution model where control always returns from your Wasm module periodically.

RobDavenport commented 2 years ago

> But there's a potentially much easier way: could we get away with only snapshotting when no Wasm frame is on the stack? In other words, when a call into the Wasm has returned all the way out? If so, then we don't have to worry about register or stack state at all; we can just snapshot the heap and globals.

Does a method to do this currently exist? I'm exploring various wasm runtimes and have been unable to find such a feature supported as of yet. If not, how difficult would it be to implement? I may be able to take a look at it as it seems to be a heavily requested feature.

bjorn3 commented 2 years ago

Isn't that what wizer does?

RobDavenport commented 2 years ago

Ahh silly me - I did a Google search on Wizer after seeing it mentioned before and came up with a completely unrelated result. But yes, it looks like it's quite similar to what I'm looking for. While the caveats don't work for my specific use case (specifically, not being able to call imported functions during the init function), it does provide enough of an idea of how to proceed for now.

cfallin commented 2 years ago

Indeed, Wizer came up in the above discussion as well; the main difference in my mind (at least as far as I can remember now) was whether this would be a more generic facility, with an API to snapshot/resume at arbitrary points. Wizer currently runs a wizer.initialize export, and doesn't allow calls to imports (or can allow WASI, as a special case), so it's targeted toward a "precompute some state once" use-case. I imagine it could be extended to be more general, of course!

Interestingly there was a recent PR from @koute (#3691) that did snapshotting as well. The specific use case in that PR should I think be mostly covered by our recent instantiation-time improvements. But if snapshotting in a more general sense continues to arise as a need, at the Wasmtime API level, that could merit more discussion, I suppose.

@RobDavenport I'm curious about your use-case here, on a few axes: (i) do you need continued snapshot/restore (i.e. multiple roundtrips), or just one initial snapshot and then restores from that? And (ii) does it need to be programmatic at the Wasmtime API level, or is a separate tool (e.g. Wizer) usable for you?

RobDavenport commented 2 years ago

Thanks for the question @cfallin, in my case I'm writing something like a game engine and using WASM as my game-logic code. Because I need to support rollback multiplayer (like GGPO, which is often used in emulators), I need to consistently store the entire state of the game for the past few frames, and, if a mis-prediction is detected, roll back to a confirmed synchronized state and re-simulate the rest of the frames (often more than one!) with the new input up until the current time.
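
A minimal sketch of the bookkeeping this kind of rollback needs, assuming each snapshot is an opaque byte buffer tagged with its frame number. RollbackBuffer and its methods are hypothetical names, not GGPO or Wasmtime API.

```rust
use std::collections::VecDeque;

/// Keeps the most recent `capacity` snapshots, oldest first.
pub struct RollbackBuffer {
    capacity: usize,
    frames: VecDeque<(u64, Vec<u8>)>,
}

impl RollbackBuffer {
    pub fn new(capacity: usize) -> RollbackBuffer {
        RollbackBuffer {
            capacity: capacity.max(1), // always keep room for one snapshot
            frames: VecDeque::new(),
        }
    }

    /// Record the state for `frame`, evicting the oldest snapshot if full.
    pub fn save(&mut self, frame: u64, state: &[u8]) {
        if self.frames.len() == self.capacity {
            self.frames.pop_front();
        }
        self.frames.push_back((frame, state.to_vec()));
    }

    /// Return the state saved for `frame` and discard every newer snapshot,
    /// since those will be rewritten during re-simulation. Returns None and
    /// leaves the buffer untouched if `frame` is no longer stored.
    pub fn rollback_to(&mut self, frame: u64) -> Option<Vec<u8>> {
        let pos = self.frames.iter().position(|entry| entry.0 == frame)?;
        self.frames.truncate(pos + 1);
        self.frames.back().map(|entry| entry.1.clone())
    }

    pub fn len(&self) -> usize {
        self.frames.len()
    }
}
```

After a rollback the host would restore the returned bytes into the instance, then call save again for each re-simulated frame.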

I was actually able to get this working for the most part by following the suggestions in this thread, along with some things from Wizer. I iterate through the instance's exports, clone the memories and non-const globals, and then save them out as a Box<[u8]> via data_unchecked() or as a WASM Value enum, along with the related keys.

Loading saved globals is very straightforward and doesn't need any explanation. For memories, I do a copy_from_slice from the snapshot back into the "hot" memory, using the length of the snapshot so the copy function doesn't panic. I believe the snapshot's reserved memory size should always be equal to or less than the "hot" memory, since the WASM's host memory can only grow for a particular instance, and since that snapshot was an actual state the VM was in at one point, the internals should have already allocated enough space for it.
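
The memory restore described here can be sketched with plain slices. restore_memory is a hypothetical helper and the live memory is modeled as a byte slice rather than obtained from Wasmtime; the length check turns the potential copy_from_slice panic into an error.

```rust
/// Copy `snapshot` over the start of `live`. Fails without modifying `live`
/// if the snapshot is larger than the live memory, which should not happen
/// when memory can only grow.
pub fn restore_memory(live: &mut [u8], snapshot: &[u8]) -> Result<(), String> {
    if snapshot.len() > live.len() {
        return Err(format!(
            "snapshot ({} bytes) is larger than live memory ({} bytes)",
            snapshot.len(),
            live.len()
        ));
    }
    // copy_from_slice panics unless both slices have the same length,
    // so only the prefix covered by the snapshot is overwritten.
    live[..snapshot.len()].copy_from_slice(snapshot);
    Ok(())
}
```

One thing to watch for: if the memory grew after the snapshot was taken, the bytes past the snapshot length keep their newer contents and the guest still sees the larger memory size. Whether that matters depends on the guest, so treat this as an assumption to verify.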

I'm not sure how dangerous this might be since the WASM client script may be allocating or deallocating memory, but I'm under the assumption that since it's all lumped together and already allocated inside the WASM host VM then it should just work unless the instance or store are moved somehow.

So to answer your questions... (i) Yes, multiple trips will be made, however at specific times determined by the host program without any call frames on the stack. These will always be going back in time, though, and then re-simulating and rewriting old snapshots as we progress. (ii) Actually I have no preference, as I was able to do it for my use case mentioned above. The reason I couldn't use Wizer in my case was I needed to allow the WASM code to make callbacks into the host for things like input handling or draw requests.

However, I believe the original requester for this specifically wanted a way to be able to save/resume the WASM state on a fresh application or even potentially another machine. Unfortunately I'm not too familiar with how the internals work so I can't comment on the safety or feasibility of that, as my quick hack is only intended to work with a single module instance, on the same machine, during the same application lifetime.

IamTheCarl commented 1 year ago

I'm curious if much work has been done on this. I have another toy project idea where a feature like this could be useful.

I saw that there now appears to be an async interface for wasmtime which I imagine would be useful for a feature like this.

cfallin commented 1 year ago

Not much has changed regarding the fundamentals here since the last comments; Wizer is still the state-of-the-art, and unfortunately we don't have anything built into Wasmtime itself for snapshotting.

ZheniaZuser commented 5 months ago

How hard would it be to make WASI WebGPU and the file system (assuming the FS will never be changed by any other process) work well with this, i.e. properly resume from a saved state?

This sounds like it may be useful as a workaround for Android's background app killing: an Android app made of a wasm runtime and a wasm file that holds most of the actual app's code. As soon as the app goes to the background, it would pause execution of the wasm (and its linked wasm plugins, if any) and snapshot the state, then restore it when the user returns to the app.

bytecodealliance / wasmtime