Fold Wasm<--->host trampoline functionality into component trampolines

fitzgen commented 2 years ago

This is a follow up to https://github.com/bytecodealliance/wasmtime/pull/4431

In that PR we don't save entry SP and exit FP/return pointer for calls into/out of components because they use a different set of trampolines. However, saving the entry SP and exit FP/return pointer isn't something we can simply add to the existing component trampolines, because they are defined in CLIF and CLIF doesn't have a way to talk about these particular architecture-specific details. Mach insts do via operand constraints given to regalloc, but CLIF itself doesn't. So we would need to either have two layered trampolines that bounce from the first to the second when calling into / out of components (very not ideal) or we need to add an instruction to CLIF or something to grab the current SP/FP/return pointer (probably we should do this, but it requires some thought/design).

fitzgen commented 2 years ago

Also when fixing this we need to re-enable the attempt_to_leave_during_malloc component model test.

alexcrichton commented 2 years ago

To elaborate a bit more on the issue here -- this will be a repeat for me/@fitzgen but wanted to write stuff down anyway.

The stack unwinding in #4431 relies on precisely knowing the stack pointer when we enter WebAssembly along with the frame pointer and last program counter when we exit WebAssembly. This is not generally available in Rust itself so we are relying on handwritten assembly trampolines for these purposes instead.

Entry into WebAssembly

Entry into WebAssembly happens via one of two routes:

1. A "typed" route using the wasmtime::TypedFunc API or when invoking a core instance's start function (which has a known fixed signature of no inputs and no outputs). In these cases Rust does an indirect call directly to the Cranelift-generated code for the corresponding wasm function.
2. An "untyped" route which is used by wasmtime::Func::call as well as wasmtime::component::{Func,TypedFunc}::call. In this situation Rust will call a Cranelift-compiled trampoline. The Cranelift trampoline will load arguments from a stack parameter and then make an indirect call to the actual Cranelift-compiled wasm function which is also supplied as an argument.

Today this all records the entry stack pointer via the host_to_wasm_trampoline defined in inline assembly. Concretely Wasmtime will "prepare" an invocation which stores the Cranelift-generated function to call (be it a raw function in case (1) or a trampoline for case (2)) into the VMContext::callee field and then invoke the host_to_wasm_trampoline inline asm symbol.
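As a rough illustration of the "prepare, then enter through one shared trampoline" flow described above, here is a toy model in plain Rust. Everything in it (ToyContext, prepare, toy_host_to_wasm_trampoline) is a hypothetical stand-in rather than Wasmtime's actual code, the callee signature is fixed for simplicity, and the address of a local is used as an approximation of the stack pointer since safe Rust cannot read the real one.

```rust
/// Hypothetical stand-in for the runtime state reachable from a VMContext.
#[derive(Default)]
struct ToyContext {
    /// The function the shared trampoline should transfer control to.
    callee: Option<fn(i32) -> i32>,
    /// Approximate stack address recorded when entering "wasm".
    entry_sp: usize,
}

/// "Prepare" an invocation by stashing the callee in the context.
fn prepare(cx: &mut ToyContext, callee: fn(i32) -> i32) {
    cx.callee = Some(callee);
}

/// Toy version of the single shared entry trampoline: record an entry
/// stack address, then call whatever callee was prepared.
#[inline(never)]
fn toy_host_to_wasm_trampoline(cx: &mut ToyContext, arg: i32) -> i32 {
    // Assumption: the address of a local approximates the entry SP.
    let marker = 0u8;
    cx.entry_sp = &marker as *const u8 as usize;
    let callee = cx.callee.take().expect("prepare() must run before entry");
    callee(arg)
}

/// Stand-in for a Cranelift-compiled wasm function.
fn toy_wasm_double(x: i32) -> i32 {
    x * 2
}

fn main() {
    let mut cx = ToyContext::default();
    prepare(&mut cx, toy_wasm_double);
    let result = toy_host_to_wasm_trampoline(&mut cx, 21);
    println!("result = {result}");
    println!("entry sp recorded: {}", cx.entry_sp != 0);
}
```

The real trampoline is written in inline assembly so that it can work for every signature and capture the precise stack pointer; the toy only shows the control flow.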

This entry isn't too relevant to the component model since we're already doing what's necessary for the stack unwinding, recording the sp on entry. Nevertheless, while describing the situation, I want to note some oddities here as well:

The actual trampoline used in (2) to load arguments from the stack is not actually always defined by Cranelift. Instead sometimes it's a monomorphized Rust function host_to_wasm_trampoline from the Func::wrap API. This means we unfortunately cannot rely on Cranelift to supply all these trampolines which means we can't rely on the trampolines to do things that Rust itself can't do.
The entry trampoline currently requires the ability to tail-call to the actual callee. This is a technical limitation due to using the exact same trampoline for every single entry point, regardless of signature.

Ideally we would always enter WebAssembly via a Cranelift-compiled trampoline. That would mean we could do anything in the trampoline that Cranelift would do and ideally remove the need to have inline asm for this. We might still need multiple trampolines for untyped entry points and typed entry points, but overall we should ideally be able to do better here.

Exiting WebAssembly

Exiting back to the host happens in a few locations, and this is the focus of this issue where it's missing support in the component model:

Exiting from core wasm will either end up in something defined by Func::wrap or Func::new (roughly). Both of these use a VMHostFunctionContext which internally has two function pointers. One is the VMCallerCheckedAnyfunc which wasm actually calls and the other is the actual host function pointer defined in Rust being invoked. The function pointer contained within the VMCallerCheckedAnyfunc is a trampoline written in inline assembly which spills the fp/pc combo into VMRuntimeLimits. The function pointer to invoke contained within the VMHostFunctionContext has the "system-v ABI" since it receives arguments in native platform registers. For Func::wrap this is a Rust function and for Func::new this is a Cranelift-generated trampoline which spills arguments to the stack and then calls a static address specified at compile time (using Func::new requires Cranelift at runtime).
Exiting from a component always happens via a lowered host function. Concretely what happens is that a VMComponentContext has an array lowering_anyfuncs: [VMCallerCheckedAnyfunc; component.num_lowerings]. This array is what core wasm actually calls and is exclusively populated by Cranelift-compiled trampolines (via compile_lowered_trampoline). These trampolines are similar to the Cranelift-compiled trampolines for Func::new but call a host function of type signature VMLoweringCallee. This is where fp/pc are not recorded while we exit wasm. There's no clear way to use the same trick as Func::{wrap,new}, which have a singular inline asm trampoline for all signatures, since the callee to defer to depends on the LoweringIndex.
Finally exiting wasm can also happen via libcalls implemented in Wasmtime. Currently each libcall gets a unique inline-asm-defined trampoline that records the pc/fp combo and then does a direct tail-call to the actual libcall itself.

Proposal to fix this issue

Overall I find the current trampoline story pretty complicated and also pretty inefficient. There's typically at least one extra indirect call for all of these transitions and additionally there's very little cache-locality. The fix I'm going to propose here isn't a silver bullet though and will only solve some issues, but I think it is still worth pursuing.

I think we should add a few new pseudo-instructions to Cranelift:

Something to get the current frame pointer
Something to get the current stack pointer
Something to get the return address of the current function
Something to get the address of a label in a function (this may already exist, not sure)
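To make concrete what kind of value the "current stack pointer" instruction would produce, here is a minimal sketch using Rust inline assembly. This is not CLIF and not a proposed design; the function name current_stack_pointer is made up for illustration, only x86_64 and aarch64 are handled with real assembly, and other targets fall back to the address of a local as a rough approximation. Note that it reads the stack pointer at some point inside a Rust function, which is less precise than what a trampoline needs.

```rust
/// Read the stack pointer register on x86_64 (Intel syntax).
#[cfg(target_arch = "x86_64")]
#[inline(never)]
fn current_stack_pointer() -> usize {
    let sp: usize;
    unsafe {
        std::arch::asm!(
            "mov {}, rsp",
            out(reg) sp,
            options(nomem, nostack, preserves_flags)
        );
    }
    sp
}

/// Read the stack pointer register on aarch64.
#[cfg(target_arch = "aarch64")]
#[inline(never)]
fn current_stack_pointer() -> usize {
    let sp: usize;
    unsafe {
        std::arch::asm!(
            "mov {}, sp",
            out(reg) sp,
            options(nomem, nostack, preserves_flags)
        );
    }
    sp
}

/// Fallback for other targets: approximate with the address of a local.
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
#[inline(never)]
fn current_stack_pointer() -> usize {
    let marker = 0u8;
    &marker as *const u8 as usize
}

fn main() {
    let local = 0u8;
    let sp = current_stack_pointer();
    println!("stack pointer:    {sp:#x}");
    println!("address of local: {:#x}", &local as *const u8 as usize);
}
```

The two printed addresses should be close to each other, since both refer to the current thread's stack.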

With these tools we can start trying to eventually move all of the trampolines above to Cranelift exclusively and remove both Rust-defined and inline-asm defined trampolines:

For components, and this issue, compile_lowered_trampoline could be updated to use the cranelift instructions to record the pc/fp combo into the VMRuntimeLimits. This would remove the need for any extra trampoline when exiting a component and would solve the issue at hand.
For libcalls we could use the cranelift instructions to manually save fp/pc just before a libcall out to the runtime. This would remove all trampolines related to libcalls.
For Func::new the cranelift-generated trampoline could act similar to compile_lowered_trampoline and store the fp/pc combo to VMRuntimeLimits and avoid the need for two trampolines.
Untyped host-to-wasm trampolines could do the sp-saving internally rather than relying on the external trampoline to do so.
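The component case above can be sketched as a toy model: a single trampoline both records the exit fp/pc and dispatches to the callee selected by the lowering index, so no second trampoline is needed. All names here (ToyRuntimeLimits, ToyComponentContext, toy_lowered_trampoline) are hypothetical, and fp/pc are passed in as plain parameters because Rust cannot read them itself; in the proposal they would come from the new Cranelift instructions.

```rust
/// Hypothetical stand-in for the fp/pc slots in VMRuntimeLimits.
#[derive(Default, Debug, PartialEq)]
struct ToyRuntimeLimits {
    last_exit_fp: usize,
    last_exit_pc: usize,
}

/// Hypothetical stand-in for a host callee of a lowered function.
type ToyLoweringCallee = fn(&[i64]) -> i64;

/// Hypothetical stand-in for the per-component context.
struct ToyComponentContext {
    limits: ToyRuntimeLimits,
    /// One callee per lowering index.
    lowerings: Vec<ToyLoweringCallee>,
}

/// Toy "folded" trampoline: save the exit state, then call the callee
/// for this lowering index directly.
fn toy_lowered_trampoline(
    cx: &mut ToyComponentContext,
    index: usize,
    fp: usize, // assumption: supplied by a "get frame pointer" instruction
    pc: usize, // assumption: supplied by a "get return address" instruction
    args: &[i64],
) -> i64 {
    cx.limits.last_exit_fp = fp;
    cx.limits.last_exit_pc = pc;
    let callee = cx.lowerings[index];
    callee(args)
}

fn toy_sum(args: &[i64]) -> i64 {
    args.iter().sum()
}

fn toy_product(args: &[i64]) -> i64 {
    args.iter().product()
}

fn main() {
    let mut cx = ToyComponentContext {
        limits: ToyRuntimeLimits::default(),
        lowerings: vec![toy_sum as ToyLoweringCallee, toy_product as ToyLoweringCallee],
    };
    let result = toy_lowered_trampoline(&mut cx, 1, 0x1000, 0x2000, &[2, 3]);
    println!("result = {result}, limits = {:?}", cx.limits);
}
```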

Those are at least the easy ones we could knock out with more Cranelift features. Otherwise there are still a number of places that we are requiring trampolines:

Exit trampolines with Func::wrap could ideally be generated by Cranelift but would still require two indirect calls. One call to get to the trampoline from the original core wasm and then a second call from the trampoline to the host function itself. The main problem here is getting a trampoline. Assuming trampolines are provided by Cranelift, they become available at runtime when modules are loaded, which means Func::wrap needs to, at some point, dynamically look up a trampoline and find a corresponding one in a previous module's compiled image. This is not trivial.
Entry trampolines to TypedFunc are similarly somewhat nontrivial, but I think surmountable. Today a Store has a registry of untyped trampolines per-function signature, and I think it could also have a registry of typed trampolines per-function signature. This typed trampoline would then be used to enter wasm instead of today's calling the raw wasm function. In this situation the callee would be passed as an argument to the trampoline in the same manner untyped trampolines receive the callee.
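A per-signature registry of typed trampolines, with the callee passed to the trampoline as an argument, might look roughly like the following toy. The types (ToyStore, ToySignature, ToyTypedTrampoline) are invented for illustration and say nothing about how Store is actually laid out; values are flattened to i64 to keep the example small.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified value types.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum ToyValType {
    I32,
    I64,
}

/// Hypothetical function signature used as the registry key.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct ToySignature {
    params: Vec<ToyValType>,
    results: Vec<ToyValType>,
}

/// Stand-in for a raw wasm function.
type ToyWasmFn = fn(&[i64]) -> i64;
/// Stand-in for a typed trampoline: it receives the callee as an argument.
type ToyTypedTrampoline = fn(ToyWasmFn, &[i64]) -> i64;

/// Hypothetical store holding one typed trampoline per signature.
#[derive(Default)]
struct ToyStore {
    typed_trampolines: HashMap<ToySignature, ToyTypedTrampoline>,
}

impl ToyStore {
    fn register(&mut self, sig: ToySignature, trampoline: ToyTypedTrampoline) {
        self.typed_trampolines.insert(sig, trampoline);
    }

    /// Enter "wasm" through the trampoline registered for `sig`, if any.
    fn call(&self, sig: &ToySignature, callee: ToyWasmFn, args: &[i64]) -> Option<i64> {
        let trampoline = *self.typed_trampolines.get(sig)?;
        Some(trampoline(callee, args))
    }
}

/// A real trampoline would also save the entry SP before calling.
fn toy_trampoline(callee: ToyWasmFn, args: &[i64]) -> i64 {
    callee(args)
}

fn toy_add(args: &[i64]) -> i64 {
    args[0] + args[1]
}

fn main() {
    let sig = ToySignature {
        params: vec![ToyValType::I32, ToyValType::I32],
        results: vec![ToyValType::I64],
    };
    let mut store = ToyStore::default();
    store.register(sig.clone(), toy_trampoline);
    println!("{:?}", store.call(&sig, toy_add, &[40, 2]));
}
```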

Anyway that's a long-winded way of saying that we need a few cranelift instructions to modify compile_lowered_trampoline to fix the original issue here. I do not want to lose sight of how complicated our trampoline story is today though. We're already taking a hit to call overhead into and out of wasm as part of #4431 which we have no means of recovering right now, and I think reducing the trampolines in play and focusing more on Cranelift-generated trampolines is the way forward (e.g. inlining two trampolines into one). Otherwise I also think we will need fancier trampolines for other features such as the out-of-band fuel checking (requires a pinned register) and exceptions (which may require before/after stuff in the trampoline instead of just the "before stuff" they do today).

fitzgen commented 2 years ago

You kind of mentioned this above, but to be super explicit: the hard part in my mind is deciding what we want to do when

the cranelift feature is not enabled, so we don't have a JIT at our disposal,
and then the embedder does let f = Func::wrap(...); f.call(...)

In this scenario, there is no already-compiled Wasm module for us to pluck trampolines from, and because we don't have a JIT available, we can't just create the necessary trampolines.

But also, in this scenario we don't actually need any trampolines because there isn't actually any Wasm involved (in #4431, this would show up as an empty contiguous sequence of Wasm frames). So maybe we can somehow relax things a bit (waves hands) to allow skipping the trampolines when both caller and callee are the host?

If one of caller or callee was Wasm, then we would be able to use trampolines from that Wasm. We would just need to figure out how we would lazily connect the trampolines to the Func if the Func::wrap happened before the Wasm module was loaded into the engine.
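One way to picture the hand-wavy idea above: a function object that can be called host-to-host with no trampoline at all, and that has a trampoline attached lazily once some module provides one. This is purely a toy with invented names (ToyFunc, attach_trampoline, call_from_wasm), assuming a fixed i32 -> i32 signature, and is not a proposal for the actual design.

```rust
use std::sync::OnceLock;

/// Stand-in for a host function defined in Rust.
type ToyHostFn = fn(i32) -> i32;
/// Stand-in for a trampoline plucked from a compiled module.
type ToyTrampoline = fn(ToyHostFn, i32) -> i32;

/// Hypothetical function object whose trampoline is filled in lazily.
struct ToyFunc {
    host: ToyHostFn,
    trampoline: OnceLock<ToyTrampoline>,
}

impl ToyFunc {
    /// Create the function before any module (and thus any trampoline) exists.
    fn wrap(host: ToyHostFn) -> Self {
        ToyFunc { host, trampoline: OnceLock::new() }
    }

    /// Host-to-host call: no wasm frames are involved, so skip trampolines.
    fn call_from_host(&self, arg: i32) -> i32 {
        (self.host)(arg)
    }

    /// Called when a module with a matching trampoline is loaded.
    /// A second attachment is ignored in this toy.
    fn attach_trampoline(&self, trampoline: ToyTrampoline) {
        let _ = self.trampoline.set(trampoline);
    }

    /// Wasm-to-host call: only possible once a trampoline is attached.
    fn call_from_wasm(&self, arg: i32) -> Option<i32> {
        let trampoline = *self.trampoline.get()?;
        Some(trampoline(self.host, arg))
    }
}

/// A real exit trampoline would save fp/pc before calling the host.
fn toy_exit_trampoline(host: ToyHostFn, arg: i32) -> i32 {
    host(arg)
}

fn toy_increment(x: i32) -> i32 {
    x + 1
}

fn main() {
    let f = ToyFunc::wrap(toy_increment);
    println!("host call: {}", f.call_from_host(1));
    println!("wasm call before attach: {:?}", f.call_from_wasm(1));
    f.attach_trampoline(toy_exit_trampoline);
    println!("wasm call after attach: {:?}", f.call_from_wasm(1));
}
```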

But yeah, agreed that we should simplify and improve our trampolines story, but this issue was originally supposed to just track support for saving entry SP and exit FP/return pointer for component trampolines at all. Might need to split this into two issues.

alexcrichton commented 2 years ago

Definitely agreed that I went overboard and should split this into a separate issue. While we're here talking about it, though, the other issue we identified was FuncRef::from(Func::wrap(...)): right now a FuncRef is a glorified *mut VMCallerCheckedAnyfunc which is "ready to be called by wasm", and that's not possible to do with a statically available trampoline today since wasm, if it calls the funcref, must call the trampoline, which we won't have until that FuncRef makes its way into a module.

(I know FuncRef isn't really a type in Wasmtime but it's basically that we currently have to be able to get a *mut VMCallerCheckedAnyfunc from a Func at any time which isn't possible if trampolines are required to be in Module images)
