Remove the need for inline assembly trampolines used by Wasmtime

alexcrichton commented 2 years ago

I'm opening this as a loose tracking issue for removing the need to have inline assembly trampolines defined by Wasmtime. Ideally all trampolines necessary could be provided by Cranelift instead of a mixture of what we have today of Rust-defined, inline assembly, and Cranelift-defined trampolines.

Below is a lot of words from https://github.com/bytecodealliance/wasmtime/issues/4535#issuecomment-1197071127 when I first wrote about this:

The stack unwinding in #4431 relies on precisely knowing the stack pointer when we enter WebAssembly along with the frame pointer and last program counter when we exit WebAssembly. This is not generally available in Rust itself so we are relying on handwritten assembly trampolines for these purposes instead.
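
To make the role of these three values concrete, here is a minimal sketch of a frame-pointer walk over a simulated stack. It is not Wasmtime's unwinder: `SimStack`, `walk_frames`, `example_stack`, and the frame layout (saved frame pointer at `fp`, return address at `fp + 1`) are assumptions chosen for illustration.

```rust
/// A simulated stack: each slot is one machine word, indexed by "address".
/// Assumed frame layout (illustrative only): `mem[fp]` holds the caller's
/// frame pointer and `mem[fp + 1]` holds the return address into the caller.
struct SimStack {
    mem: Vec<usize>,
}

/// Walk from the frame that exited wasm towards the entry stack pointer,
/// collecting one program counter per wasm frame.
fn walk_frames(stack: &SimStack, exit_fp: usize, exit_pc: usize, entry_sp: usize) -> Vec<usize> {
    let mut pcs = vec![exit_pc];
    let mut fp = exit_fp;
    loop {
        let caller_fp = stack.mem[fp];
        // The stack grows down, so a caller frame at or above the recorded
        // entry stack pointer belongs to the host, not to wasm.
        if caller_fp >= entry_sp {
            break;
        }
        pcs.push(stack.mem[fp + 1]);
        fp = caller_fp;
    }
    pcs
}

/// Two wasm frames: an outer one at fp = 10 and an inner one at fp = 6.
fn example_stack() -> SimStack {
    let mut mem = vec![0usize; 16];
    mem[10] = 14; // outer frame's saved fp points into the host
    mem[11] = 0xAAA; // return address into the host entry code
    mem[6] = 10; // inner frame's saved fp points at the outer frame
    mem[7] = 0x200; // return address into the outer wasm frame
    SimStack { mem }
}
```

With an entry stack pointer of 12, an exit frame pointer of 6, and an exit pc of `0x300`, the walk yields the pc of the inner frame followed by the pc of the outer frame and stops before the host frame.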

Entry into WebAssembly

Entry into WebAssembly happens via one of two routes:

1. A "typed" route using the wasmtime::TypedFunc API or when invoking a core instance's start function (which has a known fixed signature of no inputs and no outputs). In these cases Rust does an indirect call directly to the Cranelift-generated code for the corresponding wasm function.
2. An "untyped" route which is used by wasmtime::Func::call as well as wasmtime::component::{Func,TypedFunc}::call. In this situation Rust will call a Cranelift-compiled trampoline. The Cranelift trampoline will load arguments from a stack parameter and then make an indirect call to the actual Cranelift-compiled wasm function which is also supplied as an argument.

Today this all records the entry stack pointer via the host_to_wasm_trampoline defined in inline assembly. Concretely Wasmtime will "prepare" an invocation which stores the Cranelift-generated function to call (be it a raw function in case (1) or a trampoline for case (2)) into the VMContext::callee field and then invoke the host_to_wasm_trampoline inline asm symbol.
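
The "prepare, then invoke one shared trampoline" shape described above can be modelled roughly as follows. This is a sketch in plain Rust, not the real inline assembly: `Context`, `wasm_body`, and `enter` are hypothetical stand-ins, the address of a local only approximates the stack pointer, and Rust gives no guarantee that the final call is a tail call.

```rust
/// Hypothetical stand-in for the part of `VMContext` described above.
struct Context {
    /// Function chosen by the "prepare" step; the shared trampoline forwards here.
    callee: fn(&mut Context),
    /// Approximate stack pointer recorded on entry.
    entry_sp: usize,
    /// Written by the callee so the example can observe that it ran.
    result: i32,
}

/// Stand-in for a Cranelift-compiled wasm function (or untyped trampoline).
fn wasm_body(cx: &mut Context) {
    cx.result = 42;
}

/// One trampoline shared by every signature: record the stack pointer,
/// then forward to whatever `callee` was prepared.
#[inline(never)]
fn host_to_wasm_trampoline(cx: &mut Context) {
    let marker = 0u8;
    // The address of a local approximates the current stack pointer.
    cx.entry_sp = &marker as *const u8 as usize;
    let callee = cx.callee;
    callee(cx);
}

/// "Prepare" an invocation by storing the callee, then go through the trampoline.
fn enter(callee: fn(&mut Context)) -> Context {
    let mut cx = Context { callee, entry_sp: 0, result: 0 };
    host_to_wasm_trampoline(&mut cx);
    cx
}
```

Because the trampoline never looks at the callee's arguments, the same code serves every signature, which is the property that forces the real trampoline to tail-call.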

This entry isn't too relevant to the component model since we're already doing what's necessary for the stack unwinding, recording the sp on entry. Nevertheless I want to describe the situation, including some oddities:

The actual trampoline used in (2) to load arguments from the stack is not actually always defined by Cranelift. Instead sometimes it's a monomorphized Rust function host_to_wasm_trampoline from the Func::wrap API. This means we unfortunately cannot rely on Cranelift to supply all these trampolines which means we can't rely on the trampolines to do things that Rust itself can't do.
The entry trampoline currently requires the ability to tail-call to the actual callee. This is a technical limitation due to using the exact same trampoline for every single entry point, regardless of signature.

Ideally we would always enter WebAssembly via a Cranelift-compiled trampoline. That would mean we could do anything in the trampoline that Cranelift would do and ideally remove the need to have inline asm for this. We might still need multiple trampolines for untyped entry points and typed entry points, but overall we should ideally be able to do better here.

Exiting WebAssembly

Exiting back to the host happens in a few locations, and this is the focus of this issue where it's missing support in the component model:

Exiting from core wasm will either end up in something defined by Func::wrap or Func::new (roughly). Both of these use a VMHostFunctionContext which internally has two function pointers. One is the VMCallerCheckedAnyfunc which wasm actually calls and the other is the actual host function pointer defined in Rust being invoked. The function pointer contained within the VMCallerCheckedAnyfunc is a trampoline written in inline assembly which spills the fp/pc combo into VMRuntimeLimits. The function pointer to invoke contained within the VMHostFunctionContext has the "system-v ABI" since it receives arguments in native platform registers. For Func::wrap this is a Rust function and for Func::new this is a Cranelift-generated trampoline which spills arguments to the stack and then calls a static address specified at compile time (using Func::new requires Cranelift at runtime).
Exiting from a component will always exit via a lowered host function. Concretely what happens is that a VMComponentContext has an array lowering_anyfuncs: [VMCallerCheckedAnyfunc; component.num_lowerings]. This array is what core wasm actually calls and is exclusively populated by Cranelift-compiled trampolines (via compile_lowered_trampoline). These trampolines are similar to the Cranelift-compiled trampolines for Func::new but call a host function of type signature VMLoweringCallee. This is where fp/pc are not recorded while we exit wasm. There's no clear way to use the same trick as Func::{wrap,new}, which have a singular inline asm trampoline for all signatures, since the callee to defer to depends on the LoweringIndex.
Finally exiting wasm can also happen via libcalls implemented in Wasmtime. Currently each libcall gets a unique inline-asm-defined trampoline that records the pc/fp combo and then does a direct tail-call to the actual libcall itself.
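
The two-function-pointer arrangement used for core wasm exits can be sketched like this. It is a model only: the struct layouts and field names are assumptions, and the fp/pc values are passed in explicitly because safe Rust cannot read them, which is exactly why the real trampoline is written in assembly.

```rust
use std::cell::Cell;

/// Hypothetical model of the fields the exit trampoline writes to.
#[derive(Default)]
struct RuntimeLimits {
    last_wasm_exit_fp: Cell<usize>,
    last_wasm_exit_pc: Cell<usize>,
}

/// Model of a host-function context holding both function pointers.
struct HostFunctionContext {
    /// What wasm actually calls: records fp/pc, then forwards.
    trampoline: fn(&HostFunctionContext, &RuntimeLimits, usize, usize, i32) -> i32,
    /// The host function that is ultimately invoked.
    host_func: fn(i32) -> i32,
}

/// Stand-in for the shared exit trampoline.
fn exit_trampoline(
    cx: &HostFunctionContext,
    limits: &RuntimeLimits,
    fp: usize,
    pc: usize,
    arg: i32,
) -> i32 {
    limits.last_wasm_exit_fp.set(fp);
    limits.last_wasm_exit_pc.set(pc);
    (cx.host_func)(arg)
}

/// Example host function.
fn double(x: i32) -> i32 {
    x * 2
}

/// Simulates wasm calling out to the host with the given fp/pc.
fn call_from_wasm(limits: &RuntimeLimits, fp: usize, pc: usize, arg: i32) -> i32 {
    let cx = HostFunctionContext { trampoline: exit_trampoline, host_func: double };
    (cx.trampoline)(&cx, limits, fp, pc, arg)
}
```

The sketch makes the cost visible: every exit goes through two indirect calls, one to the trampoline and one to the host function.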

Proposal to fix this issue

Overall I find the current trampoline story pretty complicated and also pretty inefficient. There's typically at least one extra indirect call for all of these transitions and additionally there's very little cache-locality. The fix I'm going to propose here isn't a silver bullet though and will only solve some issues, but I think is still worth pursuing.

I think we should add a few new pseudo-instructions to Cranelift:

Something to get the current frame pointer
Something to get the current stack pointer
Something to get the return address of the current function
Something to get the address of a label in a function (this may already exist, not sure)

With these tools we can start trying to eventually move all of the trampolines above to Cranelift exclusively and remove both Rust-defined and inline-asm defined trampolines:

For components, and this issue, compile_lowered_trampoline could be updated to use the cranelift instructions to record the pc/fp combo into the VMRuntimeLimits. This would remove the need for any extra trampoline when exiting a component and would solve the issue at hand.
For libcalls we could use the cranelift instructions to manually save fp/pc just before a libcall out to the runtime. This would remove all trampolines related to libcalls.
For Func::new the cranelift-generated trampoline could act similar to compile_lowered_trampoline and store the fp/pc combo to VMRuntimeLimits and avoid the need for two trampolines.
Untyped host-to-wasm trampolines could do the sp-saving internally rather than relying on the external trampoline to do so.

Those are at least the easy ones we could knock out with more Cranelift features. Otherwise there are still a number of places that we are requiring trampolines:

Exit trampolines with Func::wrap could ideally be generated by Cranelift but would still require two indirect calls. One call to get to the trampoline from the original core wasm and then a second call from the trampoline to the host function itself. The main problem here is getting a trampoline. Assuming trampolines are provided by Cranelift then they become available at runtime when modules are loaded, which means Func::wrap needs to, at some point, dynamically look up a trampoline and find a corresponding one in a previous module's compiled image. This is not trivial.
Entry trampolines to TypedFunc are similarly somewhat nontrivial, but I think surmountable. Today a Store has a registry of untyped trampolines per-function signature, and I think it could also have a registry of typed trampolines per-function signature. This typed trampoline would then be used to enter wasm instead of today's calling the raw wasm function. In this situation the callee would be passed as an argument to the trampoline in the same manner untyped trampolines receive the callee.
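
A per-signature registry of typed trampolines, where the callee is passed to the trampoline as an argument, might look roughly like the following. Every name here (`ValType`, `Signature`, `TrampolineRegistry`, `forward`, `add`, `registry_demo`) is hypothetical, and arguments are flattened to `i64` purely to keep the sketch short.

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum ValType {
    I32,
    I64,
}

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct Signature {
    params: Vec<ValType>,
    results: Vec<ValType>,
}

/// Stand-in for a raw wasm function.
type Callee = fn(&[i64]) -> i64;
/// A trampoline receives the callee as an argument, as untyped trampolines do.
type Trampoline = fn(Callee, &[i64]) -> i64;

/// Hypothetical registry of typed trampolines, keyed by signature.
#[derive(Default)]
struct TrampolineRegistry {
    typed: HashMap<Signature, Trampoline>,
}

impl TrampolineRegistry {
    fn register(&mut self, sig: Signature, trampoline: Trampoline) {
        self.typed.insert(sig, trampoline);
    }

    /// Enter "wasm" through the trampoline registered for `sig`, if any.
    fn enter(&self, sig: &Signature, callee: Callee, args: &[i64]) -> Option<i64> {
        let trampoline = *self.typed.get(sig)?;
        Some(trampoline(callee, args))
    }
}

/// A real trampoline would also record the entry stack pointer here.
fn forward(callee: Callee, args: &[i64]) -> i64 {
    callee(args)
}

fn add(args: &[i64]) -> i64 {
    args[0] + args[1]
}

fn registry_demo() -> (Option<i64>, Option<i64>) {
    let sig = Signature {
        params: vec![ValType::I64, ValType::I64],
        results: vec![ValType::I64],
    };
    let unregistered = Signature { params: vec![ValType::I32], results: vec![] };
    let mut registry = TrampolineRegistry::default();
    registry.register(sig.clone(), forward);
    (registry.enter(&sig, add, &[2, 3]), registry.enter(&unregistered, add, &[2, 3]))
}
```

The `None` case for an unregistered signature is the same lookup problem that makes Func::wrap hard: the trampoline has to exist somewhere before it can be found.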

cfallin commented 2 years ago

Something to get the current frame pointer

Something to get the current stack pointer

Something to get the return address of the current function

Something to get the address of a label in a function (this may already exist, not sure)

@fitzgen added the first three already in #4573; I'm curious about the last one (address of a label though) as the semantics of it and the implications to the compiler pipeline are a bit unclear to me. Is it like a second function entry, where we assume no register state is valid? Or is it assumed to be something like a longjmp target where we'll have some state valid from some other point in the function, so it's more like a special control-flow edge?

In other words, I can see a primitive defined one of several ways:

Define another block as a second entry-point to the function, and allow getting its address. This breaks all sorts of invariants and assumptions throughout the compiler (no domtree root! func args don't dominate all uses!) and I would strongly push back against it, unless there's a very clear need, then we would need to audit a bunch of code.
Define a "gap in the control flow" primitive of some sort: the user can say "I will eventually transfer control to block B by [some mechanism], and register state will be as-if control came directly from block A"; then it's allowed to get the address of block B and follow that contract. This is more like exceptional edges off of an invoke, in LLVM terms. I would want to model it as a control-flow edge somehow as well.

I'm not sure I fully grok the details of what a trampoline would need in this primitive but can you say more about which of the above fits better?

alexcrichton commented 2 years ago

Ah yeah sure I should expand more on that. The idea for getting the address of a label comes from the desire to remove our libcall trampolines right now. Each of the static set of libcalls has its own custom global_asm! trampoline which saves the fp/pc and then tail-calls to the actual libcall itself. Instead we would ideally save the fp/pc within the wasm function itself just before we enter the libcall, putting the work of saving fp/pc in the caller instead of the callee.

Assuming we do this then getting the current frame pointer is easy enough but for the 'last wasm pc' we actually need the address of the instruction after the call instruction itself. Having a label of sorts was my rough idea to do this because at least instruction-wise I want something like lea %dst, $const(%rip) or something like that to be the lowering. I don't think that this maps well to Cranelift abstractions currently though AFAIK (e.g. we don't really want a control-flow edge or to introduce more basic blocks, just "get the address of the instruction after some future call instruction")

cfallin commented 2 years ago

Ah, I see! So basically what we need is a "what will the return address be for this call instruction" primitive, is that right?

My first instinct would be to have an instruction that refers to the call instruction, but the problem with that is that it's a forward reference. But we could do the opposite and have the call refer to the "get return address" operator that came earlier. This would work fine with MachBuffer and forward emission order; we create the label first, then bind it just after the call. The CLIF would look something like:

    v1 := get_call_return_address
    ...
    v9 := call_and_provide_return_address fn0(v2, v3, v4, ...), v1

and I can see how to feed it through the pipeline without any problems I think. Does that make sense / fill the need?
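
The "create the label first, bind it just after the call" ordering can be illustrated with a toy code buffer. This is a sketch of the general forward-reference fix-up technique, not Cranelift's MachBuffer API: `ToyBuffer`, its methods, and the stand-in opcodes are all hypothetical.

```rust
/// Toy code buffer with forward-referenced labels.
#[derive(Default)]
struct ToyBuffer {
    bytes: Vec<u8>,
    /// Offset each label is bound to, once known.
    labels: Vec<Option<u32>>,
    /// (offset of a 4-byte placeholder, label whose offset belongs there)
    fixups: Vec<(usize, usize)>,
}

impl ToyBuffer {
    fn new_label(&mut self) -> usize {
        self.labels.push(None);
        self.labels.len() - 1
    }

    fn emit(&mut self, code: &[u8]) {
        self.bytes.extend_from_slice(code);
    }

    /// Emit a 4-byte placeholder that will later hold the label's offset.
    fn emit_label_use(&mut self, label: usize) {
        self.fixups.push((self.bytes.len(), label));
        self.bytes.extend_from_slice(&[0; 4]);
    }

    /// Bind the label to the current end of the buffer.
    fn bind_label(&mut self, label: usize) {
        self.labels[label] = Some(self.bytes.len() as u32);
    }

    /// Patch every placeholder now that all labels are bound.
    fn finish(mut self) -> Vec<u8> {
        for (at, label) in std::mem::take(&mut self.fixups) {
            let target = self.labels[label].expect("label used but never bound");
            self.bytes[at..at + 4].copy_from_slice(&target.to_le_bytes());
        }
        self.bytes
    }
}

/// Emits "get return address", a call, and binds the label right after the call.
fn emit_call_with_return_address() -> (Vec<u8>, u32) {
    let mut buf = ToyBuffer::default();
    let ret = buf.new_label();
    buf.emit(&[0x01]); // stand-in opcode: "materialize label address"
    buf.emit_label_use(ret); // plays the role of get_call_return_address
    buf.emit(&[0xE8, 0, 0, 0, 0]); // stand-in 5-byte call instruction
    buf.bind_label(ret); // the return address is the offset just after the call
    let ret_offset = buf.labels[ret].unwrap();
    buf.emit(&[0x90]); // code following the call
    (buf.finish(), ret_offset)
}
```

Because the use is emitted before the label is bound, the buffer only needs a fix-up list and a single forward pass, which matches the emission order described above.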

alexcrichton commented 2 years ago

Yeah that looks perfect!

cfallin commented 2 years ago

So I spun on this for a few hours and stopped here at around ~500 LoC across 25 files... adding a notion of callsite labels turns out to be fairly cross-cutting and complex, though it is doable. With another ~4 hours or so I could push it through. I am a little apprehensive about the complexity; this is definitely not worth it for a one-off "avoid a single trampoline" tradeoff IMHO; but if it gets us efficiency improvements and you think it's important enough, I can definitely pick it back up later.

uweigand commented 2 years ago

Instead we would ideally save the fp/pc within the wasm function itself just before we enter the libcall, putting the work of saving fp/pc in the caller instead of the callee.

Just a thought: does it have to be the exact pc of the call/return site? Wouldn't a pc anywhere in the calling function be sufficient to provide the correct function name in backtraces? (For DWARF CFI unwinding we of course need the exact PC, but we're not doing that anymore ...)

alexcrichton commented 2 years ago

I don't think performance is critical here (at least not yet) so this isn't urgent to implement, but I would personally still like to cut down our reliance on inline assembly, especially for entry/exit trampolines that require a "unityped" trampoline for all function signatures. Requiring these trampolines precludes other possible future features like fancier exception handling things, pinned registers, etc.

does it have to be the exact pc of the call/return site?

While it doesn't have to be 100% precise per-se it also can't just be anywhere in the function. Libcalls can trigger GC operations which need a precise stack map for where we're at in the function, which is the requirement I know of.
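
A small sketch of why an arbitrary pc inside the function is not enough: if stack maps are keyed by the return address of each call site, a lookup with any other pc finds nothing even though it lies within the function. The types and addresses below are hypothetical and do not reflect Wasmtime's actual stack map representation.

```rust
use std::collections::BTreeMap;

/// Hypothetical stack map: which stack slots hold live GC references
/// at one particular return address.
#[derive(Debug, Clone, PartialEq)]
struct StackMap {
    live_ref_slots: Vec<u32>,
}

/// Hypothetical per-function metadata.
struct FunctionInfo {
    start: usize,
    end: usize,
    /// Keyed by the exact return address of each call that may trigger GC.
    stack_maps: BTreeMap<usize, StackMap>,
}

impl FunctionInfo {
    /// Enough to name the function in a backtrace.
    fn contains(&self, pc: usize) -> bool {
        self.start <= pc && pc < self.end
    }

    /// Needed for GC: only an exact call-site return address has a map.
    fn stack_map_at(&self, pc: usize) -> Option<&StackMap> {
        self.stack_maps.get(&pc)
    }
}

fn example_function() -> FunctionInfo {
    let mut stack_maps = BTreeMap::new();
    stack_maps.insert(0x1010, StackMap { live_ref_slots: vec![0, 2] });
    stack_maps.insert(0x1030, StackMap { live_ref_slots: vec![1] });
    FunctionInfo { start: 0x1000, end: 0x1040, stack_maps }
}
```

Here `0x1020` is good enough to identify the function but has no stack map, while `0x1010` resolves to the live slots for that call site.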

akirilov-arm commented 2 years ago

I have a somewhat related question - now that PR #3606 has been merged, on AArch64 we have to be careful whenever return addresses are moved from registers to memory, which is what the current inline assembly trampolines do, and what Cranelift-compiled trampolines would continue doing in the future. However, as far as I can tell the values saved by the trampolines do not influence control flow in the sense that they are only used to produce backtraces. Is that correct? If yes, then there is no need to sign them before storing to memory.

fitzgen commented 2 years ago

They don't influence control for now, but when we get around to implementing the Wasm exceptions proposal, then they will.

fitzgen commented 1 year ago

https://github.com/bytecodealliance/wasmtime/pull/6262 removes most of the hand-written asm trampolines. All that are left after that PR are the wasm-to-libcall trampolines.

alexcrichton commented 7 months ago

Final ones done in https://github.com/bytecodealliance/wasmtime/pull/8152 now, so closing.
