Closed alexcrichton closed 7 months ago
- Something to get the current frame pointer
- Something to get the current stack pointer
- Something to get the return address of the current function
- Something to get the address of a label in a function (this may already exist, not sure)
@fitzgen added the first three already in #4573; I'm curious about the last one (address of a label though) as the semantics of it and the implications to the compiler pipeline are a bit unclear to me. Is it like a second function entry, where we assume no register state is valid? Or is it assumed to be something like a longjmp target where we'll have some state valid from some other point in the function, so it's more like a special control-flow edge?
In other words, I can see a primitive defined one of several ways:
`invoke`, in LLVM terms. I would want to model it as a control-flow edge somehow as well.
I'm not sure I fully grok the details of what a trampoline would need in this primitive but can you say more about which of the above fits better?
Ah yeah sure I should expand more on that. The idea for getting the address of a label comes from the desire to remove our libcall trampolines right now. Each of the static set of libcalls has its own custom `global_asm!` trampoline which saves the fp/pc and then tail-calls to the actual libcall itself. Instead we would ideally save the fp/pc within the wasm function itself just before we enter the libcall, putting the work of saving fp/pc in the caller instead of the callee.
Assuming we do this then getting the current frame pointer is easy enough but for the 'last wasm pc' we actually need the address of the instruction after the call instruction itself. Having a label of sorts was my rough idea to do this because at least instruction-wise I want something like `lea %dst, $const(%rip)` or something like that to be the lowering. I don't think that this maps well to Cranelift abstractions currently though AFAIK (e.g. we don't really want a control-flow edge or to introduce more basic blocks, just "get the address of the instruction after some future call instruction")
Ah, I see! So basically what we need is a "what will the return address be for this call instruction" primitive, is that right?
My first instinct would be to have an instruction that refers to the call instruction, but the problem with that is that it's a forward reference. But we could do the opposite and have the call refer to the "get return address" operator that came earlier. This would work fine with `MachBuffer` and forward emission order; we create the label first, then bind it just after the call. The CLIF would look something like:
v1 := get_call_return_address
...
v9 := call_and_provide_return_address fn0(v2, v3, v4, ...), v1
and I can see how to feed it through the pipeline without any problems I think. Does that make sense / fill the need?
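To make the "create the label first, then bind it just after the call" idea concrete, here is a minimal sketch of a toy code buffer with forward label references. Everything in it (`ToyBuffer`, `emit_label_ref`, `emit_call_with_return_address`, the 5-byte stand-in for a call) is hypothetical and chosen for illustration; it is not Cranelift's actual `MachBuffer` API.

```rust
/// Toy code buffer (hypothetical; not Cranelift's real `MachBuffer` API).
#[derive(Default)]
struct ToyBuffer {
    bytes: Vec<u8>,
    /// Offset each label is bound to, once known.
    labels: Vec<Option<u32>>,
    /// (offset of a 4-byte placeholder, label it refers to).
    fixups: Vec<(usize, usize)>,
}

impl ToyBuffer {
    /// Create a label whose offset is not yet known.
    fn new_label(&mut self) -> usize {
        self.labels.push(None);
        self.labels.len() - 1
    }

    /// Emit a 4-byte placeholder to be patched with the label's offset.
    fn emit_label_ref(&mut self, label: usize) {
        self.fixups.push((self.bytes.len(), label));
        self.bytes.extend_from_slice(&[0; 4]);
    }

    /// Emit opaque instruction bytes (stands in for real encodings).
    fn emit(&mut self, code: &[u8]) {
        self.bytes.extend_from_slice(code);
    }

    /// Bind `label` to the current end of the buffer.
    fn bind_label(&mut self, label: usize) {
        self.labels[label] = Some(self.bytes.len() as u32);
    }

    /// Patch every placeholder; returns `None` if a label was never bound.
    fn finish(mut self) -> Option<Vec<u8>> {
        for &(at, label) in &self.fixups {
            let target = self.labels[label]?;
            self.bytes[at..at + 4].copy_from_slice(&target.to_le_bytes());
        }
        Some(self.bytes)
    }
}

/// Emits: [label ref, 4 bytes] [filler] [call, 5 bytes], with the label
/// bound right after the call. Returns the bytes and the bound offset.
fn emit_call_with_return_address(filler: &[u8]) -> Option<(Vec<u8>, u32)> {
    let mut buf = ToyBuffer::default();
    let ret = buf.new_label(); // roughly: v1 := get_call_return_address
    buf.emit_label_ref(ret);
    buf.emit(filler); // roughly: the "..." between the two instructions
    buf.emit(&[0xE8, 0, 0, 0, 0]); // stand-in bytes for the call itself
    buf.bind_label(ret); // the return address is the offset after the call
    let bound = buf.bytes.len() as u32;
    buf.finish().map(|bytes| (bytes, bound))
}
```

Because the label is created before the call and bound after it, emission stays strictly forward; only the placeholder is patched at the end.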
Yeah that looks perfect!
So I spun on this for a few hours and stopped here at around ~500 LoC across 25 files... adding a notion of callsite labels turns out to be fairly cross-cutting and complex, though it is doable. With another ~4 hours or so I could push it through. I am a little apprehensive about the complexity; this is definitely not worth it for a one-off "avoid a single trampoline" tradeoff IMHO; but if it gets us efficiency improvements and you think it's important enough, I can definitely pick it back up later.
> Instead we would ideally save the fp/pc within the wasm function itself just before we enter the libcall, putting the work of saving fp/pc in the caller instead of the callee.
Just a thought: does it have to be the exact pc of the call/return site? Wouldn't a pc anywhere in the calling function be sufficient to provide the correct function name in backtraces? (For DWARF CFI unwinding we of course need the exact PC, but we're not doing that anymore ...)
I don't think performance is critical here (at least not yet) so this isn't urgent to implement, but I would personally still like to cut down our reliance on inline assembly, especially for entry/exit trampolines that require a "unityped" trampoline for all function signatures. Requiring these trampolines precludes other possible future features like fancier exception handling things, pinned registers, etc.
> does it have to be the exact pc of the call/return site?
While it doesn't have to be 100% precise per se, it also can't just be anywhere in the function. Libcalls can trigger GC operations which need a precise stack map for where we're at in the function, which is the requirement I know of.
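A small sketch of why an arbitrary pc in the function is not enough for GC. It assumes, purely for illustration, that stack maps are stored per call site and keyed by the exact code offset of the call's return address; the types and offsets here (`StackMap`, `FunctionStackMaps`, `0x10`, `0x2c`) are hypothetical and do not reflect Wasmtime's actual data structures.

```rust
use std::collections::BTreeMap;

/// Hypothetical stack map: which frame slots hold live GC references.
#[derive(Debug, Clone, PartialEq)]
struct StackMap {
    live_ref_slots: Vec<u32>,
}

/// Per-function table keyed by the code offset of each call's return address.
struct FunctionStackMaps {
    by_return_offset: BTreeMap<u32, StackMap>,
}

impl FunctionStackMaps {
    /// Exact-match lookup: a pc that is merely "somewhere in the function"
    /// does not identify which references are live.
    fn lookup(&self, pc_offset: u32) -> Option<&StackMap> {
        self.by_return_offset.get(&pc_offset)
    }
}

/// Two call sites in one function, each with different live references.
fn example_maps() -> FunctionStackMaps {
    let mut by_return_offset = BTreeMap::new();
    by_return_offset.insert(0x10, StackMap { live_ref_slots: vec![0] });
    by_return_offset.insert(0x2c, StackMap { live_ref_slots: vec![0, 2] });
    FunctionStackMaps { by_return_offset }
}
```

In this model a saved pc of `0x2c` finds the map for the second call, while a pc of `0x20` (inside the same function, but not a return address) finds nothing.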
I have a somewhat related question - now that PR #3606 has been merged, on AArch64 we have to be careful whenever return addresses are moved from registers to memory, which is what the current inline assembly trampolines do, and what Cranelift-compiled trampolines would continue doing in the future. However, as far as I can tell the values saved by the trampolines do not influence control flow in the sense that they are only used to produce backtraces. Is that correct? If yes, then there is no need to sign them before storing to memory.
They don't influence control for now, but when we get around to implementing the Wasm exceptions proposal, then they will.
https://github.com/bytecodealliance/wasmtime/pull/6262 removes most of the hand-written asm trampolines. All that are left after that PR are the wasm-to-libcall trampolines.
Final ones done in https://github.com/bytecodealliance/wasmtime/pull/8152 now, so closing.
I'm opening this as a loose tracking issue for removing the need to have inline assembly trampolines defined by Wasmtime. Ideally all trampolines necessary could be provided by Cranelift instead of a mixture of what we have today of Rust-defined, inline assembly, and Cranelift-defined trampolines.
Below is a lot of words from https://github.com/bytecodealliance/wasmtime/issues/4535#issuecomment-1197071127 when I first wrote about this:
The stack unwinding in #4431 relies on precisely knowing the stack pointer when we enter WebAssembly along with the frame pointer and last program counter when we exit WebAssembly. This is not generally available in Rust itself so we are relying on handwritten assembly trampolines for these purposes instead.
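To illustrate why those three values are needed, here is a rough sketch of a frame-pointer walk over a toy stack. The frame layout (`[fp]` holds the caller's fp, `[fp + 8]` holds the return address), the addresses, and the names `walk_wasm_frames` and `example_stack` are all assumptions made for this example; this is not Wasmtime's unwinder.

```rust
use std::collections::HashMap;

/// Toy stack memory: address -> 8-byte word (hypothetical model).
type Memory = HashMap<u64, u64>;

/// Walk wasm frames from the exit point back toward the entry point.
/// Assumed layout: `[fp]` = caller's fp, `[fp + 8]` = return address.
/// Returns the pcs found, youngest frame first.
fn walk_wasm_frames(mem: &Memory, exit_fp: u64, exit_pc: u64, entry_sp: u64) -> Vec<u64> {
    let mut pcs = vec![exit_pc];
    let mut fp = exit_fp;
    // The stack grows down, so anything at or above `entry_sp` belongs
    // to the host code that called into wasm.
    while fp < entry_sp {
        let (Some(&caller_fp), Some(&return_pc)) = (mem.get(&fp), mem.get(&(fp + 8))) else {
            break; // malformed toy stack: stop rather than guess
        };
        // Stop at the host boundary, or if the chain fails to move upward.
        if caller_fp >= entry_sp || caller_fp <= fp {
            break;
        }
        pcs.push(return_pc);
        fp = caller_fp;
    }
    pcs
}

/// Host -> wasm A (fp 0x0f00) -> wasm B (fp 0x0e00) -> host.
fn example_stack() -> Memory {
    let mut mem = Memory::new();
    mem.insert(0x0e00, 0x0f00); // B: saved fp of A
    mem.insert(0x0e08, 0x0400); // B: return address into A
    mem.insert(0x0f00, 0x1100); // A: saved fp of the host caller
    mem.insert(0x0f08, 0xdead); // A: return address into the host
    mem
}
```

With `exit_fp = 0x0e00`, `exit_pc = 0x500` and `entry_sp = 0x1000`, the walk yields the pc in B followed by the return address into A, and stops before the host frame.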
Entry into WebAssembly
Entry into WebAssembly happens via one of two routes:
1. The `wasmtime::TypedFunc` API or when invoking a core instance's `start` function (which has a known fixed signature of no inputs and no outputs). In these cases Rust does an indirect call directly to the Cranelift-generated code for the corresponding wasm function.
2. `wasmtime::Func::call` as well as `wasmtime::component::{Func,TypedFunc}::call`. In this situation Rust will call a Cranelift-compiled trampoline. The Cranelift trampoline will load arguments from a stack parameter and then make an indirect call to the actual Cranelift-compiled wasm function which is also supplied as an argument.
Today this all records the entry stack pointer via the `host_to_wasm_trampoline` defined in inline assembly. Concretely Wasmtime will "prepare" an invocation which stores the Cranelift-generated function to call (be it a raw function in case (1) or a trampoline for case (2)) into the `VMContext::callee` field and then invoke the `host_to_wasm_trampoline` inline asm symbol.
This entry isn't too relevant to the component model since we're already doing what's necessary for the stack unwinding, recording the sp on entry. Nevertheless I want to describe the situation, so here are some oddities as well:
- `host_to_wasm_trampoline` from the `Func::wrap` API. This means we unfortunately cannot rely on Cranelift to supply all these trampolines which means we can't rely on the trampolines to do things that Rust itself can't do.
Ideally we would always enter WebAssembly via a Cranelift-compiled trampoline. That would mean we could do anything in the trampoline that Cranelift would do and ideally remove the need to have inline asm for this. We might still need multiple trampolines for untyped entry points and typed entry points, but overall we should ideally be able to do better here.
Exiting WebAssembly
Exiting back to the host happens in a few locations, and this is the focus of this issue where it's missing support in the component model:
- `Func::wrap` or `Func::new` (roughly). Both of these use a `VMHostFunctionContext` which internally has two function pointers. One is the `VMCallerCheckedAnyfunc` which wasm actually calls and the other is the actual host function pointer defined in Rust being invoked. The function pointer contained within the `VMCallerCheckedAnyfunc` is a trampoline written in inline assembly which spills the fp/pc combo into `VMRuntimeLimits`. The function pointer to invoke contained within the `VMHostFunctionContext` has the "system-v ABI" since it receives arguments in native platform registers. For `Func::wrap` this is a Rust function and for `Func::new` this is a Cranelift-generated trampoline which spills arguments to the stack and then calls a static address specified at compile time (using `Func::new` requires Cranelift at runtime).
- `VMComponentContext` has an array `lowering_anyfuncs: [VMCallerCheckedAnyfunc; component.num_lowerings]`. This array is what core wasm actually calls and is exclusively populated by Cranelift-compiled trampolines (via `compile_lowered_trampoline`). These trampolines are similar to the Cranelift-compiled trampolines for `Func::new` but call a host function of type signature `VMLoweringCallee`. This is where fp/pc are not recorded while we exit wasm. There's no clear way to use the same trick as `Func::{wrap,new}` which have a singular inline asm trampoline for all signatures since the callee to defer to depends on the `LoweringIndex`.
Proposal to fix this issue
Overall I find the current trampoline story pretty complicated and also pretty inefficient. There's typically at least one extra indirect call for all of these transitions and additionally there's very little cache-locality. The fix I'm going to propose here isn't a silver bullet though and will only solve some issues, but I think it is still worth pursuing.
I think we should add a few new pseudo-instructions to Cranelift:
With these tools we can start trying to eventually move all of the trampolines above to Cranelift exclusively and remove both Rust-defined and inline-asm defined trampolines:
- `compile_lowered_trampoline` could be updated to use the Cranelift instructions to record the pc/fp combo into the `VMRuntimeLimits`. This would remove the need for any extra trampoline when exiting a component and would solve the issue at hand.
- For `Func::new` the Cranelift-generated trampoline could act similarly to `compile_lowered_trampoline` and store the fp/pc combo to `VMRuntimeLimits` and avoid the need for two trampolines.
Those are at least the easy ones we could knock out with more Cranelift features. Otherwise there are still a number of places that we are requiring trampolines:
- `Func::wrap` could ideally be generated by Cranelift but would still require two indirect calls. One call to get to the trampoline from the original core wasm and then a second call from the trampoline to the host function itself. The main problem here is getting a trampoline. Assuming trampolines are provided by Cranelift then they become available at runtime when modules are loaded, which means `Func::wrap` needs to, at some point, dynamically look up a trampoline and find a corresponding one in a previous module's compiled image. This is not trivial.
- `TypedFunc` are similarly somewhat nontrivial, but I think surmountable. Today a `Store` has a registry of untyped trampolines per-function signature, and I think it could also have a registry of typed trampolines per-function signature. This typed trampoline would then be used to enter wasm instead of today's calling the raw wasm function. In this situation the callee would be passed as an argument to the trampoline in the same manner untyped trampolines receive the callee.
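The per-signature registry idea can be sketched roughly as follows. This is only a toy model under stated assumptions: `Signature`, `TrampolineAddr`, `TrampolineRegistry`, and `register_module` are hypothetical names, and a compiled trampoline is represented by a plain address rather than real code.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum ValType {
    I32,
    I64,
    F32,
    F64,
}

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Signature {
    params: Vec<ValType>,
    results: Vec<ValType>,
}

/// Stand-in for the code address of a compiled trampoline.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct TrampolineAddr(usize);

/// Hypothetical store-level registry: one untyped and one typed
/// trampoline per function signature.
#[derive(Default)]
struct TrampolineRegistry {
    untyped: HashMap<Signature, TrampolineAddr>,
    typed: HashMap<Signature, TrampolineAddr>,
}

impl TrampolineRegistry {
    /// Called when a module is loaded: record the trampolines its compiled
    /// image provides, keeping any that were already registered.
    fn register_module(
        &mut self,
        provided: impl IntoIterator<Item = (Signature, TrampolineAddr, TrampolineAddr)>,
    ) {
        for (sig, untyped, typed) in provided {
            self.untyped.entry(sig.clone()).or_insert(untyped);
            self.typed.entry(sig).or_insert(typed);
        }
    }

    fn untyped_trampoline(&self, sig: &Signature) -> Option<TrampolineAddr> {
        self.untyped.get(sig).copied()
    }

    fn typed_trampoline(&self, sig: &Signature) -> Option<TrampolineAddr> {
        self.typed.get(sig).copied()
    }
}
```

The `None` case is the nontrivial part noted above: a lookup can only succeed once some previously loaded module has supplied a trampoline for that signature.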