bytecodealliance / wasmtime

A fast and secure runtime for WebAssembly
https://wasmtime.dev/
Apache License 2.0
15.4k stars 1.3k forks source link

Investigate use of stack probes and removal of explicit stack-limit checks #8135

Open cfallin opened 8 months ago

cfallin commented 8 months ago

Currently, Wasmtime uses explicit stack-limit checks to avoid overflowing the stack, and these checks are generated in the prologue of every Wasm function. This is largely a consequence of the design decision to run on the host's stack (when not doing async): the two subcases are either Wasm hits the guard page itself (and we don't currently have a mechanism to recover from that, with a separate signal stack) or Wasm uses almost all the stack then calls back into the host and the host then overflows (which looks like a host crash). So the advantage of explicit stack-limit checks is that they retain control while giving us a well-defined path to return a trap, but the cost is that they impose overhead on every function call. (And potentially nontrivial overhead: the limit is several chained loads deep in data structures.)

The alternative idea is to switch to reliance on a guard page, with stack probes in functions that have frames larger than a single guard page (just as with native code today), and handle the two cases above by:

(Not coincidentally, I believe this is also the exact scheme that was used by Lucet!)

alexcrichton commented 8 months ago

Some other bits I've thought of after more thinking:

alexcrichton commented 8 months ago

Also I'm completely wrong about sigaltstack: we are currently installing a sigaltstack already

cfallin commented 8 months ago

Thanks for filling in a ton of details here! A few thoughts:

alexcrichton commented 8 months ago

I've been nerd sniped.


it can generate the same limit-checking just before a given libcall

I'm a bit worried about the complexity of this because libcalls are completely invisible to Wasmtime here. In that sense it's really late in the x64 backend that we realize that a libcall is required and at that point we've lost a lot of context necessary to generate a libcall.

I think the best solution here will be to implement something like https://github.com/bytecodealliance/wasmtime/pull/8152 but for cranelift-generated libcalls as well. That way we can, after compilation, see what libcalls are required and then go generate a small shim for each libcall.

This also handles things like the fp/pc of a wasm module because we can't do that inline but we instead need an actual function frame to record the exit fp/pc (I forget the exact reasons for this but it's why we had to have separate trampolines for builtin functions in wasmtime)

Calls to Wasmtime helpers

Hopefully handled through https://github.com/bytecodealliance/wasmtime/pull/8152 now!

Trap-info precision: we could have a notion of range in addition to exact PC

For <1page functions, I implemented this which basically tests whether the faulting address is within one page of the stack pointer. My hope is that this is precise enough to say "yeah it's a stack overflow" and also easy enough that we don't need metadata on every possible instruction that modifies the stack in Cranelift (e.g. I don't want to have to rearrange the stack args/clobbers/etc).

The main thing that I'm worried about is a situation like:

;; prologue stuff, does not fault
sub rsp, $many_pages
;; start probes
or [rsp + $stack_size_minus_one_page], 0
or [rsp + $stack_size_minus_two_pages], 0
;; ...

where if the first or faults then the faulting address is pretty far away from rsp, perhaps many pages. That I think can be solved though with explicit trap metadata per probe.

I suppose put another way, I'd like to ideally frame the problem as:

All stack overflows caught through the guard page can either be classified as:

alexcrichton commented 8 months ago

(sorry hit comment a bit too soon so I added a little bit more context at the end)

cfallin commented 8 months ago

The main thing that I'm worried about is a situation like: [sub rsp once, many stores with large offsets]

Fortunately we seem to decrement rsp and then store in stages. The comment there cites Valgrind compatibility, which I think specifically means that rsp is decremented first; I don't know if there's a reason we do individual decrements per store on x86, though on aarch64/riscv64 it would allow us to use store-with-immediate-offset insts with small offset fields. In any case, we do seem to already have this "fault is close to actual current SP" property already, which is nice!

cfallin commented 8 months ago

(And likewise the dynamic loop version seems to have the same property)

alexcrichton commented 8 months ago

Oh nice, so then if we're comfortable saying all faults within 1 page of rsp are stack-overflow related faults, I think all we need to do is turn on stack probes and call it a day then?

cfallin commented 8 months ago

You would know better than me but that sounds plausible, given your work on helper calls in the linked PR!

alexcrichton commented 8 months ago

I poked a bit more at this, specifically with respect to cranelift-injected libcalls (e.g. ceilf). These are not as simple as wasmtime builtins because they don't take a *mut VMContext argument (which I forgot about). The best way I could think of for implementing this was:

I got all that wired up and working and AFAIK that covers the cases that at least I could think of where the guard page will now guaranteed be hit in Cranelift-generated code.


Stepping back a bit though, I thought a little more about this at a higher level. I ran sightglass's default suite and it showed a 3-4% improvement in execution on spidermonkey.wasm, but no changes elsewhere. I'm not sure if it would show a larger improvement in the situations that you're looking at @cfallin though if they're call-heavy benchmarks.

I'm a bit worried about the complexity of this change. Lots about Cranelift/Wasmtime are striking the right balance between simplicity and performance, and I don't think there's any silver bullet for handling stack overflow in wasm. The stack limit check is quite simple but not exactly speedy. Using a guard page feels pretty complex and subtle because we need a guarantee that only Cranelift-generated code hits the guard page, never native code. I've done my best to enumerate all the possible ways that I personally know of we need to handle but I have little confidence that my knowledge is exhaustive, for example I haven't analyzed components yet to see if they need updates too.

More-or-less I think we should probably have more compelling performance data motivating this change before pursuing it further. If 3-4% breaks the bank then that seems ok to pursue, but in abstract 3-4% feels pretty small and perhaps not worth the complexity. If there's a 40% improvement in a different benchmark though that's of course more compelling.

cfallin commented 8 months ago

@alexcrichton thanks a bunch for pushing this further! I agree it's best to do this only if it actually matters. In my own work I can at least test your branch if it's available, once I get ICs (lots of tiny function calls) wired up properly, and give more data; it'd have to be worth 1.5x-2x or so in my mind.

alexcrichton commented 8 months ago

This is the (messy) branch with various bits of work, but for performance testing I'd just comment out this line since the test case probably doesn't rely on catching stack overflow anyway.