Improve wasm handling for stack overflow conditions

alexcrichton commented 4 years ago

Currently wasmtime handles stack overflow in wasm code through the use of the segfault signal handler, relying on wasm code to hit the guard page to trigger a segfault which we the longjmp to recover from. Unfortunately, though, this has a consequence where it doesn't work well for "almost stack overflowed" wasm code which calls native imports.

For example let's say that a wasm module has an extremely deep call stack, but then it calls an imported function where then the imported function hits the guard page. This will trigger a segfault, and we still want this to be a somewhat recoverable situation! In this scenario though it's not actually save to longjmp over a bunch of Rust frames since it can skip some critical destructors.

I think in general we'll want to update wasmtime's handling of stack overflow in WebAssembly code to not rely on segfaults. In talking with some Spidermonkey folks, I think we'll want to switch to a scheme that looks something like:

We "reserve" a fixed amount of stack space (in bytes) for wasm. Not literally on the side, but we say that the wasm stack can't ever be larger than "N" bytes.
Each wasm function has a prelude check against "where is the current stack", and manually aborts if it's about to overflow. This may require a form of a dedicated register for storing the latest stack limit.
Then we'll limit segfault handling only to out of bounds loads/stores in wasm code, and not to stack overflow.

This should help retain all our current speed (with sufficient tweaks/optimization) but also allow us to guarantee that native code imported by wasm always has a reasonable amount of stack space. This means that stack overflow in native code would never be recovered from (it'd abort the process as typical Rust programs do).

pepyakin commented 4 years ago

But wouldn't such a prelude check, even if correctly predicted, hinder the performance? Did you consider an option with a dedicated stack a-lá stacker? One option is to run wasm code on a separate stack and switch it back to the native stack upon entering a host function. That would make host calls slower, but OTOH will remove overhead from each function's prelude.

alexcrichton commented 4 years ago

I think this is something we'd have to measure, but I'd imagine that if Spidermonkey, what I believe is widely accepted as the fastest wasm engine, takes this strategy then we should be fine.

iximeow commented 4 years ago

This is very similar to an issue we've been looking at addressing in lucet, where the "almost stack overflowed" case has similar fatal end results for us. The approach we've been most seriously considering looks much like this, though we're planning on keeping a guard page and only checking against some reserved stack space limit at the point we call into native code. This avoids needing instrumentation on each internal call, instead trading for a memory access, to get the stack limit from VMCtx.

In the lucet case, we'd like to avoid reserving an extra native thread(ish) of stack unconditionally, so we might tweak this approach a bit more. I've recently suggested that we reframe the approach as having an amount of stack space reserved for native calls. Then, in the case that the wasm stack doesn't have this much space available, we allocate a NATIVE_RESERVED_STACK_SIZE chunk of memory and do a native call there instead. In the happy path, we can do a native call on the same stack with the overhead of a memory access and compare. In the less happy path, we can try to proceed instead of aborting the program, assuming the system itself has memory available.

One concern we've run into is the case of mutual recursion between wasm code and native code. If this is a concern for wasmtime too? If it is, it sounds like you'd handle this by using a new stack limit when calling back into wasm code, so that makes sense.

I assume a stack overflow in native code would then be considered a native code error by virtue of not fitting within the reserved stack space? It looks like those would be program-ending faults still. One of the reasons we want lucet-runtime to continue handing segfaults in the stack guard page is to try isolating these errors, but if we end up needing to unwind native call frames to run destructors we might use more stack leading to an overflow when trying to unwind from an overflow...

alexcrichton commented 4 years ago

@iximeow that sounds like a strategy that would work for this as well. I'm always iffy myself on tracking stack sizes, but the general idea is the same where whenever native code is running we try to provide a guarantee there's a "reasonable amount of stack space". I figured though that if this limit was ever blown we'd just return a wasm stack overflow trap rather than trying to allocate more native stack at the last second to keep running code.

I think we'd definitely want to handle mutually recursive wasm/native code, and my thinking there at lesat is that you might get some weird errors like shallow wasm invocations triggering stack overflow b/c a previous wasm module recursed so much, but as long as everything has pretty reasonable limits I think it'll probably turn out roughly ok.

I do agree yeah native stack overflow is still going to be a program-ending fault, and I'm mostly uncomfortable running sigsegv signals on the sigaltstack which is frequently too small, and if we didn't do that there'd be no way to have libstd's nice "you overflowed the stack" message as well.

alexcrichton commented 4 years ago

For posterity here's a test which shows the bug we have today, and how we're "catching" segfaults in host code.

bytecodealliance / wasmtime

Improve wasm handling for stack overflow conditions #900