Slacked fuel metering - Githubissues

bytecodealliance / wasmtime

A fast and secure runtime for WebAssembly

https://wasmtime.dev/

Apache License 2.0

15.4k stars 1.3k forks source link

Slacked fuel metering #4109

Open pepyakin opened 2 years ago

pepyakin commented 2 years ago

wasmtime right now has the fuel mechanism. It allows precise control of how many instructions are executed before suspending execution, at least at a basic block granularity. The price is a rather significant performance hit.

I believe we can do better in case the user only wants to know if the execution exceeded some fuel limit. The cost to pay is that condition is detected with a delay, hence slacked. Therefore, this works similarly to the fuel metering mechanism only if there are no side effects when the execution reaches its deadline. For instance, all changes performed are rolled back if a DB transaction exceeds its fuel allowance.

The idea is to take the existing fuel mechanism, remove the compare & conditional jumps, and check the out-of-fuel condition asynchronously.

The tricky part is the asynchronous fuel variable checking and suspending the execution.

We could leverage HotSpot-style safepoints to implement this. Such a safepoint could be implemented just as access to a page. Suspending the execution could be triggered by mprotect-ing the page. Similarly to epoch interruption, we could place safepoints when entering a function and loop backedges. We would need to extract the fuel counter while handling the page fault signal. We assume that the codegen thoughtfully left us a mapping where the fuel counter is saved: either a register or an offset for a spilled variable. Having recovered the fuel counter, we'd check if it exceeded the limit and yield if so.

I realized that this is not trivial to benchmark. I tried removing fuel_check, but that would lead to DCE of the now unused fuel value. I would appreciate suggestions on how to benchmark that properly.

Does this sound sensible? Am I missing something obvious?
Is this something that could be considered to be accepted upstream?

alexcrichton commented 2 years ago

One of the primary reasons for fuel as oppose to epochs is due to the deterministic nature of fuel, which I think is why a scheme like you've proposed where with asynchronous checks wouldn't entirely supplant fuel but would perhaps be another method of limiting wasm execution in Wasmtime. In some sense though what it seems like you're looking for is a hybrid of epochs (asynchronous cancellation, low overhead) and fuel (the measure of cancellation being how much code has run). I wonder if that might make more sense? For example if you could read (but not write) fuel from another thread then you could increment the epoch when fuel runs out or something like that.

In terms of the cost of fuel, though, I believe the main cost associated with it is not the checks for out-of-fuel but rather the management of the fuel counter itself which has lots of loads and, critically, stores to memory. The repeated memory traffic ends up being quite costly as oppose to epochs which is simply always reading the same memory location which should be more efficient (and from our benchmarking the efficiency of epochs is much better).

I suppose that I can put this a little differently in that if the fuel check was moved asynchronously I'm not sure it would buy much. If the check itself is moved externally then that loses the determinism of fuel, and I also think that the main cost associated with fuel will still be present, namely the management of the fuel counter itself.

cfallin commented 2 years ago

There does seem to be some utility to a "count the ops but don't interrupt" configuration for fuel, in that some users may be interested in basically a perf counter/stat but don't need to set limits. I agree that the removal of the control flow will only remove a part of the overhead for fuel; but it should significantly shrink code size, which is something.

I guess mostly I'd love to see measurements: @pepyakin have you experimented with this already? If not, would you be willing to hack something together? It shouldn't be too complex: basically we want to return from a codegen helper here (just after fuel_increment_var, before we create the blocks and add the compare/branch).

cfallin commented 2 years ago

(For clarity, if this shows a sufficient benefit for your use-case, then I agree with Alex's suggested hybrid approach above: the way to go is to make it a config option to count fuel without interrupts, and then use epoch interruption to actually halt execution asynchronously.)

pepyakin commented 2 years ago

In some sense though what it seems like you're looking for is a hybrid of epochs (asynchronous cancellation, low overhead) and fuel (the measure of cancellation being how much code has run). I wonder if that might make more sense? For example if you could read (but not write) fuel from another thread then you could increment the epoch when fuel runs out or something like that.

Yes, this is similar to what I was looking for.

Although I would still clarify,

I am interested in if a given program finished executing within a given number of steps.
If it was not able to finish, I am not interested in the results.
I am also ok with allowing some time slack until the detection that the program exceeded the number of steps.

Given point 2 this can be deterministic: ignoring timing, multiple executions of the same program with the same inputs will lead to the same outcome.

I was also wondering about the same construction that you propose. I assumed that that approach would require memory writes on each loop header & function call and those have to be atomic, and I assumed that would be even more overhead.

In terms of the cost of fuel, though, I believe the main cost associated with it is not the checks for out-of-fuel but rather the management of the fuel counter itself which has lots of loads and, critically, stores to memory. The repeated memory traffic ends up being quite costly as oppose to epochs which is simply always reading the same memory location which should be more efficient (and from our benchmarking the efficiency of epochs is much better).

I might have explained myself poorly.

All this faffing around with safepoints I proposed was in the hope, that we can avoid not only compare&branch but the memory writing as well. It is avoiding, not eliminating though since the fuel counter still can be spilled.

I guess mostly I'd love to see measurements: @pepyakin have you experimented with this already? If not, would you be willing to hack something together? It shouldn't be too complex: basically we want to return from a codegen helper here (just after fuel_increment_var, before we create the blocks and add the compare/branch).

So all of the above are just uneducated guesses, turns out. I've tried to make a benchmark but then I realized that did a mickey mouse mistake, the one about DCE I mentioned in OP.

I'm too will love to see measurements and willing to hack. I am not sure what would be the best way to do that though.

cfallin commented 2 years ago

So all of the above are just uneducated guesses, turns out. I've tried to make a benchmark but then I realized that did a mickey mouse mistake, the one about DCE I mentioned in OP.

Ah, sorry, I had missed that note; but it's a little surprising given that here we call fuel_save_from_var to store the value back to memory, and this is invoked at every function exit. Stores shouldn't be DCE'd (since they are side-effecting). Can you dump the CLIF (RUST_LOG=trace should show it buried among a bunch of other output) and see what it looks like just after wasm translation?

alexcrichton commented 2 years ago

I assumed that that approach would require memory writes on each loop header & function call and those have to be atomic, and I assumed that would be even more overhead.

With epochs the increment of the epoch is an atomic write but all the reads are "just" atomic reads which in practice ends up being quite fast. The instrumented code itself doesn't do any atomic writes anywhere (same for fuel, not atomic writes anywhere)

Given your clarification (thanks!) it does sort of sound like you want fuel + epochs without the actual interruption of fuel though, just the accounting. This may also need to be specialized a bit in codegen to make things a bit more optimal but otherwise seems pretty close?

pepyakin commented 2 years ago

Ah, sorry, I had missed that note; but it's a little surprising given that here we call fuel_save_from_var to store the value back to memory, and this is invoked at every function exit. Stores shouldn't be DCE'd (since they are side-effecting). Can you dump the CLIF (RUST_LOG=trace should show it buried among a bunch of other output) and see what it looks like just after wasm translation?

Was a mistake on my part, I was doing something else. Modifying fuel_check skipping everything after fuel_increment_var as you proposed works.

So the results with my microbenchmarks are:

fuel overhead with interrupts 24-34%, fuel overhead without interrupts 10-25%

which is something, but ideally we lower it even more.

This may also need to be specialized a bit in codegen to make things a bit more optimal but otherwise seems pretty close?

Could you elaborate on this bit?

pepyakin commented 2 years ago

FWIW, enabling epoch interrupts and consuming fuel (but without fuel interrupts) gives 28-40% overhead.

cfallin commented 2 years ago

ideally we lower it even more.

So, some fairly complex ideas incoming: there are at least research-level ideas floating around (I can't find any papers at the moment unfortunately but I know I've seen it before) on how to minimally/optimally record the trace (path taken through a program). Counting ops is equivalent to this problem so I think the same ideas apply. It goes something like: choose a set of "profiling points" (here, points at which you increment the counter), such that every branch outcome can be deduced from the points that are reached.

The simple/intuitive cases are loops (one profiling point in the header) and if/then diamonds (one profiling point in each branch) and in fact that's what we do today, rather than a profiling point in every block. But it actually starts to get interesting even with the if/else: we really only need a profiling point in one branch of the diamond. E.g., if the if-path takes 10 ops, and the else-path takes 15 ops, then we unconditionally add 10 (probably as part of the profiling point at the top of the function or enclosing loop), and add 5 more only on the else-path. Bonus points if we choose the rare path for the additional increment.

The general answer to this problem I think will have something to do with dom/postdom relations -- as a first pass, can we merge a block's profiling point (fuel increment amount) into the increment of any block it postdominates? And then there might be another transform where we draw a "postdom cut" across N blocks relative to some block, and hoist some common factor k out of all of those blocks (i.e., if anywhere we go will eventually increment by at least 10, let's hoist that). It has to be a little more subtle than that because loop bodies can postdom the post-loop tail, so we really need some sort of "1-to-1 execution count" relation. But I'd have to sit down and think more (or just find the papers on this!).

cfallin commented 2 years ago

Ah, one more idea that just came to mind: we could potentially remove a lot of the memory-traffic overhead of fuel counting by defining a new ABI with a "pinned fuel counter". Then it only needs to be loaded from the vmctx in host-to-wasm trampolines, and we could "spill" to vmctx at callsites to non-pinned-fuel-counter ABIs (direct calls to imported host functions for example).

pepyakin commented 2 years ago

So, some fairly complex ideas incoming...

Oh, right, something like this also crossed my mind.

It looks that the ABI idea is easier to hack though. You don't mean literal new ABI (as in a new variant in CallConv), since it seems to me that you can get away with changes only in crates/cranelift though, right?

cfallin commented 2 years ago

It looks that the ABI idea is easier to hack though. You don't mean literal new ABI (as in a new variant in CallConv), since it seems to me that you can get away with changes only in crates/cranelift though, right?

We could do it that way; but passing the value then has to be explicit (argument into every function, and extra return value from every function). Also slightly less efficient since those would be different registers, so some extra moves.

I had been thinking a flag on the Cranelift options that means "reserve a register that is globally allocated for this purpose, and provide CLIF ops to get and set it", but actually we already have that: I realized that we aren't using the "pinned reg" for anything in Wasmtime's wasm-to-CLIF lowering right now, so we could set that option and then use get_pinned_reg / set_pinned_reg to "load" and "store" the fuel count.

We'd still need to transfer it to and from the vmctx when transitioning from host to Wasm code (in the trampoline) and from Wasm to host code (this one is a little trickier).

pepyakin commented 2 years ago

Well, I went ahead and implemented the explicit value passing. It's messy and now I stumbled upon some problem in the wasm translation code. I suspect there is some hidden dependency on the fact that there are only two hidden params. I am debugging it right now. The trampoline code is already in place though.

Regarding get_pinned_reg/set_pinned_reg, that's funny, I looked at it but thought it's not worth of exploration thinking that it's a remnant of the old style backend. If that works, that means cranelift always has a reserved register that is not used for anything then?

UPD: Nvm, I see there is enable_pinned_reg.

alexcrichton commented 2 years ago

IIRC there's a fair bit of open-coding about our two implicit VMContext parameters. I think it would be great to reduce this open-coding if we can, though. But if you're hitting issues with our current two hidden parameters that's probably what's happening.

Otherwise though one important bit about ABI details and pinned registers is that right now Wasmtime's API relies on the fact that trampolines can be defined natively in Rust. This is tested currently and will break if there's ABI details Rust can't uphold (like preserving the pinned register). Overall I personally feel this is a mis-feature of Wasmtime as the only thing it buys us is the ability to Func::call a native function wrapped up via a Func, which no one really wants to do. Instead I think we should use the trampoline registration system that is already in place for modules and just simply require that lookup succeeds (e.g. documenting that Func::call in a store with no modules will panic). That would mean that we no longer have to define trampolines in Rust (and that host_trampoline function could be deleted) and all trampolines would be assembled by Cranelift. We'd still have to define a host ABI for the trampolines to talk to Rust but having that transition point where Cranelift is always in control seems prudent.

This latter point though of refactoring how trampolines-to-imported-host-functions work would require some relatively deep refactorings. I think something about the VMContext would have to change to support host trampolines, but I'm not sure precisely what having not designed it yet.

Basically I just wanted to mention that dealing with a pinned register and/or non-System-V ABIs I think is a good idea and something we should enable, but I don't believe it can trivially be done so today without more significant refactorings of what a trampoline looks like when wasm enters the host. (and enabling this all would probably best be done as a separate PR)

pepyakin commented 2 years ago

Hacking on the explicit passing of the fuel counter without storing it in the memory, I bumped on a bug. It manifests with fuel metering and the unreachable instruction. Basically, the transfer of control out of there means the fuel counter should be flushed just before the trap. That's easy enough.

However, other traps are tricky: memory accesses (at least with the eliminated bounds checked / mprotect) and stack overflows. If we were to flush the fuel counter, it would have to be done pretty much everywhere. I think it's not impossible that it would fare worse than vanilla fuel metering. Am I missing something?

The pinned fuel counter approach proposed by @cfallin might work in this case. I assume traps always transfer control through a signal handler. We could extract the contents of the pinned register from the last frame before the trap from within the signal handler and then flush it to VMRuntimeLimits.

I read from @alexcrichton's post that the codebase is not ready for this change yet. However, I am trying to get a sense of potential wins here, not a final solution. Do you folks see any roadblocks to hacking the solution here? If not, I might try to tackle that approach then.

alexcrichton commented 2 years ago

I think reading the fuel in the signal handler and writing it there instead of in the jit code would be a good approach. While getting this all 100% correct will be difficult I suspect you could "just implement it" today and get some good perf numbers for what the final solution would look like performance-wise.

pepyakin commented 2 years ago

I've hacked a version that uses the pinned_reg approach. I did not bother implementing extracting fuel in the signal handler, though. I just checked that it is possible, and it seemed positive.

The results are mixed: one benchmark suffers from a slowdown. I haven't isolated the root cause.

However, I ran an extra experiment. Basically, I just omitted calls to fuel_check.

Similar to the one I described in OP, the idea is to periodically send a signal to interrupt the thread to check the counter value. Except, it is way simpler,

No stack maps are needed since it's always the same pinned register.
No safepoints are needed. That was just dumb. If the counter is positive, I assume it would be possible to jump out from the signal handler back to the user code, reporting an OOG trap.

According to my measurements, the overhead is around ≈5% relative to no fuel metering. This result is promising.

Do you think this approach is something that could be upstreamed?

alexcrichton commented 2 years ago

That sounds pretty reasonable to me, although I think there's probably trickiness to work out along the lines of a signal being sent to a thread that isn't currently executing wasm, e.g. it's in the middle of an import or something like that. In that situation is it something that's ok to unconditionally ignore? Or should the signal get "deferred" to get handled once wasm is reentered?

Also to make sure I understand, the 5% overhead is basically incrementing a pinned register which is fuel consumed, right? And presumably one less register in register allocation as well, too.

pepyakin commented 2 years ago

I assumed we could just check IS_WASM_PC for the IP while handling a signal. If the control is not inside of the wasm then we ignore it. It should be enough at least for our use cases. I thought that would be acceptable since I am spilling the fuel counter in trampolines in both directions. The OOG check can be performed in the trampoline or asynchronously as well. I thought that would be enough. Am I missing something?

Also to make sure I understand, the 5% overhead is basically incrementing a pinned register which is fuel consumed, right? And presumably one less register in register allocation as well, too.

Yes, that's my understanding.

alexcrichton commented 2 years ago

Ok yeah that makes sense to me at a high level at least. That would make cross wasm/host calls a bit slower with all the trampolines but that seems reasonable and isn't the end of the world. (e.g. on-part with other slowdowns and seems like a good tradeoff of complexity)

So again though to make sure I understand everything, the changes would be:

Fuel becomes a pinned register
Fuel can be configured rely on an asynchronous signal for detection of out-of-fuel rather than today's synchronous detection
Wasm-to-host trampolines need to be added to Wasmtime. These trampolines will perform a synchronous fuel check and flush the counter to the VMContext.
Host-to-wasm trampolines would be updated to check the fuel counter and then insert it into the pinned register.
Wasmtime would install a signal handler (for which signal?) for the asynchronous notification to check fuel. If the signal interrupted wasm code itself we check the pinned register against the limit (probably pre-configured in the TLS state we maintain or something like that?). On out-of-fuel this raises a trap (how does this happen? Is this an "async trap" or do we need a "landing pad" for each function of where to raise a fuel trap?)
We'd leave things up to the embedder to send the signal itself (how does this happen? Do we keep track of the current thread within some Store<T> state? Do we rely on the user to get the target thread right? Can we do this portably?)

Hm so actually writing this out, there's actually a fair amount of design space here. I think it would be good to write up something describing the new fuel system first perhaps to review before writing all the code? I don't think this needs an RFC per-se but something at that level of technical detail would be useful to help judge how best to fit this within Wasmtime. For example we could leave the signal handling and signal sending entirely up to the embedder and simply document that a specific register has the fuel amount (plus maybe adding public signal-safe methods to test whether something is a known wasm program counter)

pepyakin commented 2 years ago

To be fair, right now, I am only at the proof-of-concept stage. Probably it will take a couple of iterations to come up with the final design. That said, I was leaning towards something that you described in the end. In fact, I already have a prototype of that locally.

To kick off the discussion, here is my vision of the minimal slacked fuel metering:

Add a new configuration option slacked_fuel_metering(bool). That option can only be enabled together with consume_fuel. It is also saved in tunables so that it's accessible in func_environ.
If consume_fuel is enabled, we also set cranelift flag enabled_pinned_reg to true
In func_environ, we would stop relying on cranelift_frontend SSA def-use var for self.fuel_var. Instead, use_var becomes get_pinned_reg(I64) and def_var becomes set_pinned_reg.
In func_environ, if slacked_fuel_metering enabled then we skip fuel_load_into_var, fuel_save_from_var and fuel_check.
In compiler.rs, we change the trampoline generation for host-to-wasm and wasm-to-host[^1]. In both cases, we surround the code that performs the call with code that flushes/reloads the fuel counter. This code is predicated on the slacked_fuel_metering[^2] option. The flush/reload code snippets are very similar to those found in func_environ[^3], with the addition that a fuel check is made before entering or after leaving the wasm code.
Note that the previous item promotes trampolines from just argument adapters to dealing with the fuel counter. We will have to use the trampolines in some places where they are not needed before. One of such places is calling the start function. It's easily solved, though: just lookup_trampoline, and that's it. As it was mentioned, we will have to fix/remove the host_trampoline.
We introduce the wasmtime_runtime::is_wasm_pc function, which returns true if the given PC points to any wasm code we generated.[^4]

That should give a minimal toolset to allow the embedder to implement slacked fuel metering. One possible instantiation for Linux x86_64:

In the setup phase, the embedder installs a signal handler for SIGUSR1. The flags should include SA_SIGINFO.
The embedder saves the current TID before entering the wasm code and spawns a thread that periodically (say, 10-100 Hz) calls kill with the saved TID and the SIGUSR1 signal.
The embedder extracts RIP & R15 (pinned fuel reg) in the signal handler and makes the wasmtime_runtime::is_wasm_pc query. If it returns false, the signal handler returns. Then, if r15 ≥ 0, the signal handler calls a trap raising routine (I used raise_lib_trap since it does not require allocation, which is not signal-safe).

Here are some questions.

The embedder example above does not account for the fuel_adj component. I am not sure how it gets accounted for in the compiled code. It just seems that the maximum fuel available is i64::max_value() rather than u64::max_value. Anyway, in case I just don't understand how it works and we want to fix that for that example, we should be able to reach out to the store from the signal handler to account for that properly. I am not sure how to do that cleanly.
There is a potential race between a trap and the detection of OOG. For example, there's a function with unreachable in it. The fuel counter is increased just before ud2. If the async signal is triggered with RIP pointing just after the add but ud2, that is reported as OOG. Otherwise, it's a trap. If we care to distinguish between those, then traphandlers should check the fuel pinned register first. If there's an overflow, it should be reported as OOG and not the original trap reason (in this case, explicit unreachable).
You mentioned the async trap and landing pads. I don't know what you mean by an async trap. I also don't understand what you meant by 'do we need a "landing pad" for each function of where to raise a fuel trap'. Potentially, the async fuel check signal can arrive anywhere in the wasm code.

[^1]: You seem to assume in your previous message that wasm-to-host trampolines do not exist, but they do. [^2]: That would require us to pass tunables into the functions that generate trampolines. [^3]: The complication is we need to pass the VMOffsets there to know how to get to the VMContext->VMRuntimeLimits→consumed_fuel. [^4]: The existing private IS_WASM_PC returns true if the code points to an instruction that can produce a trap.

alexcrichton commented 2 years ago

One of my main worries is from an engine perspective I don't think we want to tie our hands too much with respect to whatever the initial implementation looks like. In what you're proposing there's a fair bit of open-coding around things like:

Free-functions to test whether something is a wasm pc or not. In general embedders don't have access to crates like wasmtime_runtime and instead we want to expose everything from the wasmtime crate. Additionally where possible we ideally want to scope things at least on an Engine to have access to configuration settings and other necessary "global" state or such.
The pinned register is r15 and that's all you need for fuel. I think we want to get here from a technical implementation perspective but committing to this for a long time could be difficult. As you mentioned we may need to reach into the VMContext which we currently can't do from a signal handler. For now I think we should design fuel to not need to reach into the VMContext but leaving open the possibility of something like two pinned registers in the future I think would be good.
We can raise a trap without any extra context via the raise_lib_trap. IIRC on macOS and maybe other platforms from a signal handler to raise a trap we have to actually return from the signal handler to raise the trap (this bit of code). I realize macOS primarily does mach port stuff but in unix signal mode we can't actually longjmp immediately to get out of the signal handler. Also as a small aside I don't think you need to worry about async-signal safety here. That's all handled in the "is this a wasm pc" test, where once that returns true we know it's safe to call malloc and things. (if "is this wasm" returns false, though, then everything up to and after must be async-signal-safe though)

(etc)

I think ideally we want to design things to be as flexible for the engine as we can while still satisfying your use case. Along those lines I think something like this may help:

impl Engine {
    unsafe fn maybe_handle_out_of_gas(&self, info: *mut libc::siginfo_t) -> OutOfGas;
}

enum OutOfGas {
    NotInWasm,
    SigInfoUpdated,
    // Unwound,
}

The rough idea here being that we try to leave details like exact registers and exact "jump out of here to generate a trap" strategy to Wasmtime as much as possible. This function would return SigInfoUpdated on macOS for example and would directly longjmp out on Unix (or other applicable platforms). We could then also have a function that takes a *mut CONTEXT argument for Windows or something like that.

To give thoughts on your questions:

The embedder example above does not account for the fuel_adj component

I forget what this is used for, but if we can I think we should ensure that the wasm code simply needs to look at the pinned register to determine if it's out of fuel (same for the signal handler). If that's not currently the case then I think we should refactor a bit to make this applicable (e.g. adjust the fuel handling in the trampolines or similar)

There is a potential race between a trap and the detection of OOG

I think you have a good idea here of checking for out-of-gas in the trap handler code. That seems reasonable to me to ignore the original trap and switch it to out-of-gas to try to make that deterministic (as fuel is supposed to be deterministic after all).

The caveat to this is that while host-level side effects are guaranteed to be deterministic (as we always check for fuel in trampolines when we come back to the host) the wasm-level side effects aren't guaranteed to be deterministic. For example memory modifications are going to be nondeterministic in the face of fuel since we can't precisely stop the wasm. This seems like a reasonable tradeoff to me, though, and just something to mention in the slacked fuel configuration documentation. Basically 100% determinism is fuel, and only host-side-effect determinism is guaranteed with slacked fuel. (or something like that)

You mentioned the async trap and landing pads

Ah sure, let me expand a bit. When I say "async trap" I mean the idea of an asynchronous signal coming into wasm and raising a trap at any time. Right now wasm can only raise traps at defined points but with slacked fuel any wasm pc could be the origin source of a trap. Raising a trap from any pc in a module from a signal handler isn't trivial to do and will be tricky to get right across platforms. (I think even just getting it right on only Linux is like 90% of the trickiness)

When I say "landing pad" one scheme for implementing this is that when out-of-gas is detected then the running context of the thread is updated in-place. For example if all wasm functions had a ud2 instruction at the end that was registered as the "out of gas trap area" then we would update RIP to point to this ud2 instruction and then resume the thread. The thread would then immediately fault and trigger the normal trap logic. Having a landing pad I think can be advantageous because it would support Windows (which doesn't have asynchronous signals but you can suspend/resume a thread) and possibly macOS better too (since no sort of fake stack frame needs to be created). Even on Linux it would help centralize all the unwinding logic into the existing trap handling function instead of adding a new entry point of this custom handler.

That being said I'm not sure that this is the best implementation strategy. Our DWARF-based backtracing right now probably won't work super well with this which can hurt debuggability relative to raising a trap in the SIGUSR1 handler, for example. In general I'm personally quite paranoid of signals and want to do as little as possible in signal handlers, centralizing all the complications in one place if possible instead of having multiple handlers do the same thing (especially if these handlers are managed by the embedder rather than Wasmtime). My hesitation is that this is implementation-wise complex relative to what you've probably already got working.

alexcrichton commented 2 years ago

Also, to clarify, I don't think you need to implement support for anything other than your platform of interest. Despite this though I think it's still worthwhile to consider other platforms during the design process so that if and when someone comes a long and is interested in getting this feature working for a new platform we have a way for them to do it which doesn't involve rewriting too much. I don't want to get lost in the weeds of theoreticals of other platforms though.

pepyakin commented 2 years ago

I agree with your points that we should try to hide the guts that are inconsequential and that we should try to have at least an idea of how it could work on other platforms.

I acknowledge that the wasm environment would be incoherent after an async OOG trap. That's fine. In fact, this idea came to me after focusing on the observation that we discard the wasm environment anyway. We only want to know if the code successfully returned before the deadline.

Also, thanks for all the help. It was incredibly helpful so far.

I think even just getting it right on only Linux is like 90% of the trickiness

Can you expand on this bit? I cannot see any trickiness (in the sense of ignorance). My thinking here is:

We only really trap if the PC is within the wasm code. If the code is in host functions, libcalls, or trampolines we do not trap.
Trapping in prologues or epilogues is fine. We discard the stack anyway.
Some of the updates currently could be assumed to be atomic/uninterrupted, but that should be fine:
- I imagine a trap in the middle of a 64-bit store on a 32-bit machine can lead to tearing. Probably that can include the fuel counter itself. For our use case, we can assume 64-bit hardware though. I guess this might happen anyway with e.g. v128 but we already accept the inconsistency of the wasm environment in case of a trap.
- All non-atomic complicated stuff is done in libcalls anyway.

Or do you mean something else?

Having a landing pad I think can be advantageous because it would support Windows.

Can't we simulate an async trap with a shim trick similar to what we do on macOS? Something along those lines:

we suspend the thread of the interest,
check R15. If it's less than 0, we resume the thread and bail. Otherwise, go to the next item.
we modify the PC of the context to point on an ud2. The ud2 can be anywhere.
we resume the thread. The thread executes ud2, traps, and the control gets transferred to the VEH, where it's handled as a standard trap.

I forget what this is used for, but if we can I think we should ensure that the wasm code simply needs to look at the pinned register to determine if it's out of fuel (same for the signal handler). If that's not currently the case then I think we should refactor a bit to make this applicable (e.g., adjust the fuel handling in the trampolines or similar)

After looking at the code once more, I am confused about why we even have it in the first place. From what I can tell, the fuel counter is effectively a 32-bit number. Shouldn't we admit the counter is a 32-bit value from a user's perspective?

Our DWARF-based backtracing right now probably won't work super well with this which can hurt debuggability relative to raising a trap in the SIGUSR1 handler

What's the worry here? Is that we don't see the backtrace from the precise place/function where it was triggered?

I also share your concerns about smearing complications all over the place. However, I am concerned that it might be too much for me to deliver on this 😅 I assume I will need to work on machinst for that, and I lack the required knowledge for it, and also DWARF makes me anxious.

I also assume that we can change the strategy later.

alexcrichton commented 2 years ago

I think even just getting it right on only Linux is like 90% of the trickiness

Can you expand on this bit? ...

In some sense the setjmp/longjmp we do is the easiest way to handle this today where we can technically longjmp from any point. I think I may be getting ahead of myself by alluding to trickiness here. I'm a bit worried about future proposals of wasm and how we're going to handle them, such as:

Stack unwinding to get a wasm backtrace, which technically isn't required for correctness but is useful for debugging, is not easy with asynchronous exceptions. Asynchronous dwarf unwind information is much larger than normal unwind information and otherwise custom schemes (e.g. JIT/cranelift-specific ones) would all have to be sure to handle asynchronous unwinding as well.
Related to stack unwinding implementing the exceptions proposal would probably be tricky in the face of asynchronous traps here. (again this is very forward-looking and not necessarily a concern today; additionally it may be the case that traps don't interact with the exceptions proposal at all). One of my worries here is that the implementation strategy for the exception-handling proposal is probably going to subsume longjmp/setjmp we have for traps today since we want to get rid of that anyway.
One theoretical use case of fuel is resumption of execution to use fuel to time-slice. This means that a possible application of this feature would be to use the signal to context switch out but a big thing here is we want to resume later on. I suspect resumption from an async signal is trickier than simply trapping (e.g. landing pads to "just" do a ud2 wouldn't be sufficie)t.
In the case of a single ud2-per-function which serves as the actual vector for the trap there's some work to be done to thread through that metadata all the way to the signal handler which isn't the end of the world but not necessarily trivial.
One concern which is actually more concrete is that writing async-signal-safe code is not easy. Currently our signal handler is not async-signal-safe because it attempts to acquire an rwlock in the handler to read the global module metadata to determine if something is a wasm module or not. This is sort of ok today since it only runs on segfaults and our current stance is "well things are already really wrong on a non-wasm segfault anyway", but with a timer signal this is something we'd have to worry about.

To clarify again though I don't mean to try to dump everything on you at once though. I think it's fine to keep this focused on your use case at hand and if it doesn't expand then it doesn't expand. We'll need to take these things into account when reviewing, though. In general though my past experience with asynchronous exceptions/signals is "there be dragons" so I'm also trying to make sure we've got all our bases covered here as well.

Having a landing pad I think can be advantageous because it would support Windows.

Can't we simulate an async trap with a shim trick similar to what we do on macOS? ...

Indeed that was my thinking!

Shouldn't we admit the counter is a 32-bit value from a user's perspective?

That may be a bug... but yes if it's 32-bit we should probably update the 64-bit values in the Store to 32-bit!

Our DWARF-based backtracing right now probably won't work super well with this which can hurt debuggability relative to raising a trap in the SIGUSR1 handler

What's the worry here? Is that we don't see the backtrace from the precise place/function where it was triggered?

I alluded to this above, but concerns here are pretty forward-looking related to possibly different implementations of traps, unwinding, exceptions, etc. I think it's safe to ignore these for now and say "we wouldn't support capturing backtraces from the async handler" and leave it at that.

pepyakin commented 2 years ago

After working on the maybe_handle_out_of_gas-like approach realized that it may be worth exploring the high-level and user-friendly API. Initially, my main concern was that I did not want wasmtime to clash with the user code by reserving a signal. After you gave me the context about the future features, I realized that a high-level API would provide us with more leeway regarding implementation strategy and that reserving a hardcoded signal is not a big issue. Also all complications, like writing async-signal-safe code, are shifted to wasmtime which is a good thing.

The user-friendly API I have in mind now looks like the following:

/// A handle bound to a thread. Allows for checking the fuel in the thread it is bound.
/// Send + Sync
pub struct AsyncFuelChecker(...);
impl AsyncFuelChecker {
  /// Schedules the fuel check to be performed on the thread this checker is bound to. Make sure to call it often enough
  /// to avoid the fuel counter wrapping around between the calls.
  ///
  /// The fuel check has effect only if the control is in wasm. It's your responsibility to make sure that host functions take
  /// into account the fuel consumption. Also, wasm may be stuck in libcall impls of instructions like `memory.copy` which
  /// cannot be effectively interrupted. It's your responsibility to mitigate that.
  ///
  /// No-op if is called on the thread its bound to.
  ///
  /// We may be courteous to provide feedback on whether the thread does not exist but probably nothing more than that.
  pub fn check(&self) { /* ... */ }
}

pub struct Engine {
  /// Returns a async fuel checker for the current thread. Can return `Err` in case the platform unsupported or 
  /// the engine is not configured for slacked fuel metering.
  pub fn current_thread_async_fuel_checker(&self) -> Result<AsyncFuelChecker> { ... }
}

We could reserve the SIGUSR1 and use pthread_sigqueue with a magic cookie parameter to differentiate signals sent by wasmtime. If the signal handler receives SIGUSR1 without the cookie parameter, it will be passed to the previously installed, presumably by the user, signal handler.

One caveat here is that macOS does not really provide anything more than basic pthread_kill. Obviously, one way to solve it is to get away without sending this magic cookie. However, there is another caveat with pthread_kill. Apparently, pthread_kill cannot be used for threads managed by GCD, which is not the end of the world but still annoying.

On Linux/FreeBSD checking and interrupting happens on the same thread in a signal. We don't have to do that on macOS, even with the feature posix-signals-on-macos on[^1]. That is because macOS has thread_suspend, thread_resume, thread_get_state and thread_set_state. Those primitives are the same as Windows provides. Therefore we can do the same sequence as with Windows: suspend, check the fuel, if overflown, modify the PC in the thread's context, and resume. This approach does not depend on any global resources such as signals, so we are golden on macOS and Windows.

[^1]: although we still have to perform the fuel check in the trap to avoid the race condition I mentioned here (grep "There is a potential race between a trap and the detection of OOG. For example, ")

With that here is the second iteration.

The compilation part is almost the same so I quote it.

Add a new configuration option slacked_fuel_metering(bool). That option can only be enabled together with consume_fuel. It is also saved in tunables so that it's accessible in func_environ.

If consume_fuel is enabled, we also set cranelift flag enabled_pinned_reg to true

In func_environ, we would stop relying on cranelift_frontend SSA def-use var for self.fuel_var. Instead, use_var becomes get_pinned_reg(I64) and def_var becomes set_pinned_reg.

In func_environ, if slacked_fuel_metering enabled then we skip fuel_load_into_var, fuel_save_from_var and fuel_check.

In compiler.rs, we change the trampoline generation for host-to-wasm and wasm-to-host1. In both cases, we surround the code that performs the call with code that flushes/reloads the fuel counter. This code is predicated on the slacked_fuel_metering2 option. The flush/reload code snippets are very similar to those found in func_environ3, with the addition that a fuel check is made before entering or after leaving the wasm code.

Note that the previous item promotes trampolines from just argument adapters to dealing with the fuel counter. We will have to use the trampolines in some places where they are not needed before. One of such places is calling the start function. It's easily solved, though: just lookup_trampoline, and that's it. As it was mentioned, we will have to fix/remove the host_trampoline.

I say almost the same because pinned_reg, which is essential for this technique, works only on 64-bit platforms currently.

Obviously, we don't need to export wasmtime_runtime::is_wasm_pc as in the first iteration, but we need it internally for the implementation. There is a caveat that I did not see before: in the presence of several engines, it won't be enough to only have the answer if the given PC refers to wasm code, but we also should know if that code was generated with fuel in pinned register.

Then I suppose Engine::current_thread_async_fuel_checker would check if the slacked fuel metering is enabled and also if the current platform and arch are supported. Ultimately, it should be Linux, FreeBSD and macOS on x86_64 and aarch64. If the check does not pass, it returns an Err.

The Linux/FreeBSD side of implementation:

AsyncFuelChecker::new notes the current thread using pthread_self.
AsyncFuelChecker::check bails if pthread_self is equal to self.tid. Otherwise calls pthread_sigqueue(self.tid, SIGUSR1, MAGIC). MAGIC could be a random number or maybe an address of a private static field for sival_ptr.
we plumb a bool into platform_init which indicates if slacked_fuel_metering is enabled. If it is, then we register trap_handler for SIGUSR1. Again, we do that only for Linux/FreeBSD. I am not sure about SA_NODEFER though since we don't want reentrancy for SIGUSR1, but the same signal handler is used for other things.
In CallThreadState we add another parameter which indicates if the slacked fuel metering enabled. It's plumbed through catch_traps.
In the signal handler, before jmp_buf_if_trap (or maybe in it, but the important part that we remember that OOG has prevalence over a normal trap) we check if the slacked fuel metering is enabled. If so, then we check if the PC falls in any jit generated wasm code. If so, we extract r15/x21 and check if it is positive. If so, we note the unwind reason (initially, without a backtrace) and jump out via longjmp. If the signum is SIGUSR1 then we swallow the signal (similar to how the signal handler would return if the signal was handled by the user provided signal handler callback).

The macOS side of implementation:

AsyncFuelChecker::new notes the current thread using mach_thread_self.
AsyncFuelChecker::check checks if mach_thread_self equals to self.tid and if so bails. Otherwise, calls thread_suspend with the self.tid. Then extract the current thread state through thread_get_state. If the context's pc points into jitted wasm code we examine r15/x21. If it is positive we modify rip/pc to a shim by calling thread_set_state not forgetting to align the stack if needed. The shim will save the unwind reason (currently capture no stacktrace) and longjmp with the buf extracted from tls. After calling thread_resume the function returns. All errors are ignored, in particular the thread non-existance is expected. Other errors may be logged with debug level.
In CallThreadState we add another parameter which indicates if the slacked fuel metering enabled. It's plumbed through catch_traps.
For posix-signals-on-macos we modify the signal handler in the same way as we did for Linux (in fact, it's the same code, we just need to make sure it works in both cases). SIGUSR1 is not involved in this case.
For mach we should check if the slacked fuel metering is enabled. If so, we should check if the pc is in wasm and if the r15/x21 is positive and if so resume to the out-of-gas shim.
We should keep in mind that the trap handler should also check the fuel register, so that in case of a race with another trap we report OOG since that should've happened first.

The Windows implementation is similar to macOS' but with the corresponding APIs and without fuffing with posix-signals-on-macos.

Now, a couple of notes:

Probably if I were to implement it I would limit myself to Linux (maybe FreeBSD) x86_64 initially.
The traps can be approached with the mickey-mouse way I described above or directly with landing pads. The doors are open thanks for the high-level API. I'd start with a mickey-mouse impl first.
I am not as worried about debuggability in this case. I am not entirely sure how useful would be to know at which point exactly the gas runs out, perhaps with exception if the problem in infinite loop. I personally never had to debug one with all the time I worked with wasm. Furthermore, slacked fuel metering will show you any stack trace that goes after the point where it actually ran out. Also, if you work with fuel, chances are you work with deterministic code and it will be possible to re-run the code but this time with the deterministic fuel metering. That said, I agree if we can we should add them.

alexcrichton commented 2 years ago

That all sounds great to me, thanks for writing this all up! Personally I don't have any other comments at this time, I'll poke around and see if others have thoughts too though.

pepyakin commented 2 years ago

Hey @alexcrichton, have you had a chance to source some thoughts on this one? What would be the next steps here? Would a prototype help at this point?

alexcrichton commented 2 years ago

I asked around but no one posted here, so I think that's either ETOOBUSY or "seems fine". It seems reasonable to me to start with a prototype unless someone else feels differently.

pepyakin commented 2 years ago

Focus on this

2. AsyncFuelChecker::check bails if pthread_self is equal to self.tid. Otherwise calls pthread_sigqueue(self.tid, SIGUSR1, MAGIC)

I was assuming that, unlike raw TIDs/PIDs, pthread_t is not vulnerable to the reuse problem. But turns out it is! pthread docs state that using a pthread_t for a thread that ended its lifetime is not safe. You cannot use pthread_kill(tid, 0) to check if the thread is alive. I was not able to find a way that checks this.

Everywhere I look said that one should either pthread_join or have a bool variable that would indicate if the thread is running. In our case, AsyncFuelChecker does not own the thread and cannot join it. The bool flag option seems to be better at the first glance, but looking closer it is also not ideal. To give an example why imagine we create a RAII guard that sets the flag to true before entering into wasm allowing AsyncFuelChecker to send signals. However, there is a subtle side-effect: the thread may get killed without running the destructors. That must be safe since forgetting stuff must be safe, but it is not: the flag is still on but the pthread_t is now invalid.

I went through a couple of possible alternative strategies:

pidfd seems to be a solution. However, it is a bit too fresh. It requires 5.x. Debian Buster (LTS until 2024) is still on 4.19.
ptrace allows us to catch the process exit. Moreover, it has all the primitives (suspend, read/write regs, resume) for implementing macOS-on-ports and Windows-like checks. It seems, however, to be a bit heavy API which I don't completely understand. Not sure if there are no underwater rocks present.
We may rework the API so that it assumes control over the thread that is checkable. For example, that would allow to spawn the thread using clone with CLONE_CHILD_CLEARTID.

None of those seem ideal.

I am wondering if maybe I am missing something obvious? Maybe I am paying too much attention to this edge-case? Or maybe it is OK to require a new kernel and maybe gracefully degrade to YOLO kill with the saved TID? Or if there are better suggestions how to solve this?

alexcrichton commented 2 years ago

I don't think that pthread_kill is safe in any Rust world so I think it's fine to assume that destructors run in the thread. That'd enable a TLS variable to be created where the destructor sets the flag and then the signal isn't sent if the flag is set.