pepyakin opened 2 years ago
One of the primary reasons for fuel as opposed to epochs is the deterministic nature of fuel, which I think is why a scheme like the one you've proposed with asynchronous checks wouldn't entirely supplant fuel but would perhaps be another method of limiting wasm execution in Wasmtime. In some sense though what it seems like you're looking for is a hybrid of epochs (asynchronous cancellation, low overhead) and fuel (the measure of cancellation being how much code has run). I wonder if that might make more sense? For example if you could read (but not write) fuel from another thread then you could increment the epoch when fuel runs out or something like that.
In terms of the cost of fuel, though, I believe the main cost associated with it is not the checks for out-of-fuel but rather the management of the fuel counter itself, which has lots of loads and, critically, stores to memory. The repeated memory traffic ends up being quite costly, as opposed to epochs, which simply always read the same memory location and should therefore be more efficient (and from our benchmarking the efficiency of epochs is much better).
To put this a little differently: if the fuel check were moved asynchronously, I'm not sure it would buy much. If the check itself is moved externally then that loses the determinism of fuel, and I also think the main cost associated with fuel would still be present, namely the management of the fuel counter itself.
There does seem to be some utility to a "count the ops but don't interrupt" configuration for fuel, in that some users may be interested in basically a perf counter/stat but don't need to set limits. I agree that the removal of the control flow will only remove a part of the overhead for fuel; but it should significantly shrink code size, which is something.
I guess mostly I'd love to see measurements: @pepyakin have you experimented with this already? If not, would you be willing to hack something together? It shouldn't be too complex: basically we want to return from a codegen helper here (just after `fuel_increment_var`, before we create the blocks and add the compare/branch).
(For clarity, if this shows a sufficient benefit for your use-case, then I agree with Alex's suggested hybrid approach above: the way to go is to make it a config option to count fuel without interrupts, and then use epoch interruption to actually halt execution asynchronously.)
> In some sense though what it seems like you're looking for is a hybrid of epochs (asynchronous cancellation, low overhead) and fuel (the measure of cancellation being how much code has run). I wonder if that might make more sense? For example if you could read (but not write) fuel from another thread then you could increment the epoch when fuel runs out or something like that.
Yes, this is similar to what I was looking for.
Although I would still clarify: given point 2, this can be deterministic; ignoring timing, multiple executions of the same program with the same inputs will lead to the same outcome.
I was also wondering about the same construction you propose. I assumed that approach would require memory writes on each loop header and function call, that those writes would have to be atomic, and that this would be even more overhead.
> In terms of the cost of fuel, though, I believe the main cost associated with it is not the checks for out-of-fuel but rather the management of the fuel counter itself, which has lots of loads and, critically, stores to memory. The repeated memory traffic ends up being quite costly, as opposed to epochs, which simply always read the same memory location and should therefore be more efficient (and from our benchmarking the efficiency of epochs is much better).
I might have explained myself poorly.
All this faffing around with safepoints I proposed was in the hope that we could avoid not only the compare-and-branch but the memory writes as well. It's avoiding, not eliminating, though, since the fuel counter can still be spilled.
> I guess mostly I'd love to see measurements: @pepyakin have you experimented with this already? If not, would you be willing to hack something together? It shouldn't be too complex: basically we want to return from a codegen helper here (just after `fuel_increment_var`, before we create the blocks and add the compare/branch).
So all of the above are just uneducated guesses, it turns out. I tried to make a benchmark but then realized I had made a mickey-mouse mistake, the one about DCE I mentioned in the OP.
I too would love to see measurements and am willing to hack. I am not sure what the best way to do that would be, though.
> So all of the above are just uneducated guesses, it turns out. I tried to make a benchmark but then realized I had made a mickey-mouse mistake, the one about DCE I mentioned in the OP.
Ah, sorry, I had missed that note; but it's a little surprising given that here we call `fuel_save_from_var` to store the value back to memory, and this is invoked at every function exit. Stores shouldn't be DCE'd (since they are side-effecting). Can you dump the CLIF (`RUST_LOG=trace` should show it buried among a bunch of other output) and see what it looks like just after wasm translation?
> I assumed that that approach would require memory writes on each loop header & function call and those have to be atomic, and I assumed that would be even more overhead.
With epochs the increment of the epoch is an atomic write, but all the reads are "just" atomic reads, which in practice end up being quite fast. The instrumented code itself doesn't do any atomic writes anywhere (same for fuel: no atomic writes anywhere).
Given your clarification (thanks!) it does sort of sound like you want fuel + epochs without the actual interruption of fuel though, just the accounting. This may also need to be specialized a bit in codegen to make things a bit more optimal but otherwise seems pretty close?
> Ah, sorry, I had missed that note; but it's a little surprising given that here we call `fuel_save_from_var` to store the value back to memory, and this is invoked at every function exit. Stores shouldn't be DCE'd (since they are side-effecting). Can you dump the CLIF (`RUST_LOG=trace` should show it buried among a bunch of other output) and see what it looks like just after wasm translation?
It was a mistake on my part; I was doing something else. Modifying `fuel_check` to skip everything after `fuel_increment_var` as you proposed works.
So the results with my microbenchmarks are: fuel overhead with interrupts 24–34%, fuel overhead without interrupts 10–25%. Which is something, but ideally we'd lower it even more.
> This may also need to be specialized a bit in codegen to make things a bit more optimal but otherwise seems pretty close?
Could you elaborate on this bit?
FWIW, enabling epoch interrupts and consuming fuel (but without fuel interrupts) gives 28-40% overhead.
> ideally we lower it even more.
So, some fairly complex ideas incoming: there are at least research-level ideas floating around (I can't find any papers at the moment unfortunately but I know I've seen it before) on how to minimally/optimally record the trace (path taken through a program). Counting ops is equivalent to this problem so I think the same ideas apply. It goes something like: choose a set of "profiling points" (here, points at which you increment the counter), such that every branch outcome can be deduced from the points that are reached.
The simple/intuitive cases are loops (one profiling point in the header) and if/then diamonds (one profiling point in each branch) and in fact that's what we do today, rather than a profiling point in every block. But it actually starts to get interesting even with the if/else: we really only need a profiling point in one branch of the diamond. E.g., if the if-path takes 10 ops, and the else-path takes 15 ops, then we unconditionally add 10 (probably as part of the profiling point at the top of the function or enclosing loop), and add 5 more only on the else-path. Bonus points if we choose the rare path for the additional increment.
The general answer to this problem I think will have something to do with dom/postdom relations -- as a first pass, can we merge a block's profiling point (fuel increment amount) into the increment of any block it postdominates? And then there might be another transform where we draw a "postdom cut" across N blocks relative to some block, and hoist some common factor k out of all of those blocks (i.e., if anywhere we go will eventually increment by at least 10, let's hoist that). It has to be a little more subtle than that because loop bodies can postdom the post-loop tail, so we really need some sort of "1-to-1 execution count" relation. But I'd have to sit down and think more (or just find the papers on this!).
Ah, one more idea that just came to mind: we could potentially remove a lot of the memory-traffic overhead of fuel counting by defining a new ABI with a "pinned fuel counter". Then it only needs to be loaded from the vmctx in host-to-wasm trampolines, and we could "spill" to vmctx at callsites to non-pinned-fuel-counter ABIs (direct calls to imported host functions for example).
> So, some fairly complex ideas incoming...
Oh, right, something like this also crossed my mind.
It looks like the ABI idea is easier to hack, though. You don't mean a literal new ABI (as in a new variant in `CallConv`), since it seems to me that you can get away with changes only in `crates/cranelift`, right?
> It looks like the ABI idea is easier to hack, though. You don't mean a literal new ABI (as in a new variant in `CallConv`), since it seems to me that you can get away with changes only in `crates/cranelift`, right?
We could do it that way; but passing the value then has to be explicit (argument into every function, and extra return value from every function). Also slightly less efficient since those would be different registers, so some extra moves.
I had been thinking a flag on the Cranelift options that means "reserve a register that is globally allocated for this purpose, and provide CLIF ops to get and set it", but actually we already have that: I realized that we aren't using the "pinned reg" for anything in Wasmtime's wasm-to-CLIF lowering right now, so we could set that option and then use `get_pinned_reg`/`set_pinned_reg` to "load" and "store" the fuel count.
We'd still need to transfer it to and from the vmctx when transitioning from host to Wasm code (in the trampoline) and from Wasm to host code (this one is a little trickier).
Well, I went ahead and implemented the explicit value passing. It's messy and now I stumbled upon some problem in the wasm translation code. I suspect there is some hidden dependency on the fact that there are only two hidden params. I am debugging it right now. The trampoline code is already in place though.
Regarding `get_pinned_reg`/`set_pinned_reg`, that's funny: I looked at it but thought it wasn't worth exploring, assuming it was a remnant of the old-style backend. If that works, does that mean cranelift always has a reserved register that is not used for anything?

UPD: Nvm, I see there is `enable_pinned_reg`.
IIRC there's a fair bit of open-coding about our two implicit `VMContext` parameters. I think it would be great to reduce this open-coding if we can, though. But if you're hitting issues with our current two hidden parameters that's probably what's happening.
Otherwise though one important bit about ABI details and pinned registers is that right now Wasmtime's API relies on the fact that trampolines can be defined natively in Rust. This is tested currently and will break if there are ABI details Rust can't uphold (like preserving the pinned register). Overall I personally feel this is a mis-feature of Wasmtime, as the only thing it buys us is the ability to `Func::call` a native function wrapped up via a `Func`, which no one really wants to do. Instead I think we should use the trampoline registration system that is already in place for modules and simply require that lookup succeeds (e.g. documenting that `Func::call` in a store with no modules will panic). That would mean that we no longer have to define trampolines in Rust (and that the `host_trampoline` function could be deleted) and all trampolines would be assembled by Cranelift. We'd still have to define a host ABI for the trampolines to talk to Rust, but having that transition point where Cranelift is always in control seems prudent.
This latter point though of refactoring how trampolines-to-imported-host-functions work would require some relatively deep refactorings. I think something about the `VMContext` would have to change to support host trampolines, but I'm not sure precisely what, having not designed it yet.
Basically I just wanted to mention that dealing with a pinned register and/or non-System-V ABIs I think is a good idea and something we should enable, but I don't believe it can trivially be done today without more significant refactorings of what a trampoline looks like when wasm enters the host. (and enabling this all would probably best be done as a separate PR)
Hacking on the explicit passing of the fuel counter without storing it in memory, I bumped into a bug. It manifests with fuel metering and the `unreachable` instruction. Basically, the transfer of control out of there means the fuel counter should be flushed just before the trap. That's easy enough.

However, other traps are tricky: memory accesses (at least with bounds checks eliminated via mprotect) and stack overflows. If we were to flush the fuel counter, it would have to be done pretty much everywhere. I think it's not impossible that it would fare worse than vanilla fuel metering. Am I missing something?
The pinned fuel counter approach proposed by @cfallin might work in this case. I assume traps always transfer control through a signal handler. We could extract the contents of the pinned register from the last frame before the trap from within the signal handler and then flush it to `VMRuntimeLimits`.
I read from @alexcrichton's post that the codebase is not ready for this change yet. However, I am trying to get a sense of potential wins here, not a final solution. Do you folks see any roadblocks to hacking the solution here? If not, I might try to tackle that approach then.
I think reading the fuel in the signal handler and writing it there instead of in the jit code would be a good approach. While getting this all 100% correct will be difficult I suspect you could "just implement it" today and get some good perf numbers for what the final solution would look like performance-wise.
I've hacked a version that uses the pinned_reg approach. I did not bother implementing extracting fuel in the signal handler, though. I just checked that it is possible, and it seemed positive.
The results are mixed: one benchmark suffers from a slowdown. I haven't isolated the root cause.
However, I ran an extra experiment. Basically, I just omitted calls to `fuel_check`. Similar to what I described in the OP, the idea is to periodically send a signal to interrupt the thread and check the counter value, except it is way simpler. According to my measurements, the overhead is around 5% relative to no fuel metering. This result is promising.
Do you think this approach is something that could be upstreamed?
That sounds pretty reasonable to me, although I think there's probably trickiness to work out along the lines of a signal being sent to a thread that isn't currently executing wasm, e.g. it's in the middle of an import or something like that. In that situation is it something that's ok to unconditionally ignore? Or should the signal get "deferred" to get handled once wasm is reentered?
Also to make sure I understand, the 5% overhead is basically incrementing a pinned register which is fuel consumed, right? And presumably one less register in register allocation as well, too.
I assumed we could just check `IS_WASM_PC` for the IP while handling a signal. If control is not inside wasm, then we ignore it. That should be enough, at least for our use cases. I thought that would be acceptable since I am spilling the fuel counter in the trampolines in both directions; the OOG check can be performed in the trampoline or asynchronously as well. Am I missing something?
> Also to make sure I understand, the 5% overhead is basically incrementing a pinned register which is fuel consumed, right? And presumably one less register in register allocation as well, too.
Yes, that's my understanding.
Ok yeah that makes sense to me at a high level at least. That would make cross wasm/host calls a bit slower with all the trampolines, but that seems reasonable and isn't the end of the world. (e.g. on par with other slowdowns, and seems like a good tradeoff of complexity)
So again though, to make sure I understand everything, the changes would be: ... (`Store<T>` state? Do we rely on the user to get the target thread right? Can we do this portably?)

Hm, so actually writing this out, there's a fair amount of design space here. I think it would be good to write up something describing the new fuel system first, perhaps to review before writing all the code? I don't think this needs an RFC per se, but something at that level of technical detail would be useful to help judge how best to fit this within Wasmtime. For example we could leave the signal handling and signal sending entirely up to the embedder and simply document that a specific register has the fuel amount (plus maybe adding public signal-safe methods to test whether something is a known wasm program counter)
To be fair, right now, I am only at the proof-of-concept stage. Probably it will take a couple of iterations to come up with the final design. That said, I was leaning towards something that you described in the end. In fact, I already have a prototype of that locally.
To kick off the discussion, here is my vision of the minimal slacked fuel metering:
1. A new config option `slacked_fuel_metering(bool)`. That option can only be enabled together with `consume_fuel`. It is also saved in tunables so that it's accessible in func_environ.
2. If `consume_fuel` is enabled, we also set the cranelift flag `enabled_pinned_reg` to `true`.
3. Get rid of `self.fuel_var`. Instead, `use_var` becomes `get_pinned_reg(I64)` and `def_var` becomes `set_pinned_reg`.
4. If `slacked_fuel_metering` is enabled then we skip `fuel_load_into_var`, `fuel_save_from_var` and `fuel_check`.
5. The trampolines[^1] flush/reload the fuel counter under the `slacked_fuel_metering`[^2] option. The flush/reload code snippets are very similar to those found in func_environ[^3], with the addition that a fuel check is made before entering or after leaving the wasm code.
6. There is a complication with the `start` function. It's easily solved, though: just `lookup_trampoline`, and that's it. As was mentioned, we will have to fix/remove the `host_trampoline`.
7. Expose a `wasmtime_runtime::is_wasm_pc` function, which returns `true` if the given PC points to any wasm code we generated.[^4]

That should give a minimal toolset to allow the embedder to implement slacked fuel metering. One possible instantiation for Linux x86_64:
1. Install a SIGUSR1 signal handler with `SA_SIGINFO`.
2. Periodically call `kill` with the saved TID and the SIGUSR1 signal.
3. The signal handler performs the `wasmtime_runtime::is_wasm_pc` query. If it returns false, the signal handler returns. Then, if r15 ≥ 0, the signal handler calls a trap-raising routine (I used `raise_lib_trap` since it does not require allocation, which is not signal-safe).

Here are some questions.
1. The embedder example above does not account for the `fuel_adj` component. I am not sure how it gets accounted for in the compiled code. It just seems that the maximum fuel available is `i64::max_value()` rather than `u64::max_value()`. Anyway, in case I just don't understand how it works and we want to fix that for this example, we should be able to reach out to the store from the signal handler to account for that properly. I am not sure how to do that cleanly.
2. There is a potential race between a trap and the detection of OOG. For example, a function with `unreachable` in it. The fuel counter is increased just before `ud2`. If the async signal is triggered with RIP pointing just after the `add` but before the `ud2`, that is reported as OOG. Otherwise, it's a trap. If we care to distinguish between those, then `traphandlers` should check the fuel pinned register first. If there's an overflow, it should be reported as OOG and not the original trap reason (in this case, explicit `unreachable`).

[^1]: You seem to assume in your previous message that wasm-to-host trampolines do not exist, but they do.
[^2]: That would require us to pass tunables into the functions that generate trampolines.
[^3]: The complication is we need to pass the `VMOffsets` there to know how to get to the `VMContext->VMRuntimeLimits->consumed_fuel`.
[^4]: The existing private `IS_WASM_PC` returns true if the code points to an instruction that can produce a trap.
One of my main worries is from an engine perspective I don't think we want to tie our hands too much with respect to whatever the initial implementation looks like. In what you're proposing there's a fair bit of open-coding around things like:

- Use of `wasmtime_runtime`; instead we want to expose everything from the `wasmtime` crate. Additionally where possible we ideally want to scope things at least on an `Engine` to have access to configuration settings and other necessary "global" state or such.
- Reaching into the `VMContext`, which we currently can't do from a signal handler. For now I think we should design fuel to not need to reach into the `VMContext`, but leaving open the possibility of something like two pinned registers in the future I think would be good.
- Use of `raise_lib_trap`. IIRC on macOS and maybe other platforms, to raise a trap from a signal handler we have to actually return from the signal handler to raise the trap (this bit of code). I realize macOS primarily does mach port stuff but in unix signal mode we can't actually longjmp immediately to get out of the signal handler. Also as a small aside I don't think you need to worry about async-signal safety here. That's all handled in the "is this a wasm pc" test, where once that returns true we know it's safe to call malloc and things. (if "is this wasm" returns `false`, though, then everything up to and after must be async-signal-safe)
- (etc)
I think ideally we want to design things to be as flexible for the engine as we can while still satisfying your use case. Along those lines I think something like this may help:
```rust
impl Engine {
    unsafe fn maybe_handle_out_of_gas(&self, info: *mut libc::siginfo_t) -> OutOfGas;
}

enum OutOfGas {
    NotInWasm,
    SigInfoUpdated,
    // Unwound,
}
```
The rough idea here being that we try to leave details like exact registers and exact "jump out of here to generate a trap" strategy to Wasmtime as much as possible. This function would return `SigInfoUpdated` on macOS for example and would directly longjmp out on Unix (or other applicable platforms). We could then also have a function that takes a `*mut CONTEXT` argument for Windows or something like that.
To give thoughts on your questions:
> The embedder example above does not account for the `fuel_adj` component
I forget what this is used for, but if we can I think we should ensure that the wasm code simply needs to look at the pinned register to determine if it's out of fuel (same for the signal handler). If that's not currently the case then I think we should refactor a bit to make this applicable (e.g. adjust the fuel handling in the trampolines or similar)
> There is a potential race between a trap and the detection of OOG
I think you have a good idea here of checking for out-of-gas in the trap handler code. That seems reasonable to me to ignore the original trap and switch it to out-of-gas to try to make that deterministic (as fuel is supposed to be deterministic after all).
The caveat to this is that while host-level side effects are guaranteed to be deterministic (as we always check for fuel in trampolines when we come back to the host) the wasm-level side effects aren't guaranteed to be deterministic. For example memory modifications are going to be nondeterministic in the face of fuel since we can't precisely stop the wasm. This seems like a reasonable tradeoff to me, though, and just something to mention in the slacked fuel configuration documentation. Basically 100% determinism is fuel, and only host-side-effect determinism is guaranteed with slacked fuel. (or something like that)
> You mentioned the async trap and landing pads
Ah sure, let me expand a bit. When I say "async trap" I mean the idea of an asynchronous signal coming into wasm and raising a trap at any time. Right now wasm can only raise traps at defined points but with slacked fuel any wasm pc could be the origin source of a trap. Raising a trap from any pc in a module from a signal handler isn't trivial to do and will be tricky to get right across platforms. (I think even just getting it right on only Linux is like 90% of the trickiness)
When I say "landing pad", one scheme for implementing this is that when out-of-gas is detected the running context of the thread is updated in-place. For example if all wasm functions had a `ud2` instruction at the end that was registered as the "out of gas trap area" then we would update RIP to point to this `ud2` instruction and then resume the thread. The thread would then immediately fault and trigger the normal trap logic. Having a landing pad I think can be advantageous because it would support Windows (which doesn't have asynchronous signals but you can suspend/resume a thread) and possibly macOS better too (since no sort of fake stack frame needs to be created). Even on Linux it would help centralize all the unwinding logic into the existing trap-handling function instead of adding a new entry point of this custom handler.
That being said I'm not sure that this is the best implementation strategy. Our DWARF-based backtracing right now probably won't work super well with this which can hurt debuggability relative to raising a trap in the SIGUSR1 handler, for example. In general I'm personally quite paranoid of signals and want to do as little as possible in signal handlers, centralizing all the complications in one place if possible instead of having multiple handlers do the same thing (especially if these handlers are managed by the embedder rather than Wasmtime). My hesitation is that this is implementation-wise complex relative to what you've probably already got working.
Also, to clarify, I don't think you need to implement support for anything other than your platform of interest. Despite this though I think it's still worthwhile to consider other platforms during the design process so that if and when someone comes along and is interested in getting this feature working for a new platform we have a way for them to do it which doesn't involve rewriting too much. I don't want to get lost in the weeds of theoreticals of other platforms though.
I agree with your points that we should try to hide the guts that are inconsequential and that we should try to have at least an idea of how it could work on other platforms.
I acknowledge that the wasm environment would be incoherent after an async OOG trap. That's fine. In fact, this idea came to me after focusing on the observation that we discard the wasm environment anyway. We only want to know if the code successfully returned before the deadline.
Also, thanks for all the help. It was incredibly helpful so far.
> I think even just getting it right on only Linux is like 90% of the trickiness
Can you expand on this bit? I cannot see any trickiness (which may just be my ignorance). My thinking here is:
Or do you mean something else?
> Having a landing pad I think can be advantageous because it would support Windows.
Can't we simulate an async trap with a shim trick similar to what we do on macOS? Something along these lines:

1. Suspend the thread, check the fuel, and if it overflowed, point the PC at a `ud2`. The `ud2` can be anywhere.
2. Resume the thread. It executes the `ud2`, traps, and the control gets transferred to the VEH, where it's handled as a standard trap.
After looking at the code once more, I am confused about why we even have it in the first place. From what I can tell, the fuel counter is effectively a 32-bit number. Shouldn't we admit the counter is a 32-bit value from a user's perspective?
> Our DWARF-based backtracing right now probably won't work super well with this which can hurt debuggability relative to raising a trap in the SIGUSR1 handler
What's the worry here? Is that we don't see the backtrace from the precise place/function where it was triggered?
I also share your concerns about smearing complications all over the place. However, I am concerned that it might be too much for me to deliver on this 😅 I assume I will need to work on machinst for that, and I lack the required knowledge for it, and also DWARF makes me anxious.
I also assume that we can change the strategy later.
> I think even just getting it right on only Linux is like 90% of the trickiness
>
> Can you expand on this bit? ...
In some sense the setjmp/longjmp we do is the easiest way to handle this today where we can technically longjmp from any point. I think I may be getting ahead of myself by alluding to trickiness here. I'm a bit worried about future proposals of wasm and how we're going to handle them, such as:
To clarify again, I don't mean to dump everything on you at once. I think it's fine to keep this focused on your use case at hand, and if it doesn't expand then it doesn't expand. We'll need to take these things into account when reviewing, though. In general my past experience with asynchronous exceptions/signals is "there be dragons" so I'm also trying to make sure we've got all our bases covered here as well.
> Having a landing pad I think can be advantageous because it would support Windows.
>
> Can't we simulate an async trap with a shim trick similar to what we do on macOS? ...
Indeed that was my thinking!
> Shouldn't we admit the counter is a 32-bit value from a user's perspective?
That may be a bug... but yes, if it's 32-bit we should probably update the 64-bit values in the `Store` to 32-bit!
> Our DWARF-based backtracing right now probably won't work super well with this which can hurt debuggability relative to raising a trap in the SIGUSR1 handler
>
> What's the worry here? Is that we don't see the backtrace from the precise place/function where it was triggered?
I alluded to this above, but concerns here are pretty forward-looking related to possibly different implementations of traps, unwinding, exceptions, etc. I think it's safe to ignore these for now and say "we wouldn't support capturing backtraces from the async handler" and leave it at that.
After working on the `maybe_handle_out_of_gas`-like approach, I realized that it may be worth exploring a high-level and user-friendly API. Initially, my main concern was that I did not want wasmtime to clash with the user code by reserving a signal. After you gave me the context about the future features, I realized that a high-level API would provide us with more leeway regarding implementation strategy, and that reserving a hardcoded signal is not a big issue. Also, all complications, like writing async-signal-safe code, are shifted to wasmtime, which is a good thing.
The user-friendly API I have in mind now looks like the following:
```rust
/// A handle bound to a thread. Allows for checking the fuel in the thread it is bound to.
/// Send + Sync
pub struct AsyncFuelChecker(...);

impl AsyncFuelChecker {
    /// Schedules the fuel check to be performed on the thread this checker is bound to.
    /// Make sure to call it often enough to avoid the fuel counter wrapping around
    /// between the calls.
    ///
    /// The fuel check has an effect only if control is in wasm. It's your
    /// responsibility to make sure that host functions take the fuel consumption into
    /// account. Also, wasm may be stuck in libcall impls of instructions like
    /// `memory.copy` which cannot be effectively interrupted. It's your responsibility
    /// to mitigate that.
    ///
    /// No-op if called on the thread it's bound to.
    ///
    /// We may be courteous enough to provide feedback on whether the thread does not
    /// exist, but probably nothing more than that.
    pub fn check(&self) { /* ... */ }
}

impl Engine {
    /// Returns an async fuel checker for the current thread. Can return `Err` in case
    /// the platform is unsupported or the engine is not configured for slacked fuel
    /// metering.
    pub fn current_thread_async_fuel_checker(&self) -> Result<AsyncFuelChecker> { ... }
}
```
We could reserve SIGUSR1 and use `pthread_sigqueue` with a magic cookie parameter to differentiate signals sent by wasmtime. If the signal handler receives SIGUSR1 without the cookie, it passes the signal on to the previously installed (presumably the user's) handler.
One caveat here is that macOS does not really provide anything more than a basic `pthread_kill`. Obviously, one way to solve this is to get away without sending the magic cookie. However, there is another caveat with `pthread_kill`: apparently, it cannot be used for threads managed by GCD, which is not the end of the world but still annoying.
On Linux/FreeBSD, checking and interrupting happen on the same thread, inside a signal handler. We don't have to do that on macOS, even with the `posix-signals-on-macos` feature on[^1]. That is because macOS has `thread_suspend`, `thread_resume`, `thread_get_state` and `thread_set_state`. Those primitives are the same ones Windows provides, so we can do the same sequence as on Windows: suspend, check the fuel, and if it has overflown, modify the PC in the thread's context and resume. This approach does not depend on any global resources such as signals, so we are golden on macOS and Windows.
[^1]: Although we still have to perform the fuel check in the trap handler to avoid the race condition I mentioned here (grep "There is a potential race between a trap and the detection of OOG. For example, ")
With that, here is the second iteration.
The compilation part is almost the same, so I quote it.
- Add a new configuration option, `slacked_fuel_metering(bool)`. It can only be enabled together with `consume_fuel`. It is also saved in tunables so that it's accessible in func_environ.
- If `consume_fuel` is enabled, we also set the cranelift flag `enable_pinned_reg` to `true`.
- In func_environ, we stop relying on the cranelift_frontend SSA def-use var for `self.fuel_var`. Instead, `use_var` becomes `get_pinned_reg(I64)` and `def_var` becomes `set_pinned_reg`.
- In func_environ, if `slacked_fuel_metering` is enabled, we skip `fuel_load_into_var`, `fuel_save_from_var` and `fuel_check`.
- In compiler.rs, we change the trampoline generation for host-to-wasm and wasm-to-host. In both cases, we surround the code that performs the call with code that flushes/reloads the fuel counter. This code is predicated on the `slacked_fuel_metering` option. The flush/reload code snippets are very similar to those found in func_environ, with the addition that a fuel check is made before entering or after leaving the wasm code.
- Note that the previous item promotes trampolines from mere argument adapters to code that also manages the fuel counter. We will have to use trampolines in some places where they were not needed before. One such place is calling the `start` function. It's easily solved, though: just `lookup_trampoline`, and that's it. As mentioned, we will also have to fix/remove the `host_trampoline`.
I say "almost the same" because pinned_reg, which is essential for this technique, currently works only on 64-bit platforms.
Obviously, we don't need to export `wasmtime_runtime::is_wasm_pc` as in the first iteration, but we do need it internally for the implementation. There is a caveat I did not see before: in the presence of several engines, it is not enough to know whether a given PC refers to wasm code; we also need to know whether that code was generated with the fuel counter in the pinned register.
Then I suppose `Engine::current_thread_async_fuel_checker` would check whether slacked fuel metering is enabled and whether the current platform and arch are supported. Ultimately, that should be Linux, FreeBSD and macOS on x86_64 and aarch64. If the check does not pass, it returns an `Err`.
The Linux/FreeBSD side of the implementation:

- `AsyncFuelChecker::new` notes the current thread using `pthread_self`.
- `AsyncFuelChecker::check` bails if `pthread_self` is equal to `self.tid`. Otherwise it calls `pthread_sigqueue(self.tid, SIGUSR1, MAGIC)`. `MAGIC` could be a random number or maybe the address of a private static field, passed as `sival_ptr`.
- `platform_init` gets a parameter which indicates whether `slacked_fuel_metering` is enabled. If it is, we register `trap_handler` for `SIGUSR1`. Again, we do that only on Linux/FreeBSD. I am not sure about `SA_NODEFER`, though: we don't want reentrancy for SIGUSR1, but the same signal handler is used for other things.
- To `CallThreadState` we add another parameter which indicates whether slacked fuel metering is enabled. It's plumbed through `catch_traps`.
- Before `jmp_buf_if_trap` (or maybe in it, but the important part is that we remember that OOG takes precedence over a normal trap) we check whether slacked fuel metering is enabled. If so, we check whether the PC falls in any JIT-generated wasm code. If so, we extract r15/x21 and check whether it is positive. If so, we note the unwind reason (initially, without a backtrace) and jump out via longjmp. If the signum is SIGUSR1 we then swallow the signal (similar to how the signal handler would return if the signal was handled by a user-provided signal handler callback).

The macOS side of the implementation:
- `AsyncFuelChecker::new` notes the current thread using `mach_thread_self`.
- `AsyncFuelChecker::check` checks whether `mach_thread_self` equals `self.tid` and, if so, bails. Otherwise, it calls `thread_suspend` with `self.tid`, then extracts the current thread state through `thread_get_state`. If the context's PC points into JITted wasm code, we examine r15/x21. If it is positive, we modify rip/pc to point at a shim by calling `thread_set_state`, not forgetting to align the stack if needed. The shim will save the unwind reason (currently capturing no stacktrace) and longjmp with the buf extracted from TLS. After calling `thread_resume` the function returns. All errors are ignored; in particular, thread non-existence is expected. Other errors may be logged at `debug` level.
- To `CallThreadState` we add another parameter which indicates whether slacked fuel metering is enabled. It's plumbed through `catch_traps`.
- With `posix-signals-on-macos` we modify the signal handler in the same way as we did for Linux (in fact, it's the same code; we just need to make sure it works in both cases). SIGUSR1 is not involved in this case; instead, we `resume` to the out-of-gas shim.

The Windows implementation is similar to macOS's, but with the corresponding APIs and without fussing with `posix-signals-on-macos`.
Now, a couple of notes:
That all sounds great to me, thanks for writing this all up! Personally I don't have any other comments at this time, I'll poke around and see if others have thoughts too though.
Hey @alexcrichton, have you had a chance to source some thoughts on this one? What would be the next steps here? Would a prototype help at this point?
I asked around but no one posted here, so I think that's either ETOOBUSY or "seems fine". It seems reasonable to me to start with a prototype unless someone else feels differently.
Focus on this:

> 2. `AsyncFuelChecker::check` bails if `pthread_self` is equal to `self.tid`. Otherwise calls `pthread_sigqueue(self.tid, SIGUSR1, MAGIC)`.
I was assuming that, unlike raw TIDs/PIDs, `pthread_t` is not vulnerable to the reuse problem. But it turns out it is! The pthread docs state that using a `pthread_t` for a thread that has ended its lifetime is not safe. You cannot use `pthread_kill(tid, 0)` to check if the thread is alive, and I was not able to find another way to check this.
Everywhere I looked said that one should either `pthread_join` or keep a bool variable indicating whether the thread is running. In our case, `AsyncFuelChecker` does not own the thread and cannot join it. The bool flag option seems better at first glance, but on closer inspection it is also not ideal. To give an example why: imagine we create an RAII guard that sets the flag to true before entering wasm, allowing `AsyncFuelChecker` to send signals. However, there is a subtle side-effect: the thread may get killed without running its destructors. That must be safe, since forgetting stuff must be safe, but here it is not: the flag is still on, yet the `pthread_t` is now invalid.
I went through a couple of possible alternative strategies:

- `pidfd` seems to be a solution. However, it is a bit too fresh: it requires kernel 5.x, and Debian Buster (LTS until 2024) is still on 4.19.
- `ptrace` allows us to catch the process exit. Moreover, it has all the primitives (suspend, read/write regs, resume) for implementing macOS-on-ports and Windows-like checks. It seems, however, to be a rather heavy API which I don't completely understand. Not sure there are no underwater rocks.
- `clone` with `CLONE_CHILD_CLEARTID`.

None of those seem ideal.
I am wondering if maybe I am missing something obvious? Maybe I am paying too much attention to this edge case? Or maybe it is OK to require a new kernel and gracefully degrade to a YOLO `kill` with the saved TID? Or are there better suggestions for how to solve this?
I don't think that `pthread_kill`-style thread termination is safe in any Rust world, so I think it's fine to assume that destructors run on the thread. That'd enable a TLS variable to be created whose destructor sets the flag, and then the signal isn't sent if the flag is set.
wasmtime right now has the fuel mechanism. It allows precise control of how many instructions are executed before suspending execution, at least at basic-block granularity. The price is a rather significant performance hit.
I believe we can do better in the case where the user only wants to know whether execution exceeded some fuel limit. The cost to pay is that the condition is detected with a delay, hence "slacked". Therefore, this works as a substitute for the fuel metering mechanism only if there are no side effects when execution reaches its deadline. For instance, all changes performed are rolled back if a DB transaction exceeds its fuel allowance.
The idea is to take the existing fuel mechanism, remove the compare-and-conditional-jump sequences, and check the out-of-fuel condition asynchronously.
The tricky part is the asynchronous fuel variable checking and suspending the execution.
We could leverage HotSpot-style safepoints to implement this. Such a safepoint could be implemented just as an access to a page. Suspending the execution could then be triggered by `mprotect`-ing that page. Similarly to epoch interruption, we could place safepoints on function entry and loop backedges. We would need to extract the fuel counter while handling the page-fault signal. We assume that codegen thoughtfully left us a mapping of where the fuel counter is saved: either a register or an offset for a spilled variable. Having recovered the fuel counter, we'd check whether it exceeded the limit and yield if so.

I realized that this is not trivial to benchmark. I tried removing `fuel_check`, but that would lead to DCE of the now-unused fuel value. I would appreciate suggestions on how to benchmark this properly.