WebAssembly / stack-switching

A repository for the stack switching proposal.
Other
147 stars 13 forks source link

Suspension across JavaScript frames #49

Closed conrad-watt closed 3 months ago

conrad-watt commented 7 months ago

I've been reading the new Bag of Stacks explainer. Is it intended that intermediate JavaScript frames will also be captured by a suspended coroutine? For example if I execute switch.call to suspend my current coroutine as $c0, and in my new coroutine pass this reference through a JS call back into Wasm (possibly multiple times), and then call switch on $c0, will the newly suspended coroutine $c1 capture these intermediate JS call frames, so that if I switch back to $c1 it's possible to return from my current execution back into one of the JS frames?

If it is intended that JS frames should be captured, has there been any discussion as to how the single-threaded case could be extended to concurrent work stealing? Supporting concurrent work stealing while preventing a JS frame from being inappropriately captured and transferred to another thread has been one of our main concerns when considering the intersection of our new threads proposal and stack switching.

fgmccabe commented 7 months ago

It is definitely NOT intended that stack switching in any guise will be permitted to capture JavaScript frames.

conrad-watt commented 7 months ago

In the example I sketched above, is it expected that the switch to $c0 call will trap because of the intervening JS frames being captured? If so, has there been discussion of how an implementation will check for the presence of these frames?

fgmccabe commented 7 months ago

V8 checks for this by using a counter on the boundary between wasm and JS.

fgmccabe commented 7 months ago

Note that the scenario above was possible in the current/old JSPI design: by passing Suspender objects to JS & back. There are also simpler so-called sandwich scenarios which are also prohibited (not involving sharing of stacks/suspenders).

thibaudmichaud commented 7 months ago

Small correction on the V8 implementation: we don't use a counter anymore. We check whether we are currently running on a secondary stack, which is the case if and only if suspending is currently allowed. E.g. in the "sandwich scenario", the call stack would look like JS -> promising export -> wasm -> JS -> wasm -> suspending import. We switch from the secondary stack to the central stack in the wasm->JS call, so the suspending import will trap.

conrad-watt commented 7 months ago

Thanks all for the answers! Apologies for my repeated questions, I'm trying to get up to speed on everything. What's the intuition for how the above wasm->JS checking discipline for JSPI generalises to Bag-of-Stacks? Is the general idea that promising export is roughly analogous to switch.call in that it implies switching to a secondary stack? Does the suspending import bring execution back to the primary stack?

Thinking about work stealing, even in the "well-behaved" JS -> wasm -> [switch.call] -> wasm -> [switch], it seems that the initial top-level JS frame would need to be kept alive (I think of this as being "captured") so long as the suspended coroutine produced by switch.call is kept alive --- later on someone could switch to this suspended coroutine and then return from the first Wasm frame to get back into JS (if I'm misunderstanding something about this scenario, please correct me).

I don't have a clear picture of how to deal with this scenario in order to enable concurrent work stealing where keeping alive the JS frame from another thread would be problematic. The best I can envisage is a "shared" version of switch.call which switches you to a new coroutine where shared suspension is allowed, at the cost of not capturing the prior coroutine if its context was non-shared. This seems to have ergonomic problems though since you permanently lose the ability to return to the JS top-level.

fgmccabe commented 7 months ago

The 'may not capture JS frames' is a hard requirement from TC39. Independent of the actual form of stack switching. do Currently, in V8, we do 'switch back' for all calls to imported JS functions, not just for suspending imports.

We have been thinking some about the top-level part of this, in particular in how to handle traps. One way would to replace:

JS -> wasm -> [switch.call] ->

with

JS->Promising->wasm->[switch.call]->...

This would also address the work stealing scenario (I believe).

tlively commented 7 months ago

@conrad-watt

The best I can envisage is a "shared" version of switch.call which switches you to a new coroutine where shared suspension is allowed, at the cost of not capturing the prior coroutine if its context was non-shared. This seems to have ergonomic problems though since you permanently lose the ability to return to the JS top-level.

One way to switch from a non-shared context to a new, shared-suspendable stack would be to create the new stack with a shared-non-suspendable function that takes a non-shared stack reference parameter. When switching to the new stack, that root function would do whatever it needs to with the non-shared stack reference (i.e. stick it in a global, a scheduler queue, etc.), then return_call into a shared-suspendable function. The return_call gets the non-suspendable frame off the stack, leaving only suspendable frames.

For switching from non-shared contexts to shared stacks that are not brand new, that trick doesn't quite work out of the box. The problem is that the shared stack would need to be in a suspendable context before executing a switch and would need to be in a shared-non-suspendable context afterwards so it can receive the non-shared stack reference. One solution would be a version of switch that simultaneously introduces a shared-barrier of some sort. Another solution would be a return_switch that behaves similarly to a return_call and can return-call into a shared-non-suspendable function that again can receive the non-shared stack reference as a parameter.

tlively commented 7 months ago

To avoid capturing JS frames, I think bag-o-stacks will require the counter implementation rather than the "switching allowed iff current stack is non-main" implementation. Assuming we keep the implementation strategy of implicitly switching to the main stack to execute JS imports, we will also need to implicitly switch back to the non-main active stack when those imports call Wasm exports, contrary to how I understand JSPI to work.

Here's what goes wrong when you don't implicitly switch back to the non-main active stack when calling exports:

JS
|
Wasm
|
. => . ;; explicit switch (A)
     |
     Wasm
     |
. <~ . ;; implicit switch to main stack to call JS import
|
JS (B)
|
Wasm
|
X ;; switch back to stack ref from (A) does not return into JS (B)

To fix that, we need to implicitly switch back to the active stack when calling a Wasm export. This makes the implicit switching to the main stack an unobservable implementation detail. But if you allow switching any time you are on a secondary stack, you'll still get problems:

JS
|
Wasm
|
. => . ;; explicit switch (A)
     |
     Wasm
     |
. <~ . ;; implicit switch to main stack to call JS import
|
JS (B)
|
. ~> . ;; implicit switch to active stack to call Wasm export
     |
     Wasm
     |
     X ;; switch back to stack ref from (A) does not return into JS (B)

To prevent both of these problems, we need to disallow switching on a non-main stack whenever it logically contains non-suspendable frames, whether or not the implementation actually executes those frames on the non-main stack. This requires either a stack walk (which is probably how we would spec it) or a counter of non-suspendable frames.

edit: the reason to spec this check as a stack walk (i.e. some "bubbling up" mechanism) is that the check would bubble up into the import itself, at which point it will simply continue bubbling up and do the right thing if the import is a Wasm function, or otherwise it will bubble up into the embedder. If the embedder is ok with its stack frames being suspended, it can continue bubbling the check up past the import. Otherwise the embedder can bubble up a trap.

conrad-watt commented 7 months ago

We have been thinking some about the top-level part of this, in particular in how to handle traps. One way would to replace:

JS -> wasm -> [switch.call] ->

with

JS->Promising->wasm->[switch.call]->...

This would also address the work stealing scenario (I believe).

Does this mean that switch would fail/trap unless the Wasm call stack was initially entered via a promise (to avoid capturing the top-level JS frame)? If so, does this mean that if any internal component of a Wasm module graph wants to use coroutines for its control flow, then synchronous call/return into any transitively-associated Wasm code would no longer be possible? Is there any alternative mechanism for guarding the top-level JS frame that would preserve the capability for synchronous entry into Wasm?

To prevent both of these problems, we need to disallow switching on a non-main stack whenever it logically contains non-suspendable frames, whether or not the implementation actually executes those frames on the non-main stack. This requires either a stack walk (which is probably how we would spec it) or a counter of non-suspendable frames.

Is there a reason that the counter approach was previously moved away from in V8? (ref @thibaudmichaud's comment).

EDIT for @tlively's other comment:

One way to switch from a non-shared context to a new, shared-suspendable stack would be to create the new stack with a shared-non-suspendable function that takes a non-shared stack reference parameter. When switching to the new stack, that root function would do whatever it needs to with the non-shared stack reference (i.e. stick it in a global, a scheduler queue, etc.), then return_call into a shared-suspendable function. The return_call gets the non-suspendable frame off the stack, leaving only suspendable frames.

This seems to imply the need for a check that there are no further parent non-suspendable frames right?

i.e. the following should be allowed non-suspendable top-level -> tail call to suspendable frame -> switch

but the following should be disallowed non-suspendable top-level -> non-suspendable frame -> tail call to suspendable frame -> switch

Also to check my understanding, in the context of going from top-level JS -> Wasm, would the initial step of "create the new stack with a shared-non-suspendable function" involve something like the promising strategy sketched above by @fgmccabe, or is the idea that the "tail-call at top level" maneuvre would be done in JS?

thibaudmichaud commented 7 months ago

Is there a reason that the counter approach was previously moved away from in V8?

This simplified the code and removed the small overhead of incrementing/decrementing the counter in the wasm->JS code. Nothing that would prevent us from moving back to it if this is required for core stack-switching, I think.

conrad-watt commented 7 months ago

One further thought to lob on top of the questions from my comment above. There's reason to believe that work stealing potentially needs more thought beyond the immediate JS->Promising->wasm->[switch.call]->... solution.

All (non-async) JS is assumed to be "non-suspendable", and all Wasm is assumed to be "suspendable", so JSPI modifications are only needed at the JS->Wasm boundary. As I mentioned above I have devex/encapsulation concerns with tying the internal Wasm use of coroutines to mandatory promises at the top-level JS boundary, but I can see on a technical level that it's workable as a conceptual extension of JSPI, because if the suspended coroutine eventually terminates, it will do so in the same thread that the promise is held in, causing it to unproblematically resolve.

In the shared case, things are more complicated. First, if a computation is suspended as a shared coroutine in one thread, and then the suspended computation is completed in another thread, should the JS promise at the boundary in the first thread resolve, and if so, how should the promise be kept alive if its only root comes from the resumed coroutine executing in another thread? Second, and more fundamentally, Wasm functions are not shared-suspendable by default - at the very least we will have the unshared->shared-suspendable (pure Wasm) boundary to worry about - not to mention shared-nonsuspendable->shared-suspendable in some scenarios. How is this boundary-cross supported in pure Wasm? Do we find some way to do it synchronously, or does it need to be purely asynchronous - e.g. a pure Wasm equivalent of the .then promise method that would be used to wire things up if this was a JS->Wasm-suspendable boundary? (edit) The former approach seems intrinsically difficult for the same reason that synchronous JS->Wasm-suspendable is difficult, the latter seems challenging compared to the JS case since pure Wasm doesn't have the JS event loop to piggy-back off of.

fgmccabe commented 7 months ago

@conrad-watt "All (non-async) JS is assumed to be "non-suspendable", ..." This is probably false. Because, as you identified before, allowing wasm to suspend can lead to JS being suspended.

conrad-watt commented 7 months ago

@conrad-watt "All (non-async) JS is assumed to be "non-suspendable", ..." This is probably false. Because, as you identified before, allowing wasm to suspend can lead to JS being suspended.

My understanding has been that all the possible mechanisms such as the discussed JS->Promising->wasm->[switch.call]->... boundary are intended to structurally prevent suspending Wasm from also causing JS to suspend. In particular there was an earlier comment

The 'may not capture JS frames' is a hard requirement from TC39.

which I interpreted as meaning that a suspended Wasm coroutine must not suspend any JS, so as to avoid being forced to capture a suspended JS frame.

Is there an idea of some mechanism for suspending Wasm that does cause JS to suspend, but doesn't cause the suspended JS frame to be captured? I don't have a clear image of how this would work - currently my impression is that (edit: non-trivial) suspension implies capturing.

fgmccabe commented 7 months ago

I misquoted, it was actually: "and all Wasm is assumed to be "suspendable", " that is false. As you noted, we will likely have to go through JSPI to access suspending in wasm.

conrad-watt commented 7 months ago

Sorry, I may have written that sentence in an unhelpful way. My point is that all Wasm functions conceptually have a static nonshared-suspendable property (with analogy to the shared-suspendable property we've been discussing in threads) - another way of saying this is that pure Wasm functions have no inherent restrictions on being suspended as non-shared coroutines - we're not introducing a new kind of "suspendable" function and preventing suspensions of existing functions. This means we don't need to worry about the boundary between such hypothetical "non-suspendable" and "suspendable" functions in pure Wasm. We only need to worry about the JS ("non-suspendable") to Wasm ("suspendable") boundary, and JSPI is a technically sufficient mechanism to protect this boundary (although I feel that its ergonomics are imperfect).

The issue when extending to shared is that existing pure Wasm functions are not inherently shared-suspendable. So now we do need to start worrying about the Wasm ("non-shared-suspendable") to Wasm ("shared-suspendable") boundary. JSPI works in the JS->Wasm non-shared case because the resolution of the JS promise can be put on the event loop after the completion of the underlying Wasm computation. As I sketch in the latter part of my comment here, the situation seems more challenging in the Wasm->Wasm shared case (edit: and to a lesser extent the JS->Wasm shared case).

fgmccabe commented 7 months ago

@conrad-watt Regarding "should the JS promise at the boundary in the first thread resolve, and if so, how should the promise be kept alive if its only root comes from the resumed coroutine executing in another thread? " I don't think this is a major issue. I believe it amounts to a write in one thread and a read in another (the original thread reads the Promise result from the writer).

fgmccabe commented 7 months ago

I am still trying to grok why the "non-shared-suspendable" -> "shared-suspendable" boundary is different to "non-shared"->"shared".

tlively commented 7 months ago

EDIT for @tlively's other comment:

One way to switch from a non-shared context to a new, shared-suspendable stack would be to create the new stack with a shared-non-suspendable function that takes a non-shared stack reference parameter. When switching to the new stack, that root function would do whatever it needs to with the non-shared stack reference (i.e. stick it in a global, a scheduler queue, etc.), then return_call into a shared-suspendable function. The return_call gets the non-suspendable frame off the stack, leaving only suspendable frames.

This seems to imply the need for a check that there are no further parent non-suspendable frames right?

i.e. the following should be allowed non-suspendable top-level -> tail call to suspendable frame -> switch

but the following should be disallowed non-suspendable top-level -> non-suspendable frame -> tail call to suspendable frame -> switch

Right. We never settled on precisely how shared-non-suspendable functions, shared-suspendable functions, and shared-barriers should interact, so I'm sure there are some designs where this solution wouldn't work. I was assuming some sort of counter to detect when there is a non-suspendable frame on the stack.

Also to check my understanding, in the context of going from top-level JS -> Wasm, would the initial step of "create the new stack with a shared-non-suspendable function" involve something like the promising strategy sketched above by @fgmccabe, or is the idea that the "tail-call at top level" maneuvre would be done in JS?

I don't believe pure-Wasm stack switching, whether multithreaded or otherwise, should require Promises on the JS->Wasm boundary, but I haven't talked to @fgmccabe about this in detail and I'm not sure we're on the same page. I was imagining that the creation of that new stack, the switching, and the tail call would all be pure Wasm. The main stack is never shared-suspendable because it always has a JS frame.

Since we're talking about promises, here's what I've been thinking for how JSPI could be respecified in terms of bag-o-stacks stack switching. JSPI-wrapped exports would implicitly allocate and switch to a fresh stack. The import wrapper, upon receiving a Promise, would switch back to the main stack, where the resumed export wrapper would chain the promise with a switch back to the non-main stack. This scheme also works for implementing JSPI in user space, which is nice.

conrad-watt commented 7 months ago

@tlively

Right. We never settled on precisely how shared-non-suspendable functions, shared-suspendable functions, and shared-barriers should interact...

The main stack is never shared-suspendable because it always has a JS frame...

I don't believe pure-Wasm stack switching, whether multithreaded or otherwise, should require Promises on the JS->Wasm boundary...

I believe the "delimited" stack switching solution (typed continuations) and the "undelimited" solution (bag-of-stacks) need very different mindsets. In a delimited setting, a synchronous call from JS->Wasm can be naturally interpreted as a handler, so a signal/stack walk propagating back up can be intercepted and turned into an error, preventing the JS frame from being captured.

In an undelimited setting where the whole stack is captured without a signal/stack walk, more care needs to be taken around synchronous boundary crosses. In particular it's not enough to declare that transitioning synchronously from JS->Wasm switches from a "non-suspendable" to a "suspendable stack". The issue comes when returning from the callee Wasm "stack". Consider, in the undelimited world, what happens when you suspend within the Wasm frame, and later resume and return from it: execution would naively jump back to the caller JS frame, which would imply a need to capture it during the initial suspension. In an undelimited setting all return operations in the coroutine must transfer control back to their original caller. This is in contrast to the delimited world where there is a "delimited" top-level frame in the continuation (identified by the stack walk) such that return from it will jump execution to the context in which the continuation was resumed.

The obvious (but brutal) solution for this concern in the undelimited setting is to structurally prevent the called Wasm from returning to the caller JS frame - mandatory JSPI accomplishes this because the JS->Wasm transition is no longer a synchronous call - now when the Wasm returns, instead of synchronously resuming execution in the caller JS frame, the returned result is fed to a promise resolution enqueued on the JS event loop.

The downside of this is that regular synchronous calls from JS -> Wasm are the "normal way" for the two to interact, so there's a tradeoff of user ergonomics (and potential misc overheads from promises/the JS event loop) in exchange for avoiding the stack walk cost associated with delimited continuations. Perhaps there's an alternative solution in the undelimited setting with different characteristics, but it would need to be evaluated carefully.

EDIT: tail calls could be a potential mechanism, but the tail call would need to be carried out at the JS->Wasm boundary, and there would need to be no further parent JS frames, which seems challenging.

Then, we get to the extended version of this boundary concern when going from nonshared->shared-suspendable in pure Wasm. We previously discussed a "delimited" solution with a special kind of handler which would catch attempts to propagate a "shared-suspend" signal through a nonshared frame. Again, we have to adopt a different mindset when dealing with undelimited continuations, where there is no signal to catch.

@fgmccabe

I am still trying to grok why the "non-shared-suspendable" -> "shared-suspendable" boundary is different to "non-shared"->"shared".

In a world without shared coroutines, direct calls from nonshared->shared are fine, because the shared part of the call stack is guaranteed to continue executing in the same thread as the nonshared part of the call stack. When the shared part returns, the nonshared part similarly continues executing in the same thread. That is, shared functions can be shared between threads, and executed in any thread, but once execution of the function starts the associated call stack is "pinned" to that thread until completion. In a world where the computation can be suspended as a shared coroutine, if the computation is resumed in a new thread, returning from shared to nonshared becomes more problematic, because without intervention the nonshared part would start executing in the new thread, violating our GC invariants. Again the obvious solution in the undelimited setting is to structurally prevent the shared part from directly returning to the nonshared caller, but in pure Wasm we don't have the JSPI mechanism so some more thought is needed.

WRT the resolving cross-thread promise...

I don't think this is a major issue. I believe it amounts to a write in one thread and a read in another (the original thread reads the Promise result from the writer).

The main issue is the GC requirement to keep the yet-unresolved promise alive cross-thread. This isn't impossible, but would commit us to a particular implementation strategy that V8 is experimenting with, but isn't universally supported (for @tlively, it's ephemerons again). I agree this isn't as fundamental as the pure Wasm boundary concern.

tlively commented 7 months ago

Sorry, my previous explanation wasn't correct or clear enough. Here's an attempt at a well-organized explanation of how bag-o-stacks could interact with JS without any mandatory JSPI.

The property we need to maintain is that JS cannot observe stack switching. In other words, JS frames should follow a stack discipline, i.e. they should be returned from in the opposite order they are entered, just as if there were only a single global stack (per thread). This property is violated iff it is possible to "skip" a JS frame when returning. As long as we maintain this property, it doesn't matter what stacks JS frames are logically or actually executed on.

The key insight is that the main stack and non-main stacks are fundamentally different and therefore can have different rules about when switching is allowed. The main stack always has JS frames above the latest Wasm frames. Non-main stacks always start with Wasm frames, so they may or may not have JS frames above the latest Wasm frames.

If we can maintain the invariant that JS frames follow a stack discipline when non-main stacks are active, then it will always be safe to switch off of the main stack, despite the fact that it might have arbitrarily interleaved JS and Wasm frames. Stack discipline is necessarily maintained when the main stack is active because it is a single stack, so if stack discipline is also followed whenever non-main stacks are active, then stack discipline will always be followed.

To maintain stack discipline whenever non-main stacks are active, we can disallow switching off of a non-main stack whenever it contains a JS frame. This is a reasonable restriction for non-main stacks because, unlike the main stack, it is actually possible for them to not contain JS frames. Stack discipline is maintained because the system is restricted to use a single stack as long as there are JS frames on the non-main active stack.

To summarize, the rules are:

These rules also prevent traps or exceptions from making stack switching observable to JS.

This presentation was clearly specific to the JS embedding, but it extends to other embeddings as well. We could conservatively enforce these rules for all embeddings, preventing any embedder from observing stack switching. Alternatively, we could allow embedders to opt-in to be suspendable when they call back into Wasm on non-main stacks.

Extending these rules to handle shared stack references is straightforward. It is never valid to switch off of any stack, including the main stack, if it contains non-shared-suspendable frames and the switch would produce a shared stack reference.

conrad-watt commented 6 months ago

The property we need to maintain is that JS cannot observe stack switching. In other words, JS frames should follow a stack discipline, i.e. they should be returned from in the opposite order they are entered, just as if there were only a single global stack (per thread).

Can you explain in more detail why this is sufficient? In particular @fgmccabe mentioned another property, "may not capture JS frames". The key point I was trying to get across in my previous comment was that even if you consider Wasm to be running in a "non-main stack", if it's possible for that Wasm code to return to a conceptually separate parent JS main stack frame, then if that Wasm is suspended, the parent JS stack frame must be captured too unless special measures are taken to change the behaviour of that return (e.g. inserting a delimiter, or some other structural approach like JSPI). Otherwise, after resuming the Wasm code it will be impossible to correctly execute the return instruction.

When I try to synthesise my understanding with your comment above, I wonder if we have different views of what it means to capture a JS frame. Do you consider JS frames on the main stack to never be "captured" by definition, even if they are only present in order to support the future execution of a suspended coroutine, and will only be removed from the stack if the coroutine is resumed? I could potentially agree with this perspective if we were only talking about the initial top-level JS stack frames, but you also seem to be suggesting in a previous comment that calling JS exports from Wasm would cause further JS frames to be added to the main stack. This seems to me to create a much more clearcut example of "capturing" if such JS frames can themselves synchronously call suspendable Wasm, even if that Wasm is moved to a "non-main" stack.

tlively commented 6 months ago

The property we need to maintain is that JS cannot observe stack switching. In other words, JS frames should follow a stack discipline, i.e. they should be returned from in the opposite order they are entered, just as if there were only a single global stack (per thread).

Can you explain in more detail why this is sufficient? In particular @fgmccabe mentioned another property, "may not capture JS frames".

The real requirement from TC39 is that they don't want to worry about how to update the JS semantics or implementations to account for any new kind of non-local control flow. This is achieved if JS can never observe any control flow that was not previously possible, and that in turn is achieved if we require that JS frames follow a stack discipline, because for every program execution that uses stack switching, we could construct a separate program execution that does not use stack switching that calls and returns into JS in the exact same way.

With typed continuations, "may not capture JS frames" implies "JS frames follow a stack discipline," which we know satisfies "may not expose new non-local control," so we've been using the former as a shorthand for the latter. With bag-o-stacks, however, the intuitive meaning of "capturing" a frame is completely different, so we have to be more precise.

conrad-watt commented 6 months ago

I think I've just reached a better understanding of your proposed solution, so I'll try to explain my perspective and you can tell me if I've got something wrong.

With bag-o-stacks, however, the intuitive meaning of "capturing" a frame is completely different, so we have to be more precise.

It seems to me that the definition of capturing ends up basically the same in both the typed-continuation and bag-o-stacks case - I'm interpreting the second diagram in your comment here + your explanation above as meaning that it's considered ok (as an exception to "never capture JS frames") for at most one coroutine (counted globally) to capture JS frames, because in this case there are still no JS frames in other coroutines that would make switches observable to JS. Since all Web execution starts with a JS frame, this allowance is essentially always "used up" by the initial suspended coroutine created by the very first switch.call. I'm a little suspicious as to whether capturing JS frames in this way plays well with engines, but even if it turns out that this special case isn't ok and no coroutines are allowed to capture JS frames, the language features we're discussing don't fundamentally change - it would just change where your proposed delimiters/handlers are inserted (see below).

In the same comment, I interpret your proposal as essentially adding a delimiter/handler at the JS->Wasm boundary if the JS was itself entered from a non-initial Wasm coroutine (i.e. it is non-top-level JS). This prevents subsequent suspensions from capturing any JS frame. To avoid the cost of a stack walk (which is normally required to find handlers) the presence of the handler is encoded in a counter which is eagerly propagated through the call stack. So at an inner call where switch/switch.call is executed, the counter can be checked to determine if a stack walk would find a handler at the JS->Wasm boundary.

I agree the counter strategy is technically feasible and allows bag-o-stacks to continue avoiding a stack walk, although it's worth observing that this strategy relies on there only being one kind of handler (or more generally a very small number of special cases) in scope at once. Otherwise the information that needs to be eagerly propagated can no longer be encoded in a single counter (in the worst case, you might need to propagate more general references to the current in-scope handlers).

In particular trying to extend this strategy to cover the unshared->shared-suspendable case requires a second kind of handler that only traps on suspend-shared signals. So to continue avoiding a stack walk you'd need to at the bare minimum eagerly propagate two counters through the Wasm call stack. It might be possible to bitmask the two counters into a single scalar value at least, but the state would have to be managed at every relevant boundary.

It's also worth explicitly calling out that this strategy isn't a perfect story for expressivity/modularity, because there might be times you legitimately want to suspend your Wasm computation, but are prevented from doing so because you happened to enter it transitively through a non-main stack. In particular I may have a stand-alone module that happens to internally use coroutines for control flow, which I enter and exit as a black box. Whether or not this module can execute now depends on whether any of my transitive parent stack frames happen to contain a switch to a non-main stack.

EDIT: actually, the compositionality of this solution becomes particularly bad when generalised - consider the unshared->shared-suspendable generalisation - in pure Wasm this would mean that no shared-suspend could succeed if there are transitively any synchronous unshared->shared-suspendable calls anywhere in the call stack. At least in the immediate solution, there is the safety valve of JSPI to "reset" the counter.


Going in the other direction, I wonder what it would look like to restrict typed continuations to only having a single kind of handler (or very simplified selection of handlers) in scope at once, so that it could also take advantage of the "eager propagation" strategy. Are there any existing discussions on this?

In fact, if bag-o-stacks would commit to propagating a counter through all Wasm calls, could typed-continuations, even in its current form, avoid a stack walk by analogously eagerly propagating a pointer to the closest in-scope handler? [edit: this gets complicated because of daisy-chaining of subsequent handlers]

tlively commented 6 months ago

It seems to me that the definition of capturing ends up basically the same in both the typed-continuation and bag-o-stacks case... Since all Web execution starts with a JS frame, this allowance is essentially always "used up" by the initial suspended coroutine created by the very first switch.call.

Yes, exactly. That's another perfectly good way to look at it, and is closer to how we would probably spec this. That one globally counted initial stack would be the "main stack" by definition.

... I interpret your proposal as essentially adding a delimiter/handler at the JS->Wasm boundary if the JS was itself entered from a non-initial Wasm coroutine (i.e. it is non-top-level JS).

Switching off the main stack when there is a Wasm->JS->Wasm sandwich is still ok, so I think everything sounds right here except possibly for the "i.e. it is non-top-level JS", which I interpret to mean any JS that has Wasm somewhere above it on the call stack.

... To avoid the cost of a stack walk (which is normally required to find handlers) the presence of the handler is encoded in a counter which is eagerly propagated through the call stack... In particular trying to extend this strategy to cover the unshared->shared-suspendable case requires a second kind of handler that only traps on suspend-shared signals. So to continue avoiding a stack walk you'd need to at the bare minimum eagerly propagate two counters through the Wasm call stack...

Yes, exactly. I don't anticipate a need for further counters, but I'd be interested in hearing if anyone else has ideas for other extensions that would need more.

It's also worth explicitly calling out that this strategy isn't a perfect story for expressivity/modularity, because there might be times you legitimately want to suspend your Wasm computation, but are prevented from doing so because you happened to enter it transitively through a non-main stack...

In general yes, but only if a JS stack frame gets mixed up in things as well. This problem is also avoidable by doing exactly what the library would be doing with typed continuations anyway: allocating and switching to its own stacks internally so it knows there will be no interfering JS frames.

... At least in the immediate solution, there is the safety valve of JSPI to "reset" the counter.

Yes, or just switching to a new stack under local control. I guess we still need to figure out whether/how JSPI is going to interact with shared stacks at all.

Going in the other direction, I wonder what it would look like to restrict typed continuations to only having a single kind of handler (or very simplified selection of handlers) in scope at once, so that it could also take advantage of the "eager propagation" strategy. Are there any existing discussions on this?

Not that I know of.

In fact, if bag-o-stacks would commit to propagating a counter through all Wasm calls, could typed-continuations, even in its current form, avoid a stack walk by analogously eagerly propagating a pointer to the closest in-scope handler? [edit: this gets complicated because of daisy-chaining of subsequent handlers]

Yeah, we've thought of strategies where we propagate a vector of handler pointers indexed by event tag (although when @fgmccabe and I were thinking most about this it was in the context of dynamic scoping). I'm not convinced we came up with anything that correctly handled all the daisy-chaining, though.

conrad-watt commented 6 months ago

Thanks, I really appreciate your time, and I think we're now mostly on the same page about the state of the world.

This problem is also avoidable by doing exactly what the library would be doing with typed continuations anyway: allocating and switching to its own stacks internally so it knows there will be no interfering JS frames.

Yes, or just switching to a new stack under local control.

I've not yet seen a part of the bag-o-stacks proposal that supports allocating a new stack internally without either capturing the current stack (and thus hitting the trap that we're trying to avoid by creating new "internal" stacks), or piggy-backing off the JS event loop by completing asynchonously. To be unambiguous, the most challenging case is a Wasm library/module that wants to present a synchronous interface while doing internal control flow using coroutines/continuations - so having an asynchronous JS wrapper that sticks things behind JSPI wouldn't be sufficient. If bag-o-stacks intends to support this, would you be able to sketch the setup?

tlively commented 6 months ago

Ah yes, you’re absolutely right. Just switching to a new stack doesn’t work if the caller has its own interleaving JS frame and is on a non-main stack.

If you consider typed continuations to be a “stack of stack segments” design, it can maintain the JS stack discipline through multiple chained stack segments. Since bag-o-stacks doesn’t have that overarching stack organization, it has to be more conservative about how it interacts with JS to ensure the stack discipline.

fgmccabe commented 6 months ago

In fact, even with typed continuations (wasmfx), you will still need to prohibit interleaving JS frames with wasm frames in an active computation. In particular, when you suspend to a handler, you must verify that there are no JS frames en route to the handler. That implies a frame-by-frame inspection; you might be able to accelerate that with a stack counter: you are then aggregrating counters across multiple stacks/continuations.

tlively commented 6 months ago

Right, there can't be JS frames in the suffix of frames that get suspended, but importantly there can be arbitrarily many JS frames on any of the stack segments above the target handler. If you were to implement wasmfx in terms of bag-o-stacks, you would be limited to having JS frames only on the root stack segment, so wasmfx is strictly more expressive on this front.

fgmccabe commented 6 months ago

I don't think that wasmfx and bos are any different here.

fgmccabe commented 6 months ago

You may not suspend any JS frames. So, if you have them in your ancestry, a suspend must trap.

fgmccabe commented 6 months ago

Strictly, if there are any JS frames between the suspending frame and the handler frame; then you must trap. This does allow JS frames to be temporarily on the stack -- so long as you are not suspending them.

tlively commented 6 months ago

Here's a diagram showing the difference. The setup here is that we are either using wasmfx directly or using bag-o-stacks to emulate wasmfx, so we have a stack of stack segments. We are not suspending past the middle stack segment, so it is fine for it to contain a JS frame in wasmfx. In bag-o-stacks, however we can't even have reached this point because switching to the third stack segments with a JS frame in the second stack segment wouldn't have been allowed.

Stack switching and JS

lukewagner commented 6 months ago

I wonder if there's an alternative solution to this root problem that avoids the complex dynamic error cases:

First, generalizing from JS to arbitrary wasm hosts, I expect most wasm hosts will want their host imports to be called on the same stack (and thread) as the host used to call into the wasm runtime; anything else potentially creates significant challenges for the rest of the host code given the widespread ambient assumptions about native stacks (just like in the browser). So my impression is that our solution shouldn't require JS to be executed on wasm stacks -- it might be, but I think we should consider that an optional VM optimization and design under the assumption that JS only runs on the main stack.

Based on this, I was noticing some symmetry between stack-switching and multi-threading: in both cases we have a notion of a "main" thread/stack that we enter on from JS and that JS wants to be called back on. Thus, I'm wondering if we could add a function type annotation symmetric to shared which means (in its absence) "I'm on the main stack" (as opposed to "I'm on the main thread").

Just to sketch something that's probably not quite right but paints a picture, let's say we call our attribute switched:

Based on these rules, once you're in a switched function, the only way to "leave" (without trapping, which I'll get back to) is to switch to a stack that was produced by a switch.call from inside a nonswitched function (which was necessarily called from JS on the main stack). Thus, it's the guest wasm's responsibility to hold onto this "entry" stack so that when switched code wants to return to the JS export caller or call a JS import, it can switch to the "entry" stack (using parameters to convey whether to make an import call or return). Importantly, because this switch back to the main stack consumes the stack, there will only ever be at most one stack pointing to the main stack in the thread and thus the above dynamic error cases are avoided. This also means that if we trap, there is an unambiguous JS caller to unwind to. (I expect this also means we could consider this whole approach to desugar to a restricted use of typed continuations where, by construction, there is a max-segment-depth of 2.)

Lastly, I think switched would compose with shared in a mostly-obvious way; the only subtly I see is that a switch.call from inside a shared nonswitched function needs to pass a nonshared stack to the switch-target (since "I'm on the main stack" is a thread-local fact) whereas a switch.call from a shared switched function would pass a shared stack to the switch-target. Given that, for a language like Go that wants both threads and stacks, all but the entry/exit trampoline functions would be shared switched (so they can be work-stolen on any thread), but if you want to call from Go code into Worker- or Window-local JS, you have to switch back to the nonshared stack of the nonswitched function on the Worker's or Window's main stack, which you'd get to from a shared switched function via TLS... which seems about like what you'd expect.

(Sorry for the long comment, happy to take it to a different issue if we don't want to consider it inline here.)

slindley commented 6 months ago

Going in the other direction, I wonder what it would look like to restrict typed continuations to only having a single kind of handler (or very simplified selection of handlers) in scope at once, so that it could also take advantage of the "eager propagation" strategy. Are there any existing discussions on this?

Applying the restriction to a single handler being in scope at once gives rise to Lua-style asymmetric coroutines, or closer to home the Wasmtime fiber library on which the current Wasmtime implementation of WasmFX is built.

conrad-watt commented 6 months ago

@tlively

Here's a diagram showing the difference...

This is a great diagram! An additional concrete way I would characterise this, with my example of an encapsulated Wasm module with a synchronous interface but internal control flow based in coroutines/continuations - typed continuations allows "capturing" delimiters/handlers to be inserted within the module boundary to support this use-case, so that only the "prefix" of the stack (containing no JS frames) is captured whatever the state of the "suffix", but as you noted bag-of-stacks can't replicate this encapsulation, since its only delimiters/handlers cause traps.

@lukewagner

So my impression is that our solution shouldn't require JS to be executed on wasm stacks -- it might be, but I think we should consider that an optional VM optimization and design under the assumption that JS only runs on the main stack.

I think everything stays the same (including the distinguishing example between typed continuations and bag-o-stacks) if you think of the JS frames in @tlively's diagram as all running on the main stack but with trampolines to/from the separate Wasm stacks at each point that a non-top-level JS frame would be entered or "returned to/from". So I think we're already in the world you want us to be in. This is a knee-jerk reaction though, so please correct me if I've misunderstood something.

conrad-watt commented 6 months ago

@tlively apologies for this curt additional comment, but I'm on the move right now. There are some clarifications that should be made relating to the diagram that I didn't notice in my first reading. The place where typed-continuations and bag-of-stacks differ is that in the second (and third!) stack, typed-continuations allows handlers to be inserted so that the Wasm-only suffix of the stack can be suspended - this is where the "encapsulated module" would live in my example.

If we interpret the arrow between the second and third stacks as a synchronous call (allowing a return, although perhaps executing on a separate stack), both typed-continuations and bag-of-stacks (can) allow this. Similarly, if we interpret this arrow as suspending the second (whole) stack and switching to the third, both proposals disallow this (because the second stack's JS frame is captured). The gap in expressivity comes from being able to subdivide stacks through non-trapping delimiters, so that just the "well-behaved" part of the stack can be suspended.

tlively commented 6 months ago

The intention of the diagram is that the wasmfx handlers correspond 1:1 with the boundaries between the stack segments, under the assumption that the general implementation strategy will be to create a new engine-internal stack every time a wasmfx handler is encountered. So yes, the stacks could be subdivided further by adding more handlers, but that would just result in having more rectangles in the diagram and would not show any additional difference between bag-o-stacks and wasmfx.

rossberg commented 6 months ago

One well-known issue with undelimited continuations is that ultimately some delimiter is always need, at least at the boundary of the runtime system. The fact that JS and Wasm can interleave just makes this gap wider: you need a delimiter at every interleaving, but some associated handlers behave differently than others.

The upside is that we should be able to explain/encode the JS/Wasm interop behaviour needed for bag-of-stacks in terms of the primitives of the continuations proposal: essentially, bos installs an implicit handler at the initial JS→Wasm call (which does the switching and captures exceptions), and an implicit barrier at every inner JS→Wasm call (which is a form of reject-all handler that does not require its own stack — the implementation we have considered for supporting barriers is the exact same per-stack counter mentioned earlier in this thread).

In comparison, the continuations proposal will uniformly install an implicit barrier at every JS→Wasm call and nothing else.

As an aside, a direct consequence of this observation is that the continuations proposal is more expressive, because there is no such local encoding in the other direction — you’ll need a complex and potentially expensive whole-program transformation to express delimited continuations in terms of bos.

PS: I'm about to vanish from the radar for a week, in case I don't respond to follow-ups. :)

slindley commented 6 months ago

I find the picture by @tlively quite illuminating.

Another perspective that I think can be helpful to understanding here is that the restriction imposed by JS can be thought of in terms of well-bracketting of function calls. In order for JS->Wasm calls to appear to JS as if they are sequential they must be well-bracketted. So for instance let f and h be JS->Wasm calls and g be a Wasm->JS call such that f calls g and g calls h, then it is important that h returns before f does.

lukewagner commented 6 months ago

@conrad-watt Cool; I was worried when I started hearing about JS frames being taken off the main stack that it was an essential part of the design. That being said, that's not an essential motivation for the idea I presented in the rest of the comment. The essential idea is to use a function type annotation in a manner symmetric to shared to avoid what otherwise seems to be some fairly hairy dynamic failure modes.

conrad-watt commented 6 months ago

@lukewagner the biggest friction I see with the switched idea is that IIUC you have to switch back to a nonswitched stack every time you want to call JS. Since the switched attribute will transitively pollute a lot of the function types of the module (like shared), this means that there's a wide-reaching cost imposed on JS calls across the entire program even if only a small part of the execution actually suspends. I think this overhead ends up greater than the analogous overhead in the shared case (in the latter case direct calls either turn into ref.call or have a "current thread" check added at the Wasm->JS boundary). If one wanted to get back to allowing JS calls inside switched functions, I think one would need to reintroduce a delimiter at the JS->Wasm boundary, and then I think we're mostly back to the design we've been focussing on in this discussion.

lukewagner commented 6 months ago

@conrad-watt My underlying assumption is that in general (other than in some carefully-audited cases where the transitive call path has been audited to be "safe" for execution directly on these special wasm stacks) host functions called by wasm are going to need to switch away from any wasm-created stacks to the host's main stack and thus putting it directly in core wasm would be beneficial in making the cost model more explicit. I even thought that that was going to be the expected implementation for JS (and is currently the case in JSPI implementations), but maybe I'm wrong here?

conrad-watt commented 6 months ago

That's an interesting point. I realise I'm a little fuzzy on what some of the costs involved here are. My best understanding - with the current proposals, host functions called by Wasm don't need to switch to a new stack (although it's a permitted implementation if desired), so long as an appropriate handler or stack switch is installed on the JS->Wasm side to prevent any stack with a JS frame from being suspended.

With this view, the switched annotation would remove the need for the handler - in the bag-o-stacks proposal this is the overhead associated with the "counter". However it introduces a new overhead on the Wasm->JS side (the new need for an explicit switch).

If I've got the above wrong, someone please correct me!

lukewagner commented 6 months ago

I'd also be interested to hear any updated thinking on this from the JS engines, but my (possibly-stale) understanding was:

Thus, even though one can imagine the JS JIT being well-behaved and not making assumptions broken by being off the main stack, I had thought that the general plan of record was for generic Wasm-->JS calls to conservatively transition back to the main thread. But if that's no longer the thinking, then I agree that forcing the switched-->nonswitched transition may add unnecessary overhead (essentially because it's no longer reflecting the underlying cost model).

fgmccabe commented 6 months ago

@lukewagner V8 switches to the central stack for all calls to JS (and host code) primarily because of those subtle platform-dependent issues. Almost always this is not related to main computation flows. A good example is profiling: most profilers assume that there is only one stack and break in the face of multiple stacks. So we switch to the central stack so that we don't break those profilers (even though Chrome does not actually use a third party profiler - it might). Another example is GC: V8 recently switched to conservative stack scanning and the scanner as a whole breaks if there are multiple stacks.

We hope that, in some not too distant future, we can fix these issues and allow JS and host calls to run on secondary stacks; but now is not the time to fix that.

fgmccabe commented 6 months ago

@conrad-watt The reason we switch to the central stack has nothing to do with avoiding suspending JS/host calls. See above comment.

conrad-watt commented 6 months ago

@conrad-watt The reason we switch to the central stack has nothing to do with avoiding suspending JS/host calls. See above comment.

@fgmccabe I think your explanation actually matches what I was speculating above - that ideally Wasm->JS/host calls won't involve an implicit stack switch (and thus @lukewagner's suggestion to mandate an explicit switch on such transitions would potentially bake a performance penalty into the language). I did note that some mechanism (handler or stack switch) on the JS->Wasm side still seems necessary, such as the "counter" we discussed earlier in the issue.

tlively commented 3 months ago

The original question has been answered and the productive follow-on discussion has been resolved, so I'm going to close this, but feel free to reopen (or open a new issue) for further discussion.