shared-suspendable and shared-fixed as separate function types

In our discussion on https://github.com/WebAssembly/shared-everything-threads/issues/42, we noted that a "safety valve" decision for JS function access, should we fail to reach consensus on (strong/weak) thread-local functions, would be to (re)introduce a version of shared function whose execution cannot be suspended as part of a (hypothetical) shared continuation.

Currently our design doesn't permit nonshared parameters to shared functions in order to be forward-compatible with shared continuations, which would allow such a nonshared object to be smuggled into another thread by suspending execution and resuming in another thread. By forbidding such suspensions, we could allow nonshared parameters, and thus pass in a JS context (e.g. a struct containing unshared references to JS functions) as a regular parameter that would be threaded through execution (or a context local as sketched here), giving a mechanism to call JS functions from shared Wasm functions.

This issue is to discuss the design implications of this approach. A few initial points:

- As noted in my previous comment, the question of how the stack pointer is handled can be answered separately, as it is a scalar parameter. We could choose not to have thread-local globals (threading the stack pointer as a parameter), but still choose to support (strong/weak) thread-local functions.
- All calls to JS through the "context parameter" mechanism have to be indirect calls. Without a way to statically import JS functions as shared, unconditional inlining (the way V8 prefers) is likely infeasible.
  - Strong/weak thread-local functions just about make unconditional inlining feasible, so long as every instance-crossing is accompanied by a bitmap "is-on-fast-path" check. The feasibility of this bitmap approach is discussed/criticised here.
  - Given the above, maybe the only way to get unconditional inlining is to genuinely provide new JS builtins that are importable as shared, in the manner of string builtins.

To wear my heart on my sleeve, I'd prefer just the shared-suspendable semantics with some mechanism for thread-local functions (weak if necessary) over the below. That being said, I don't think the below is awful, and it seems less controversial from a GC engineering perspective.

Note that, if we still believe that shared continuations will eventually exist, the below approach doesn't permanently solve our current problems, but instead pushes them into the future. Any compilation scheme wanting to use shared continuations will need to mark most functions as shared-suspendable, and so for such a scheme we'd still need to solve the same problem of JS access that we have in the current design (e.g. by introducing thread-local functions).

Design Sketch

Terminology

For the purposes of this discussion, I'm going to refer to functions as being either nonshared, shared-suspendable, or shared-fixed (a different name for shared-nonsuspendable, which is a mouthful). The distinction between these functions would be enforced by a static annotation on the function type.

nonshared functions are what we have today. Remember that in general, shared things can't capture nonshared things.

shared-suspendable functions are the "fully shared" functions we've been discussing before this point:

- No references, access, or calls to any module state with a nonshared type (globals/tables/functions)
- No nonshared parameters/locals
- No instructions that generate new nonshared references in the body (e.g. struct.new)
- These functions can be used to create shared continuations

shared-fixed functions are somewhat more relaxed:

- Still no references, access, or calls to any module state with a nonshared type (globals/tables/functions)
- nonshared parameters/locals are allowed
- Instructions that generate new nonshared references in the body (e.g. struct.new) are allowed
- These functions cannot be part of a shared continuation, but can be part of a nonshared continuation

Intuitively, the call stack of a shared-suspendable function could be captured as part of a shared continuation, and resumed in another thread. This means it's not safe for the frame to ever capture a nonshared object, even transiently. In contrast, shared-fixed function calls are guaranteed to stay in the same thread for the entire duration of their execution. Therefore it's safe to pass in nonshared objects as parameters, and materialise them during the function call's execution. Prior to the standardisation of shared continuations, only shared-fixed functions would be definable.
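
As a rough illustration of the three kinds described above, here is a minimal Python sketch of the body restrictions. The feature names (`nonshared_module_state` and so on) and the `validate` helper are made up for this example and are not part of any proposal text; the sketch only restates the bullet lists.

```python
from enum import Enum

class Kind(Enum):
    NONSHARED = "nonshared"
    SHARED_SUSPENDABLE = "shared-suspendable"
    SHARED_FIXED = "shared-fixed"

# Hypothetical labels for things a function body might do.
# Each kind maps to the set of features it is allowed to use.
ALLOWED = {
    Kind.NONSHARED: {
        "nonshared_module_state",
        "nonshared_params_locals",
        "creates_nonshared_refs",
    },
    Kind.SHARED_FIXED: {
        "nonshared_params_locals",
        "creates_nonshared_refs",
    },
    Kind.SHARED_SUSPENDABLE: set(),
}

def validate(kind, features):
    """Return the features used by the body that its kind forbids."""
    return set(features) - ALLOWED[kind]

def can_join_shared_continuation(kind):
    """Only shared-suspendable frames may be captured in a shared continuation."""
    return kind is Kind.SHARED_SUSPENDABLE
```

For instance, `validate(Kind.SHARED_FIXED, {"nonshared_params_locals"})` returns an empty set, while the same features on a shared-suspendable function are rejected.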

Restrictions on calling

(EDIT: see this comment for an alternative approach with different restrictions)

shared-suspendable functions can always call shared-fixed functions with no restrictions. The extent to which shared-fixed functions can call shared-suspendable functions depends on some design decisions of stack switching. By default, all forms of shared-fixed->shared-suspendable call would be disallowed by validation (implying annotations/type tracking of [non]suspendable on relevant call instructions).

If the stack switching proposal includes a lexical barrier instruction (e.g. see here), it seems feasible to also include a concept of a "shared-only" barrier which traps upon an attempt to capture a shared continuation, but not a non-shared one. All forms of call to shared-suspendable functions could be allowed inside the body of this barrier. On reflection, I don't think that a shared-fixed call should implicitly introduce such a barrier, since this would mean that every shared-fixed call would have the implicit overhead of "Check if I'm in a continuation and if so, set the barrier bit". Instead I think the shared-only barrier should always be explicit (and would switch validation from shared-fixed mode to shared-suspendable mode within its body). @rossberg please correct me if I'm wrong about the above.

A shared-fixed function could also call a shared-suspendable function by wrapping the latter as a shared continuation and using a hypothetical resume_barrier instruction (as sketched here https://github.com/WebAssembly/stack-switching/issues/44#issuecomment-1909545807). If a "shared-only" barrier proves infeasible, this would be the only way to make a shared-fixed->shared-suspendable call.

Note that in either case, it's still ok for a shared-fixed function to hold a reference to a shared-suspendable function; only calling is complicated. This means that we don't need to distinguish between different kinds of shared for tables and globals - the [non]suspendable distinction is only needed for callable things.
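
The calling restrictions in this section can be summarised with a small Python sketch. It only covers calls between the two shared kinds, and the function name and the `in_shared_barrier` flag are assumptions made for illustration (the flag stands in for either a "shared-only" barrier or a resume_barrier-style call).

```python
SHARED_KINDS = ("shared-suspendable", "shared-fixed")

def call_validates(caller, callee, in_shared_barrier=False):
    """Model of the restrictions sketched above, for shared callers and callees.

    Holding a reference is never restricted; only the call itself is checked.
    """
    if caller not in SHARED_KINDS or callee not in SHARED_KINDS:
        raise ValueError("this sketch only covers calls between shared functions")
    if caller == "shared-fixed" and callee == "shared-suspendable":
        # Disallowed by default; permitted only under a barrier.
        return in_shared_barrier
    # shared-suspendable -> shared-fixed, and same-kind calls, are unrestricted.
    return True
```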

The big design question here is whether the difference between shared-suspendable and shared-fixed is reflected only in the validation contexts for the functions or also in the function types.

If the two are not differentiated at the type level, then there would be no way to statically disallow indirect calls from shared-fixed to shared-suspendable functions. That would be fine as long as shared-fixed functions had the dynamic semantics of implicitly setting a shared-barrier for the duration of their execution. We would be using a runtime check instead of the type system to ensure that shared-fixed function frames are never captured in shared continuations.

Alternatively, if we do differentiate at the type level, we can statically disallow shared-fixed functions from calling shared-suspendable functions outside an explicit shared-barrier. This would be kind of annoying because the different kinds of shared functions would not be interchangeable, but at least a shared-suspendable function could be adapted to be a shared-fixed function by a wrapper that explicitly set up a shared-barrier and called the underlying function.

Personally, I think the former option, where we do not distinguish at the type level, is more attractive. Having to have wrapper functions and explicit shared-barriers to achieve the same runtime semantics and function interoperability is a bunch of complexity and code size for no benefit AFAICT.

If the two are not differentiated at the type level, then there would be no way to statically disallow indirect calls from shared-fixed to shared-suspendable functions. That would be fine as long as shared-fixed functions had the dynamic semantics of implicitly setting a shared-barrier for the duration of their execution. We would be using a runtime check instead of the type system to ensure that shared-fixed function frames are never captured in shared continuations.

For this reason, my intuition is that we will need differentiation at the type level. Based on @rossberg's sketch here I believe that such a barrier would carry an eager runtime cost (at least setting a bit in the stack). This would mean that even code just compiling to shared-fixed would be paying a price "just in case" some shared-suspendable code gets into an indirect call. An even worse scenario would be if the barrier isn't initially implemented when shared-fixed is introduced, so that the later introduction of shared-suspendable at the language level would require existing shared-fixed code to regress in performance.

Having to have wrapper functions and explicit shared-barriers to achieve the same runtime semantics and function interoperability is a bunch of complexity and code size for no benefit AFAICT.

I think the benefit would be that code just using shared-fixed wouldn't need to incur a runtime overhead to defend against the possibility of a shared-suspendable indirect call. If one wants to go from shared-fixed->shared-suspendable, one needs to explicitly add the barrier/handler instructions that capture the overhead of setting the necessary bits in the stack/setting up the suspend handler.

Actually, now I'm wondering if the same argument about overheads applies to nonshared->shared-suspendable calls. This may mean that a type-level distinction between shared-fixed and shared-suspendable is actually valuable even if we do find a way to have thread-local functions, so as to allow code just doing nonshared->shared-fixed calls to avoid unnecessary overheads.

EDIT: IMO the variant I discuss in the OP with a resume_shared-barrier instruction would be a cleaner solution than a block-level shared-barrier which would require different validation rules in its body.

Actually, I realise there's an alternative design that may make more sense. Instead of allowing only shared-suspendable->shared-fixed calls, allow only shared-fixed->shared-suspendable calls. Now when a shared-suspend happens in a shared-suspendable function, it can just search for the first handler. If the handler is for a shared continuation, it's known that all the frames in between are shared-suspendable (because shared-resume can only happen on a shared-suspendable function). If the handler is for a nonshared continuation, trap. Calling from shared-suspendable to shared-fixed is allowed only through a nonshared-resume handler (which would cause shared-suspend in subsequent shared-suspendable frames to trap).

I think this might fit the existing model of stack switching better, where functions that may suspend can still be called even without a handler, but attempting to actually suspend just traps. It would also allow nonshared->shared-suspendable calls just fine. In my OP design, a shared-suspendable function can only be entered from other Wasm if at least one handler is created, which doesn't seem consistent with the unshared case.
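
To make the handler search in this alternative design concrete, here is a small Python model. The stack encoding (a list of `("frame", kind)` and `("handler", "shared" | "nonshared")` entries) and the `Trap` exception are invented for this sketch, and it ignores tags entirely by treating the nearest handler as "the first handler".

```python
class Trap(Exception):
    """Stands in for a Wasm trap in this model."""

def shared_suspend(stack):
    """Walk from the innermost entry outwards to the first handler.

    stack is outermost-first. Returns the kinds of the captured frames
    (outermost-first) or raises Trap.
    """
    captured = []
    for tag, value in reversed(stack):
        if tag == "handler":
            if value == "shared":
                return list(reversed(captured))
            raise Trap("first handler belongs to a nonshared continuation")
        captured.append(value)
    raise Trap("no handler on the stack")
```

Under the rule that shared-suspendable code can only reach shared-fixed code through a nonshared-resume handler, every frame returned by `shared_suspend` is shared-suspendable, because a shared-fixed frame above a shared handler is always shielded by a nonshared handler that makes the search trap first.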

because shared-resume can only happen on a shared-suspendable function

I don't think there's any reason to disallow shared continuations from being resumed from shared-fixed or even unshared functions, right? It's just like calling a shared-suspendable function.

Sorry, I meant that shared-resume can only act on a shared continuation, which must have been created from a shared-suspendable function. I agree that the execution of shared-resume could occur within the body of a shared-fixed or nonshared function.

The fact that disallowing suspendable->fixed calls seems reasonable and that separately disallowing fixed->suspendable calls seems reasonable reinforces my belief that doing neither would be better :)

Now when a shared-suspend happens in a shared-suspendable function, it can just search for the first handler

If we use a "zero-cost" shared-barrier implementation where it acts like a handler rather than proactively setting a bit, then this search can find shared-barrier just as well.

The fact that disallowing suspendable->fixed calls seems reasonable and that separately disallowing fixed->suspendable calls seems reasonable reinforces my belief that doing neither would be better :)

It seems one or the other is needed, because the bad case is a call stack of the form

shared-suspendable (with handler) -> shared-fixed -> shared-suspendable (with suspend instruction)

We need to make sure one way or the other that the middle shared-fixed frame can't be captured in a shared continuation. It seems like the natural way to do this is to require that at least one of the transitions can only be done through a handler, instead of a regular call, so that attempts to do a shared-suspend can be caught. Currently I think restricting the shared-suspendable->shared-fixed direction makes more sense.
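
The bad case above can be reproduced with a deliberately naive Python model in which neither transition is guarded. The `(kind, has_shared_handler)` encoding and the function name are assumptions for illustration only.

```python
def naive_shared_capture(stack):
    """Capture every frame above the nearest shared handler, with no checks.

    stack is an outermost-first list of (kind, has_shared_handler) pairs.
    Returns the kinds of the captured frames, or None if there is no handler.
    """
    for i in range(len(stack) - 1, -1, -1):
        if stack[i][1]:
            return [kind for kind, _ in stack[i + 1:]]
    return None

# shared-suspendable (with handler) -> shared-fixed -> shared-suspendable (suspends)
bad_case = [
    ("shared-suspendable", True),
    ("shared-fixed", False),
    ("shared-suspendable", False),
]
captured = naive_shared_capture(bad_case)
# Frames that must never end up inside a shared continuation.
leaked = [kind for kind in captured if kind != "shared-suspendable"]
```

Here `leaked` contains the middle shared-fixed frame, which is exactly what a restriction on one of the two transitions has to rule out.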

If we use a "zero-cost" shared-barrier implementation where it acts like a handler rather than proactively setting a bit, then this search can find shared-barrier just as well.

If we expect shared-barrier to be implemented by implicitly turning all the calls in its body that cross the suspendable-fixed boundary into handlers, I think it would be better to require explicit handler instructions instead.

I had a chance to think about this some more. If we start out by assuming that every shared-barrier must be made explicit and that shared-suspendable and shared-fixed are separate types, then this is what we get:

- non-shared to shared-suspendable calls must be within a shared-barrier to prevent the non-shared frame from being captured.
- shared-fixed to shared-suspendable calls must be within a shared-barrier to prevent the shared-fixed frame from being captured. This requirement is mandatory; it is not enough to require shared-suspendable to shared-fixed calls to be in a shared-barrier instead because that does not ensure safety when a shared-fixed function calls a shared-suspendable function that initiates a shared suspension.
- shared-suspendable to shared-fixed calls do not need to be in a shared-barrier because shared-fixed functions cannot successfully initiate shared suspensions (why not? If due to validation, that would inhibit inlining shared-suspendable into shared-fixed, so preferably they would trap. But due to what shared-barrier would they trap?) and if they call back into a shared-suspendable function, the previous rule must apply so there is no need for a further shared-barrier.
- To wrap a shared-suspendable function as shared-fixed:
  - If the call is direct, a wrapper function can be used that sets up a shared-barrier and calls the shared-suspendable target.
  - Otherwise, func.bind or some similar mechanism is required to set up a shared-barrier and call the underlying bound function indirectly.
  - This is no more or less complicated than wrapping any shared function as non-shared, so interop between shared-fixed and shared-suspendable functions is just as limited as between non-shared and shared functions.
- Since shared-suspendable and shared-fixed functions cannot be mixed at indirect call sites (without func.bind or similar), each producer will have to exclusively use one or the other, meaning it would be impossible for a producer to support work-stealing and non-shared function parameters simultaneously. This seems bad, but if all non-shared JS objects are wrapped as shared thread-bound data, maybe it can be ok.
- shared-barrier can be implemented eagerly or lazily. Since it is always explicit, it can't affect performance implicitly.

On the other hand, if we make all the required shared-barriers implicit and do not distinguish between shared-fixed and shared-suspendable in the type system:

- We still need barriers on non-shared or shared-fixed to shared-suspendable calls to ensure safety, but now they are implicit.
- Trying to add the barriers to indirect call operators would require those operators to have different behavior depending on whether the callee was shared-suspendable or not. We do not allow instruction behavior to be context-dependent, so this is a non-starter.
- Instead, we can trivially ensure all non-shared/shared-fixed to shared-suspendable calls are inside shared-barriers by making the entire bodies of non-shared and shared-fixed functions implicitly be shared-barriers.
- This would require a lazy implementation of shared-barrier similar to exception handling to avoid paying a performance cost on every call to a non-shared or shared-fixed function. This seems ok.
- We would still support explicit shared-barrier for transient non-shared access in shared-suspendable functions and to support inlining shared-fixed functions into shared-suspendable functions. Inlining in the other direction would work because we would allow shared suspensions to be initiated in non-shared and shared-fixed functions; they would just unconditionally trap because they would necessarily be inside the implicit shared-barrier bodies of those functions.
- Since shared-suspendable and shared-fixed functions would be typed the same, they would be interchangeable at indirect call sites and toolchains could emit them both as necessary, supporting both non-shared parameters and work stealing in the same program.
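
One way to picture the implicit variant is the following Python model of a lazy stack walk, where the whole body of every non-shared or shared-fixed function acts as a shared-barrier. The frame encoding, the `"shared-resume"` marker and the `Trap` exception are assumptions made for this sketch, not proposed semantics.

```python
class Trap(Exception):
    """Stands in for a Wasm trap in this model."""

def shared_suspend_implicit(stack):
    """stack is an outermost-first list of (kind, entered_next_via) pairs.

    entered_next_via is "call", "shared-resume", or None for the innermost
    frame, which is the one initiating the shared suspension.
    Returns the captured frame kinds (outermost-first) or raises Trap.
    """
    captured = []
    for kind, entered_next_via in reversed(stack):
        if entered_next_via == "shared-resume":
            # The handler sits inside this frame's body, so the search
            # stops here before reaching any implicit barrier.
            return list(reversed(captured))
        if kind != "shared-suspendable":
            # The body of a non-shared or shared-fixed function is a barrier.
            raise Trap("shared suspension hit an implicit shared-barrier")
        captured.append(kind)
    raise Trap("no handler on the stack")
```

Nothing is checked when a function is entered; all of the work happens during the walk, which is the sense in which the barrier is lazy.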

Sorry for the wall of text. We should probably move on to a live discussion soon.

Just a few points to add:

explicit case

Note that instead of having a block-level shared barrier instruction, it's possible to instead have a call-level barrier instruction (essentially the resume_barrier instruction I sketched in the OP). I think all of the observations above translate directly to this alternative. My intuition says the call variant would be less controversial.

Also, as I sketched here one can instead restrict the shared-suspendable->shared-fixed direction, which may be more natural. Especially if the concern is "the direction that's restricted becomes hard to inline", it's more ok for shared-suspendable->shared-fixed to be slow as this direction is likely less performance-critical: because of the restrictions on shared-suspendable, one can't actually call any shared-fixed functions that really have nonshared parameters.

One other issue with restricting the shared-fixed->shared-suspendable direction: it may make calls from JS directly into shared-suspendable Wasm slow (morally, JS is also "fixed" so needs similar guards). At least with the shared-suspendable->shared-fixed direction, things only get slower if your code transitions from "fixed"->"suspendable"->"fixed", which we might consider less likely.

Since shared-suspendable and shared-fixed functions cannot be mixed at indirect call sites (without func.bind or similar), each producer will have to exclusively use one or the other, meaning it would be impossible for a producer to support work-stealing and non-shared function parameters simultaneously. This seems bad, but if all non-shared JS objects are wrapped as shared thread-bound data, maybe it can be ok.

It's hard for me to see how a producer could actually support work-stealing and non-shared function parameters simultaneously even in the most optimistic case. I'd bet that "shared-suspendableness" would infect almost every non-trivial function, unless there's a strict static partition at the source/language runtime level, in which case static annotations in Wasm are still ok. I'd even bet that this problem would happen in the implicit case (i.e. most calls would just start trapping if any clever partition were attempted).

EDIT: and I should emphasise again that this is why I still think we should push for thread-local functions. If we believe work-stealing is going to be real in the future, we're just kicking the can down the road until then, and complicating the language in the meantime.

implicit case

Instead, we can trivially ensure all non-shared/shared-fixed to shared-suspendable calls are inside shared-barriers by making the entire bodies of non-shared and shared-fixed functions implicitly be shared-barriers.

I'd like to understand more explicitly how you'd plan to distinguish nonshared from shared-fixed from shared-suspendable without type system extensions. I can imagine a semantics where nonshared, shared-fixed, and shared-suspendable are bits that live on the dynamic function instance, purely to enable a dynamic trapping semantics for shared-barrier (implicit or explicit). Instinctively this seems a little unfortunate to me, since the bit is very close to a type system extension, just by swapping the dynamic trapping semantics on shared-barrier for a static check.

I also don't have a clear view of how the dynamic check semantics avoids regressing every existing "fixed" function call. Morally it seems like inserting an extra try-catch into every function at the language level, which I wouldn't expect to be costless.

This would require a lazy implementation of shared-barrier similar to exception handling to avoid paying a performance cost on every call to a non-shared or shared-fixed function. This seems ok.

Can you expand on how this works currently for exception handling? This may be the piece I'm missing. I'd expect at least a penalty in compilation time and/or cache effects/branch prediction.

Actually, I realise there's an alternative design that may make more sense. Instead of allowing only shared-suspendable->shared-fixed calls, allow only shared-fixed->shared-suspendable calls. Now when a shared-suspend happens in a shared-suspendable function, it can just search for the first handler. If the handler is for a shared continuation, it's known that all the frames in between are shared-suspendable (because shared-resume can only happen on a shared-suspendable function). If the handler is for a nonshared continuation, trap. Calling from shared-suspendable to shared-fixed is allowed only through a nonshared-resume handler (which would cause shared-suspend in subsequent shared-suspendable frames to trap).

Yes, this is what I had originally envisioned. I had imagined that producers who wanted to use shared-continuations would choose the 'shared-suspendable' type for all of the functions they generate for source language functions, as all of their source language types are likely shared and so the strictest semantics are not an issue. For calling out to JS for local host functions, they would need to perform the barrier at those points.

@tlively

non-shared to shared-suspendable calls must be within a shared-barrier to prevent the non-shared frame from being captured.

That would require having a sequence of A: [shared-suspendable] -> [non-shared] -> B: [shared-suspendable] with a shared-continuation handler in A and a shared suspend in B. But because shared (of any kind) cannot call non-shared, this cannot happen.

Since shared-suspendable and shared-fixed functions cannot be mixed at indirect call sites (without func.bind or similar), each producer will have to exclusively use one or the other, meaning it would be impossible for a producer to support work-stealing and non-shared function parameters simultaneously. This seems bad, but if all non-shared JS objects are wrapped as shared thread-bound data, maybe it can be ok.

Agreed, for producers using shared-continuations, non-shared function parameters can't be used. As I sketched in #42, I believe that we could support a scheme where non-shared context locals and the shared-barrier can be used to access non-shared state inside shared continuations.

I also wonder if we could mix these functions at indirect call sites by having shared-suspendable <: shared-fixed. shared-suspendable has a proper subset of the runtime semantics of shared-fixed. When doing an indirect call to an unknown (either fixed/suspendable) function, a barrier might need to be done. But if the function type is known to be suspendable, the barrier could be avoided.
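
A tiny Python sketch of the subtyping idea, under the assumption that the only relation added is shared-suspendable <: shared-fixed. Both helper names are hypothetical, and `barrier_needed_for_indirect_call` only encodes the rule of thumb stated above.

```python
def is_subtype(sub, sup):
    """Reflexive subtyping plus the single rule shared-suspendable <: shared-fixed."""
    return sub == sup or (sub == "shared-suspendable" and sup == "shared-fixed")

def barrier_needed_for_indirect_call(static_callee_type):
    """A callee typed shared-fixed might be either kind at runtime, so the call
    site conservatively needs a barrier; a callee known to be
    shared-suspendable does not."""
    return static_callee_type != "shared-suspendable"
```

With this relation, a table typed as holding shared-fixed functions could also hold shared-suspendable ones, at the cost of a conservative barrier at call sites that only know the supertype.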

@conrad-watt

Can you expand on how this works currently for exception handling? This may be the piece I'm missing. I'd expect at least a penalty in compilation time and/or cache effects/branch prediction.

At least for SM, we implement catch lookup by walking the stack and performing metadata lookup based off of return addresses in stack frames to find which catch handler a call site was in when an exception happens. The advantage is that going into a try block is mostly free at runtime (catch blocks do add control flow edges to handle rejoining from exception paths which can inhibit some regalloc opts, but you can't avoid that). But it's pretty slow in the case that we do actually throw an exception.
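
For readers unfamiliar with this style of implementation, the following Python sketch shows the general shape of a table-driven handler lookup keyed on return addresses. It is not SpiderMonkey's code; the address ranges, landing-pad names and function names are all made up for illustration.

```python
import bisect

# Hypothetical side table built at compile time: sorted, non-overlapping
# (start, end, landing_pad) ranges of code covered by a try block.
TRY_RANGES = [(0x100, 0x140, "catch_A"), (0x200, 0x260, "catch_B")]
RANGE_STARTS = [start for start, _, _ in TRY_RANGES]

def find_handler(return_address):
    """Binary-search the side table for a try range containing the address."""
    i = bisect.bisect_right(RANGE_STARTS, return_address) - 1
    if i >= 0:
        start, end, landing_pad = TRY_RANGES[i]
        if start <= return_address < end:
            return landing_pad
    return None

def throw(return_addresses):
    """Walk frames innermost-first; return (landing_pad, frames_unwound) or None."""
    for depth, address in enumerate(return_addresses):
        landing_pad = find_handler(address)
        if landing_pad is not None:
            return landing_pad, depth
    return None
```

Entering a try block executes no extra instructions in this scheme because the table is static data; the cost is moved entirely to the throw path, which has to walk the stack and search the table for each frame.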

non-shared to shared-suspendable calls must be within a shared-barrier to prevent the non-shared frame from being captured.

That would require having a sequence of A: [shared-suspendable] -> [non-shared] -> B: [shared-suspendable] with a shared-continuation handler in A and a shared suspend in B. But because shared (of any kind) cannot call non-shared, this cannot happen.

The situation I have in mind is just [non-shared] -> [shared-suspendable], under the assumption that this kind of call is allowed by analogy to how non-shared functions are allowed to access other shared module items like tables and globals.

Hmm, I'm not sure I follow without seeing where the handler/suspend are in that situation. It also seems like this would be a problem even if we don't split up the function types (as it doesn't involve shared-fixed at all)?

The shared suspension is initiated in the shared-suspendable frame. I wasn't thinking that there would necessarily be a handler, but that we would still want to trap as soon as we find ourselves in a non-shared frame during the suspension. If you argue that that's unnecessary because there cannot possibly be a handler and we'll trap anyway, then the example as I was thinking of it doesn't work. I was assuming an invariant that the semantics should never have a stack walk for a shared suspension traverse a non-shared frame because that makes safety provable with more local reasoning.

I also wouldn't rule out [shared-suspendable] -> [non-shared] -> [shared-suspendable] via thread-bound or thread-local function machinery, although then you're back to the case where putting the barrier on either edge would work unless you're assuming the invariant I had in mind.

I also wouldn't rule out [shared-suspendable] -> [non-shared] -> [shared-suspendable] via thread-bound or thread-local function machinery, although then you're back to the case where putting the barrier on either edge would work unless you're assuming the invariant I had in mind.

That's interesting, I guess with thread-local functions in the proposal as-is we already could have a call stack shared-suspendable -> non-shared -> shared-suspendable and would need the thread-local function to act as the shared-barrier. So engines will need some feature like this under-the-hood either way? Host JS functions are similar, they just block all suspending.
