WebAssembly / shared-everything-threads

A draft proposal for spawning threads in WebAssembly

Context locals redux #66

Open conrad-watt opened 3 weeks ago

conrad-watt commented 3 weeks ago

My understanding is that @tlively and Google folks are currently experimenting with full-fat thread-local globals. In response to concerns about the implementation feasibility of this approach, @lukewagner and @eqrion came up with an alternative design for "context locals". After some further discussion on what these should look like, this issue is an attempt to present a refreshed design for the context locals feature, incorporating the iterations that happened in those discussions (e.g. https://github.com/WebAssembly/shared-everything-threads/issues/42). The sketch below assumes the "no capturing on suspension" variant, and is agnostic as to whether we have separate shared-suspendable and shared-nonsuspendable function types.

Background

To support useful compilation to shared functions, we need a mechanism for thread-local storage (to accurately compile source-level TLS), and a mechanism for JS interaction (since JS functions are nonshared, they can't be imported and called in the normal way inside shared functions).

Wasm-level thread-local globals solve the former problem, but require ambitious schemes for initialisation and garbage-collection. JS-API thread-local functions solve the latter problem (and can be used to simulate thread-local storage with a "get_thread_id" function), but pose similar garbage collection issues.
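As an illustration of the latter point, a user-space TLS scheme built over a thread-local "get_thread_id" function might look like the following JS sketch (all names and the map layout here are invented for illustration; threads are modelled as plain tokens):

```javascript
// Sketch: user-space TLS keyed by a host-provided thread id.
// get_thread_id stands in for a hypothetical thread-local host import.
let nextId = 0;
const threadIds = new WeakMap(); // per-"thread" token -> numeric id

function get_thread_id(threadToken) {
  if (!threadIds.has(threadToken)) threadIds.set(threadToken, nextId++);
  return threadIds.get(threadToken);
}

const tls = new Map(); // thread id -> that thread's storage object

function tlsFor(threadToken) {
  const id = get_thread_id(threadToken);
  if (!tls.has(id)) tls.set(id, {}); // lazily initialise this thread's slot
  return tls.get(id);
}

// Two "threads" (modelled as tokens) see independent storage:
const t1 = {}, t2 = {};
tlsFor(t1).x = 1;
tlsFor(t2).x = 2;
console.log(tlsFor(t1).x, tlsFor(t2).x); // 1 2
```

The garbage-collection issue mentioned above shows up here too: nothing ever removes a dead thread's entry from the map.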

Context locals

Context locals aim to provide a basic mechanism for solving both problems. Conceptually, context locals represent storage that is local to the current Wasm call stack. This is resonant with the way engines already fix a "current instance" when entering a Wasm call stack. If a Wasm call stack is suspended and resumed elsewhere (including in another thread), the context locals at the suspension point are not captured - instead the resumed continuation inherits the context locals of the resumption point (with a type check to ensure the shape matches).

These qualities mean that it is safe to put JS functions into context locals, and call them even from shared code. Context locals can also be used to implement thread-local storage, although some additional care must be taken when crossing JS boundaries or making use of shared continuations.

Brief instruction set sketch

Extend function types with a new kind of local declaration - a sequence of types representing that function's "context" (this could also be declared tag-style in an earlier section as in https://github.com/WebAssembly/shared-everything-threads/issues/42).

e.g.

(func $foo (param i32)
  (local i32) (context (ref $t1) (ref $t2))
)

This function declares a context of type (ref $t1) (ref $t2), made up of two context locals.

For simplicity's sake, we'll assume a separate instruction set for interacting with context locals rather than reusing the existing local.* instructions, but a combined scheme may be possible. These will be:

- context.get i
- context.set i
- context.call i

which work as expected. One note - in shared-suspendable functions we'll still need something like the shared-barrier mechanism to allow nonshared results of context.get etc. to be manipulated, but context.call would in principle be permissible even without such a barrier.

We also need a block instruction for switching to a new context - (context.switch t* ... end), or alternatively a call instruction that simultaneously switches contexts. This allows functions with mismatching contexts to call each other, and the cost of switching contexts is explicitly represented. A function which declares a context can only be called if its declared context is a subtype of (or for MVP, equivalent to?) the current context. Functions which do not declare a context can still be called from any other function. A context.cast block or call instruction for recovering a context subtype at runtime could be considered, but this would require contexts to preserve RTT information, which is an additional overhead.
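As a rough model of the call rule just described (illustrative only - this models contexts as arrays of type names and implements only the MVP "equivalent" check, not full subtyping):

```javascript
// Illustrative model: a call is allowed when the callee's declared context
// matches the caller's current context (MVP equality; real subtyping would
// be richer), and a callee with no declared context is always callable.
function contextsCompatible(current, declared) {
  return declared.length === current.length &&
         declared.every((t, i) => t === current[i]);
}

function checkCall(currentCtx, calleeCtx) {
  if (calleeCtx === null) return true; // no declared context: callable anywhere
  return contextsCompatible(currentCtx, calleeCtx);
}

console.log(checkCall(["$t1", "$t2"], ["$t1", "$t2"])); // true
console.log(checkCall(["$t1", "$t2"], ["$t1"]));        // false under MVP equality
console.log(checkCall(["$t1", "$t2"], null));           // true
```

A context.switch block would then be the one place where currentCtx is allowed to change shape, making the cost of doing so explicit.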

JS-API

When a function has a declared context, the context must be bound before a function can be called. Functions with contexts, when exported to JS, have an extra context_bind (bikeshed name) method to accomplish this, which takes the values to be bound as the function's context, and returns a Wasm function that appears to have no context. Shared functions with unbound contexts can be postMessage'd, but the context_bind method on such a function either always returns an unshared function, or alternatively only returns a shared function if all the context parameters are shareable. The intent is that if the context contains any JS function or object, it should be rebound separately in each Worker that wants to call the function.

The reasoning for this separate bind step is to facilitate the compile-time specialisation that V8 has indicated they want to lean heavily on for performance. Due to lazy compilation, when a bound function is called for the first time, relevant context.call instructions can be specialised to the known value of the provided JS function. Since this code is only entered through the bound function, deopt checks are only necessary at boundaries where the context may change (e.g. initial JS entrypoints, and context.switch instructions). Pleasantly, no deopt checks are needed when repeatedly calling an already bound function - only when attempting to call the same function in another instantiation/binding.

The idea is that 99% of the time (including in situations with JS->Wasm re-entrancy) you're just calling already-bound Wasm functions.

Implementation sketch

EDIT: These proposed implementations are not correct, due to issues if the instance is shared across threads. Reader beware!

Here are two possible approaches - the space for the context is allocated inline with the instance, or the context is a separate allocation referenced by the current instance.

inline

When compiling the module and allocating the instance, find the largest context declared across all functions of the module and allocate that much extra space in the instance. When entering a context (e.g. through a call or resumption), copy the relevant values into this space (guaranteed to be enough space for every possible context). This has the advantage of making context accesses fast, but the context locals must be recopied when there is a cross-instance call (although this can be a wholesale memcpy rather than a per-member iteration).
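A JS model of this inline scheme (with the caveat from the EDIT above that it breaks once instances are shared; all names are invented for the sketch):

```javascript
// Sketch of the "inline" scheme: reserve space for the largest context
// declared in the module inside each instance, and copy the active context
// wholesale on cross-instance calls (the "memcpy" mentioned above).
function maxContextSize(moduleFuncs) {
  return Math.max(0, ...moduleFuncs.map(f => f.context.length));
}

function makeInstance(moduleFuncs) {
  return { ctxSlots: new Array(maxContextSize(moduleFuncs)).fill(null) };
}

// Entering a context (call/resumption): copy values into the reserved space.
function enterContext(inst, values) {
  for (let i = 0; i < values.length; i++) inst.ctxSlots[i] = values[i];
}

// Cross-instance call: a wholesale copy rather than per-member iteration.
function crossInstanceCall(caller, callee, ctxLen) {
  for (let i = 0; i < ctxLen; i++) callee.ctxSlots[i] = caller.ctxSlots[i];
}

const m = [{ context: ["(ref $t1)", "(ref $t2)"] }, { context: ["(ref $t1)"] }];
const i1 = makeInstance(m), i2 = makeInstance(m);
enterContext(i1, [10, 20]);
crossInstanceCall(i1, i2, 2);
console.log(i2.ctxSlots); // [10, 20]
```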

separate allocation

Each instance has space only for a reference to the current context, which is a separate allocation. This has the advantage of not requiring a copy upon cross-instance call, but adds indirections to context access.

Example

(sorry if the syntax is slightly wrong or otherwise undercooked)

Calling console.log from shared code:

Wasm
(module $module...
  (export "foo" (func $foo))

  (func $foo shared-nonsuspendable (param i32)
    (local i32) (context (externref) (ref func [externref]->[]))
    (context.get 0)
    (context.call 1)
  )
)

JS
inst = await WebAssembly.instantiate($module);

inst.exports.foo(0); // not allowed, but can postMessage inst.exports.foo
foo_bound = inst.exports.foo.context_bind("hello world", console.log);
foo_bound(0); // prints "hello world" through console.log
tlively commented 3 weeks ago

To summarize the difference between thread-local globals and this design, it seems that thread-local globals could be lowered to context locals by moving all the thread-local globals of a module into the context for that module's declared functions, then inserting glue code to switch contexts on all cross-module calls where the caller and callee contexts differ.

A benefit of thread-local contexts is then that no switching is required on cross-module calls where the contexts do not differ. As an optimization, a thread-local global implementation could potentially elide the implicit context switch in the analogous case where the callee's thread-local globals are all imported from the caller or vice versa, although checking this condition would still be some amount of work.

The other benefit of thread-local contexts is that there are no shared-to-unshared edges involved, although this benefit will be moot if users end up needing strong thread-bound data references anyway.

The primary downside compared to thread-local globals is that thread-local contexts make cross-producer calls more difficult to produce because the thread-local contexts of each producer become part of their function signature ABI. Another downside is that these cross-producer calls remain expensive in the case that the separately produced modules are merged.

Does that all sound correct?

rossberg commented 3 weeks ago

The most fundamental disadvantage of context locals seems to be that they are inherently unmodular, as implementation details of local state leak into interfaces and moreover, IIUC this leakage is transitive, so inevitably requires whole-program knowledge to be able to funnel everything through to everywhere. Cross-producer calls are just a special case of this very general problem.

This is further elevated by the fact that there is no mechanism to abstract or parameterise over the details of the TLS (like e.g. a monad type would in functional programming with state passing). Would a module be able to ever change or extend its TLS without breaking all clients? I believe some such abstraction capability would be the bare minimum to make this approach scale, at least if we ever want Wasm to be able to express libraries.

The other big problem I see is that this interacts poorly (read: not at all) with stack switching. If a caller A1 with TLS calls B, which then suspends and perhaps gets resumed by A2, which has different TLS(*), how would one update the "current" contexts in the suspended call chain of B? This is the full problem of dynamic scoping surfacing, and how to make it compose correctly with other effects.

(*) Not just different values, but possibly different context shape.

(As an aside, can we avoid describing the problem in terms of JavaScript? Either it is generally relevant to host interaction, then it should be described as such, or it is not, then frankly it has no place in Wasm.)

conrad-watt commented 3 weeks ago

@tlively

... then inserting glue code to switch contexts on all cross-module calls where the caller and callee contexts differ.

In a whole-program/closed compilation scenario, all functions would have compatible context locals, so no switching would be needed. More generally, it would be possible to come up with an ABI for separate compilation using indirections like putting thread IDs and structs containing host functions inside context locals instead of direct TLS data and top-level host functions. It becomes a toolchain game - you get a more composable ABI (avoiding switches) by adding indirections, or you can accept the cost of switches in exchange for an ABI with fewer indirections in the TLS.

The other benefit of thread-local contexts is that there are no shared-to-unshared edges involved, although this benefit will be moot if users end up needing strong thread-bound data references anyway.

Thread-bound data references alone aren't enough for unshared host calls, which you also get from context locals. I agree that if we get both thread-bound data and functions (both with sufficient efficiency/inline-ability), the only remaining benefit of context locals is TLS analogous to thread-local globals.

The primary downside compared to thread-local globals is that thread-local contexts make cross-producer calls more difficult to produce because the thread-local contexts of each producer become part of their function signature ABI

Yes, although see above regarding my point about composable ABIs. My impression is that cross-producer calls already need very careful ABI coordination.

Another downside is that these cross-producer calls remain expensive in the case that the separately produced modules are merged.

If desired, you can avoid the need to switch contexts in the merged module by also merging the contexts by concatenating them and shifting relevant access indices. Think of this as like concatenating the globals of the modules. If there is a good enough ABI it would even be possible to de-duplicate.
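A sketch of that merge, with contexts modelled as arrays of type strings and the second module's context.get/context.set indices shifted by the first context's length (illustrative only):

```javascript
// Sketch: merge two modules' contexts by concatenation, shifting the second
// module's context-local access indices - analogous to concatenating globals.
function mergeContexts(ctxA, ctxB, accessesB) {
  const merged = ctxA.concat(ctxB);
  const shifted = accessesB.map(i => i + ctxA.length);
  return { merged, shifted };
}

const { merged, shifted } = mergeContexts(
  ["(ref $t1)", "(ref $t2)"], // module A's context
  ["externref"],              // module B's context
  [0]                         // B's `context.get 0` indices
);
console.log(merged);  // ["(ref $t1)", "(ref $t2)", "externref"]
console.log(shifted); // [2]
```

De-duplication would additionally map equivalent entries of ctxA and ctxB to a single merged slot.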


@rossberg

The most fundamental disadvantage of context locals seems to be that they are inherently unmodular, as implementation details of local state leak into interfaces and moreover, IIUC this leakage is transitive, so inevitably requires whole-program knowledge to be able to funnel everything through to everywhere. Cross-producer calls are just a special case of this very general problem.

I don't agree that they're unmodular - context locals can effectively be desugared into regular function arguments (although this would be far less efficient) plus some additional type system cleverness to make continuations work better.

This is further elevated by the fact that there is no mechanism to abstract or parameterise over the details of the TLS (like e.g. a monad type would in functional programming with state passing). Would a module be able to ever change or extend its TLS without breaking all clients?

The choice on whether/how to abstract TLS is made at the toolchain/ABI level. For example the context locals could just be used to hold a thread ID, with all TLS state managed in tables and memories (and thus a module could easily extend them without changing the shape of the context). Context locals are a very general feature.

The other big problem I see that this interacts poorly (read: not at all) with stack switching. If a caller A1 with TLS calls B, which then suspends and perhaps gets resumed by A2 which has different TLS(*), how would one update the "current" contexts in the suspended call chain of B?

One of the main motivations of this proposal is to allow unshared to better interact with (shared) stack switching. If you want to resume A1 (context 1) from A2 (incompatible context 2), you need to do a context switch to a context 1 shape at the moment of resumption. This means that the whole suspended call chain of B will see the shape you switch to, and if you return from the call chain of B back into A2, the context switch will fall out of scope.

I'd emphasise again though that the toolchain/ABI has a choice as to how compositional its contexts are. In the "normal" case, I'd expect A1 and A2 to have compatible contexts, but the capability of switching is there for general compositionality.

(As an aside, can we avoid describing the problem in terms of JavaScript? Either it is generally relevant to host interaction, then it should be described as such, or it is not, then frankly it has no place in Wasm.)

The motivation based on successful compilation of TLS is totally host-agnostic - it's purely about successfully preserving the semantics of source languages being compiled to Wasm. Thread-local globals are the alternative solution for this problem.

The fact that context locals also allow host interaction where the host can only provide unshared functions (and I'll unashamedly hold up JS as the key example) is an important additional motivation, but the TLS motivation stands on its own.

rossberg commented 3 weeks ago

@conrad-watt:

In a whole-program/closed compilation scenario, all functions would have compatible context locals, so no switching would be needed. More generally, it would be possible to come up with an ABI for separate compilation using indirections like putting thread IDs and structs containing host functions inside context locals instead of direct TLS data and top-level host functions. It becomes a toolchain game - you get a more composable ABI (avoiding switches) by adding indirections, or you can accept the cost of switches in exchange for an ABI with fewer indirections in the TLS.

To do that, though, we don't need much from Wasm. We can code up a thread-id-indexed map in user space and pass down a single reference (which ought to be fairly cheap). The only argument for making it primitive would be performance. But if we still need indirections through untyped maps, then is there a sufficient win?

The most fundamental disadvantage of context locals seems to be that they are inherently unmodular, as implementation details of local state leak into interfaces and moreover, IIUC this leakage is transitive, so inevitably requires whole-program knowledge to be able to funnel everything through to everywhere. Cross-producer calls are just a special case of this very general problem.

I don't agree that they're unmodular - context locals can effectively be desugared into regular function arguments (although this would be far less efficient) plus some additional type system cleverness to make continuations work better.

Well, sure, but that doesn't make it modular — emulating global state with function arguments isn't modular either.

This is further elevated by the fact that there is no mechanism to abstract or parameterise over the details of the TLS (like e.g. a monad type would in functional programming with state passing). Would a module be able to ever change or extend its TLS without breaking all clients?

The choice on whether/how to abstract TLS is made at the toolchain/ABI level. For example the context locals could just be used to hold a thread ID, with all TLS state managed in tables and memories (and thus a module could easily extend them without changing the shape of the context). Context locals are a very general feature.

That sounds like you are still assuming some form of whole-program compilation or whole-program linking. I don't see how any of this can work with separate compilation and regular, let alone dynamic, linking, except by using untyped maps for contexts, bypassing most of this feature.

The other big problem I see that this interacts poorly (read: not at all) with stack switching. If a caller A1 with TLS calls B, which then suspends and perhaps gets resumed by A2 which has different TLS(*), how would one update the "current" contexts in the suspended call chain of B?

One of the main motivations of this proposal is to allow unshared to better interact with (shared) stack switching. If you want to resume A1 (context 1) from A2 (incompatible context 2), you need to do a context switch to a context 1 shape at the moment of resumption. This means that the whole suspended call chain of B will see the shape you switch to, and if you return from the call chain of B back into A2, the context switch will fall out of scope.

Perhaps I don't understand how the checking is supposed to work. How would the language detect that a suspend/resume/switch switches to a continuation that (currently) expects a different context? Is the check static or dynamic? If the former, wouldn't context types have to bleed into continuation and function reference types everywhere? If it's dynamic, where is the information about the current context type of a stack stored? Does a context switch write the current type to that stack somewhere, for the stack switch to retrieve it and perform the check?

And I don't understand how the implementation can be tied to instances. When we have shared functions and instances, then there can be multiple functions originating from the same instance but running in different threads, active at the same time, such that each of them has to see a different copy of the TLS at the same time. AFAICS, TLS-style state has to be tied to stacks, not instances — for both threads and stack switching to work correctly.

Moreover, the OP implies that this feature avoids the problem of "ambitious schemes" for initialisation of TLS. But how? Where is the TLS context initialised for a new thread? Doesn't that require all thread creation points to know (and be able to access in user code) all TLS initialisers? How would this work with external thread creation, how with internal thread creation? How would this not introduce cyclic dependencies between modules in general?

Sorry if I'm being dense. :)

I'd emphasise again though that the toolchain/ABI has a choice as to how compositional its contexts are. In the "normal" case, I'd expect A1 and A2 to have compatible contexts, but the capability of switching is there for general compositionality.

I'm not convinced. AFAICS, the only real choice toolchains are given with this feature is between unmodular (whole/closed-program) or untyped (map lookup), and both is already possible without it.

conrad-watt commented 3 weeks ago

To do that, though, we don't need much from Wasm. We can code up a thread-id-indexed map in user space and pass down a single reference (which ought to be fairly cheap). The only argument for making it primitive would be performance. But if we still need indirections through untyped maps, then is there a sufficient win?

The thread ID itself still needs to be held by the thread somehow - without context locals you would need to either directly propagate the ID through execution as a function argument, support a thread-local host call (e.g. get-thread-id), or support thread-local globals.

Well, sure, but that doesn't make it modular — emulating global state with function arguments isn't modular either.

That sounds like you are still assuming some form of whole-program compilation or whole-program linking. I don't see how any of this can work with separate compilation and regular, let alone dynamic, linking, except by using untyped maps for contexts, bypassing most of this feature.

Maybe the missing piece here is the ability for a module to optionally bind (some?) context locals at export-time (restricted to const expressions only), so that the exported function doesn't need host intervention (through context_bind) to present a clean interface?

EDIT: the interaction with stack switching would require some thought, but I believe it would actually be ok for bound context locals to be captured in a continuation - the type of the resulting bound function just can't be shared-suspendable if any of the bound context locals are unshared.

I'm thinking of context locals as part of the interface of the module, like imports and function arguments. I agree that their use should be kept to a minimum, in the same way that global state via function arguments should be avoided as much as possible. However if we don't get thread-local globals and functions, unless we consider an additional feature like context locals, the only option we have left is a shared-nonsuspendable semantics where all thread-local data and unshared host interaction is inefficiently passed through function arguments (see bottom of this comment).

How would the language detect that a suspend/resume/switch switches to a continuation that (currently) expects a different context? Is the check static or dynamic? If the former, wouldn't context types have to bleed into continuation and function reference types everywhere?

Yes, the context locals (if declared and unbound) would be part of the function type, like function arguments. All checks are static.

Moreover, the OP implies that this feature avoids the problem of "ambitious schemes" for initialisation of TLS. But how? Where is the TLS context initialised for a new thread? Doesn't that require all thread creation points to know (and be able to access in user code) all TLS initialisers? How would this work with external thread creation, how with internal thread creation? How would this not introduce cyclic dependencies between modules in general?

Sorry, the "ambition" here is referring to issues we've discussed around supporting thread-local globals as Wasm language-features in the runtime. Smaller features like context locals push the engineering of this into user-space. The question of how to run TLS initialisers is really hard in general, both in the runtime and the userspace. If a module wants to do TLS in a totally transparent way for separate compilation purposes, every entry into the module's code (or alternatively every TLS access) needs to be guarded by a "has my TLS on this thread been initialised yet" check. It's just a question of whether this check is done by the runtime as part of Wasm's semantics, or in userspace. The Wasm runtime is essentially always forced into this worst-case scenario, but userspace Wasm code could make optimisations to this scheme based on ABI knowledge, toolchain coordination, etc.
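The user-space guard described here might look like the following JS sketch (thread ids, the initialiser, and the storage layout are all invented for illustration):

```javascript
// Sketch: the per-access "has my TLS on this thread been initialised yet"
// check, done in user space rather than by the runtime.
const tlsReady = new Set();  // thread ids whose TLS is initialised
const tlsData = new Map();   // thread id -> this module's TLS block
let initCount = 0;           // count initialiser runs, for illustration

function tlsAccess(threadId) {
  if (!tlsReady.has(threadId)) {           // the guard every access pays
    tlsData.set(threadId, { counter: 0 }); // run this module's TLS initialiser
    tlsReady.add(threadId);
    initCount++;
  }
  return tlsData.get(threadId);
}

tlsAccess(1).counter += 1;
tlsAccess(1).counter += 1; // second access on thread 1: guard is a cheap hit
tlsAccess(2).counter += 5;
console.log(initCount);            // 2: one initialisation per thread
console.log(tlsAccess(1).counter); // 2
```

A toolchain with whole-ABI knowledge could hoist the guard to module entry points, or drop it entirely where initialisation is known to have happened.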

I'm not convinced. AFAICS, the only real choice toolchains are given with this feature is between unmodular (whole/closed-program) or untyped (map lookup), and both is already possible without it.

With the features we have today, the only way to do TLS and unshared host access is to thread everything through as function arguments, and even this doesn't work if we want to do work-stealing with shared-suspendable. Thread-local globals and functions probably give the cleanest abstract interface, but haven't received much support from engines (e.g. because of the initialiser issue just above). Context locals require slightly more wrangling in userspace, but they'd be much easier for engines to implement efficiently (with optimisations like inlining). So we seem to have these options:

- thread everything through function arguments (shared-nonsuspendable only)
- thread-local globals and functions (cleanest interface, hard for engines)
- context locals (more userspace wrangling, easier for engines)

eqrion commented 3 weeks ago

@rossberg

That sounds like you are still assuming some form of whole-program compilation or whole-program linking. I don't see how any of this can work with separate compilation and regular, let alone dynamic, linking, except by using untyped maps for contexts, bypassing most of this feature

To try and make this concrete, I've been imagining Emscripten C++ could have a single ABI mandated context of:

(context (i32 $shadowStackPointer) (i32 $tlsBase) (externref $webUnsharedFunctions))

The first field is very hot and also mutable (which makes it not a good candidate for passing by param). The second field would be a pointer into linear memory where the current C++ source module's TLS data can be found. The third field would be an untyped JS map containing all the unshared web functions.

All C++ functions can agree on these parameters, and the hottest values are typed, while the least hot value (the odd case of web unshared functions) is untyped. You should be able to separately compile C++ functions to use this type and link them together.

As part of this, we are pushing the responsibility for updating the context locals into user space. Context locals are just providing the fast mutable storage that's scoped to a call stack. So the tlsBase would need to be updated when one C++ source module calls into another, but not necessarily the shadowStackPointer or webUnsharedFunctions. This is identical to the work the engine would have to do if we had TLS globals in the spec, but the engine knows less than the toolchain IMO, and so we'd do a poorer job at it.
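For illustration, the cross-source-module update under this hypothetical ABI touches only $tlsBase, with the other two fields carried over (field names follow the context sketched above; the values are made up):

```javascript
// Sketch: user-space context update when one C++ source module calls into
// another - only the per-module TLS base changes.
function crossModuleSwitch(ctx, calleeTlsBase) {
  return {
    shadowStackPointer: ctx.shadowStackPointer,     // unchanged, hot, mutable
    tlsBase: calleeTlsBase,                         // callee's TLS base
    webUnsharedFunctions: ctx.webUnsharedFunctions, // unchanged untyped map
  };
}

const ctxA = { shadowStackPointer: 4096, tlsBase: 1000, webUnsharedFunctions: {} };
const ctxB = crossModuleSwitch(ctxA, 2000);
console.log(ctxB.tlsBase, ctxB.shadowStackPointer); // 2000 4096
```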

conrad-watt commented 3 weeks ago

(externref $webUnsharedFunctions)

One small note, this may instead need to be something like a ref.func $webUnsharedFunctionsEntrypoint in order to be callable from shared Wasm (EDIT: or an entrypoint function plus a separate map).

lukewagner commented 3 weeks ago

I think the main argument in favor of thread-local globals is that it avoids changing the toolchains' existing core wasm ABI. If thread-local-globals were free and didn't have the highly non-trivial implementation implications we've been discussing, I'd be in favor of it. But if there is a lower-level mechanism that is less magic with a more predictable cost model, I think we should do that (it is the wasm way) and context-locals seem to fit the bill.

There's also a vaguely anti-modular aspect of thread-local globals that I think is concerning: if I'm calling a module that I want to treat as a black box, the identity of the thread I'm calling on really shouldn't matter -- I should be able to call that module on any thread I want. But if the engine is implicitly creating mutable storage locations for me at the boundary that are tied to the identity of the calling thread, the caller's thread identity now matters in a way that really feels like it's breaking some sort of encapsulation boundary (or parametricity property) that you'd naturally expect -- it's like it's an implicit function argument that you can't avoid passing. It also means that implicitly-created thread-local storage locations have an ambiguous lifetime without a good point to call a destructor (in fixed-thread-pool scenarios, this might seem fine, but once you have a thread.spawn, it seems like a real problem).

Considering these problems in a cross-component setting (where we're intentionally aiming to be cross-ABI, cross-language, with black-box reuse), the right answer seems to be to treat each cross-component call into core wasm as-if it was on a fresh thread (regardless of the caller's actual thread identity), so that TLS never gets reused and can always be eagerly destroyed -- anything else leads to leaks or requires ad hoc gross hacks. I know this is a component-level argument so maybe it doesn't directly apply at the core wasm level but, IIRC, it sounded earlier like @tlively came to a similar realization in a totally different context; I'd be interested to hear more about that.

rossberg commented 3 weeks ago

Yes, I'm totally on board with avoiding the problems of "true" thread-local globals. It appears we are all on the same page that the alternative is some form of dynamic scoping. However, my impression is that context locals are an attempt to provide dynamic scoping "cheaply", cutting so many corners that the result does not interact correctly with other features, while also leaking heavily into types and interfaces, and hence would only work in narrow cases.

In particular, I'm still puzzled how tying context locals to instances can behave correctly with threading or stack switching. I'm pretty sure it can't. But if they are connected to stacks instead, then the mechanism is more closely related to what @tlively presented to the stacks group for dynamic scoping a while ago, with all its implications. I doubt that we'll still get much benefit out of declaring context locals at that point.

conrad-watt commented 3 weeks ago

In particular, I'm still puzzled how tying context locals to instances can behave correctly with threading or stack switching. I'm pretty sure it can't.

Oh, this is a good point. My implementation sketches above don't work if the instance is shared across threads, because everything just gets clobbered.

@lukewagner @eqrion did you have an implementation scheme in mind that I've not correctly reproduced here? Would we need a second reserved slot for the context that's separate from the instance? This wouldn't necessarily regress existing code, since existing functions without contexts wouldn't need this slot.

EDIT: @tlively @rossberg do you have a link to the previous dynamic scoping presentation, or a brief description?

lukewagner commented 3 weeks ago

Agreed that neither the context (nor a pointer to the context) can be stored in a shared instance. Because it's fixed size, the context can be stored at the point it's created on the stack (as part of the trampoline that enters wasm from the host or in the stack frame of wasm code that performs context.switch) or in host memory (e.g., for the context created by the JS API for the context_bind operation mentioned above). Either way, a pointer to this context is then threaded through all nested calls. Of course, special care has to be taken when a stack is suspended on one thread and resumed on another. I think the way this works (in O(1)) is that when you resume a stack and pass in the context, the incoming context pointer is propagated as an implicit return value to callers on that stack (kinda like how changes to the heap-base-pointer are propagated when you pin it to a register). The key here is that the context-pointer is threaded into and out of each call, return, suspend and resume -- it's never kept "live" on the stack across any of these.
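In miniature, that threading discipline might be modelled like this in JS, with each call returning a (result, context) pair so the caller always reloads the context after the call rather than keeping it live across it (purely illustrative; the numbers are arbitrary):

```javascript
// Sketch: the context pointer is threaded into AND out of every call, so a
// context change inside the callee (e.g. via a resume) is visible afterwards.
function callee(ctx) {
  // Suppose something inside the callee changed the current context
  // (e.g. a resume on this stack propagated a new context outward):
  const newCtx = { ...ctx, ptr: ctx.ptr + 1 };
  return [42, newCtx]; // result plus the possibly-updated context
}

function caller(ctx) {
  let result;
  [result, ctx] = callee(ctx);    // reload ctx from the call's "return value"
  return [result + ctx.ptr, ctx]; // the caller sees the threaded-out context
}

const [r, finalCtx] = caller({ ptr: 100 });
console.log(r, finalCtx.ptr); // 143 101
```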

There is a tradeoff engines have to make, though, between threading the context pointer through calls as an ordinary (implicit) parameter and pinning it to a dedicated register.

The former has less register pressure but more indirection. There are also hybrids that avoid pinning registers inside function bodies (letting the register allocator do its thing instead). Roughly the same tradeoff exists for the thread-local globals, I should add -- I think this aspect of the two designs is the same.

rossberg commented 3 weeks ago

The key here is that the context-pointer is threaded into and out of each call, return, suspend and resume -- it's never kept "live" on the stack across any of these.

Oh, so you also meant to return the current context everywhere. That makes more sense, but how would that work with context.switch? That would have to back up the previous context and restore it upon exit. Where else would it save it but in the stack? And how would it get updated there when the outer context is changed in the meantime, e.g. through a suspend/resume?

Concretely, consider the following pseudo code:

   func f() =
      print x
      context.switch (x := 2)
         print x
         suspend
         print x
      end
      print x

   func start() =
      context.switch (x := 1)
         c := resume (cont.new f)
      end
      context.switch (x := 3)
         resume c
      end

With a correct implementation of dynamic scoping this has to print 1,2,2,3. How can that be implemented by just threading a single context pointer?

conrad-watt commented 3 weeks ago

That makes more sense, but how would that work with context.switch? That would have to back up the previous context and restore it upon exit. Where else would it save it but in the stack? And how would it get updated there when the outer context is changed in the meantime, e.g. through a suspend/resume?

One thing I didn't accurately reproduce from @eqrion's original pitch is that the exit point of context.switch should explicitly pop values from the results of its body in order to restore the context. I think this is especially important in the shared case, to avoid unshared references getting smuggled from another thread. However, I don't think the issue of EH unwinding "skipping" this step has been fully worked through - maybe the restoring/popping of values for the context would need to happen in a separate finally block attached to the context.switch block. One advantage of this is that the finally block could be considered to have barriers making it shared-nonsuspendable, so it would be possible to manipulate nonshared values of the current context (it's hard to do this in the suffix of the main block because of scoping issues).
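A minimal sketch of this "explicit restore on exit" reading (a Python helper standing in for the block instruction; shapes are modeled as field-name sets, and everything here is illustrative):

```python
ctx = None   # the current context

def context_switch_block(inner_ctx, body):
    """Run `body` under `inner_ctx`. The body's results (not implicitly
    saved old values) are popped to repopulate the outer context, which
    must match the outer shape."""
    global ctx
    outer_shape = tuple(sorted(ctx))
    ctx = inner_ctx
    restored = body()         # body explicitly supplies fresh outer values
    assert tuple(sorted(restored)) == outer_shape, "context shape mismatch"
    ctx = restored

ctx = {"x": 1}

def body():
    assert ctx == {"y": 2}    # inner code sees the switched context
    return {"x": 5}           # values explicitly supplied for the restore

context_switch_block({"y": 2}, body)
# ctx is now {"x": 5}; the old value 1 was never captured by the engine
```

Because the engine never stores the outer values, nothing unshared can be smuggled across a cross-thread suspension inside the block; the cost is that user code must be able to reproduce (or re-allocate) the outer values at exit.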

rossberg commented 3 weeks ago

@conrad-watt, I see, but that seems odd. Then in my example, f would have to save and restore x in user space, which would produce 1,2,2,1 (or 1,2,3,1?), clearly not what I would expect or need. That is, context.switch is no longer a scoped binding construct but simply two consecutive (bulk) assignments to the context. I don't see the point of making it a block instruction and pretending otherwise in that case. It would no longer be dynamic scoping but just thread-local state minus the initialisation magic. Perhaps that's fine, but then the instruction set should be different.

Not sure I follow how this helps with unshared edges, since the user saving the context in locals induces the same edges as the engine doing it internally would?

I agree that the interaction with exceptions is also an issue, either for the engine or for the user.

Regarding type checking, either way, this all essentially requires an effect type system (or technically, rather a co-effect system, but that is largely interchangeable): every instruction, function, and continuation type would have to be annotated with its context assumptions. That is quite a heavyweight and intrusive change. And even with the fairly small context that @eqrion sketched above it would substantially increase the size of all function types, unless there is a way to factor out the context definition, e.g., with a new form of type definition that can be referenced from different function types.

But when the context has to be put on all function types anyways, there is no need to have a new form of context local declaration. Instead, the body's initial context would be determined by the function's own type, like with parameters.

conrad-watt commented 3 weeks ago

But when the context has to be put on all function types anyways, there is no need to have a new form of context local declaration. Instead, the body's initial context would be determined by the function's own type, like with parameters.

I think I described this in a messy way in the OP, but this is what I was envisaging - the context (if declared) is actually a new component of the function type (the "declaration" becomes part of the type, like the "declaration" of parameters in the text format).

With this explanation, does the block version of context.switch make more sense? For typing purposes, the block construct seems the natural way to change the "shape" of the current context, not just the values of its fields.

EDIT:

Not sure I follow how this helps with unshared edges, since the user saving the context in locals induces the same edges as the engine doing it internally would?

The user is simply prevented from saving unshared portions of the context in locals in this scenario as it would be a type error. If the user needs to restore unshared values in the context after a switch, they have to pull them out of the inner context or newly allocate them (consider an ABI with an agreed JS entrypoint as @eqrion sketched above).

rossberg commented 3 weeks ago

With this explanation, does the block version of context.switch make more sense?

No, that is unrelated. If it doesn't preserve the binding structure but just destructively assigns, then it shouldn't be a block but simply an assignment. The shape of the context could be changed regardless, that's just strong update, and effect typing could handle that just fine. We'd already need to enforce that the shape of the context is the same for all join points of branches, so getting rid of the block doesn't change much. The logic would be similar to how we handle uninitialised locals.

The user is simply prevented from saving unshared portions of the context in locals in this scenario as it would be a type error.

Why can't that likewise be a type error if saved implicitly?

conrad-watt commented 3 weeks ago

If it doesn't preserve the binding structure but just destructively assigns, then it shouldn't be a block but simply an assignment.

How is this made safe in the presence of exceptions? With the block approach I can see how a finally component would allow things to be patched up. I don't think context.switch block entry is totally destructive - we want the structure of the outer context to be preserved and restored - in the sense of field types and order. It's the values in those fields that need to be provided again in userspace, since it might not be safe for the engine to capture them across a stack switch.

EDIT: another way of thinking about this might be to have a "call with different context + finaliser" instruction, rather than a block-level switch instruction.

rossberg commented 3 weeks ago

If it's just a regular assignment, then it doesn't even need any additional feature, because it could be handled by encoding a regular finally as usual with try-catch_all-rethrow in user space, like elsewhere. If, on the other hand, it is proper dynamic scoping, then the engine would implement it internally. It is only with this weird block construct that it is a problem, I think.

rossberg commented 3 weeks ago

Given the substantial and cross-cutting complexity of introducing an entire effect system just for this, couldn't we make context access dynamically checked? That would be very straightforward by reusing tags: you'd define the shape of a context as a tag, and there would be 3 simple instructions to manipulate it: one to install a context with a given tag, plus tag-checked get and set of its fields.

That's it, no extension to the type system or validation, no new declaration, no block structure, completely orthogonal to the rest of the language. The only cost is one check per get/set (comparing the expected against the current tag), which is well within the bounds of what we have been willing to accept in other places.
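A sketch of what those dynamically checked semantics could look like (the instruction names `context_switch`/`context_get`/`context_set` are invented for illustration, not from the comment above):

```python
class Tag:
    """Stand-in for a wasm tag defining a context shape (here: field count)."""
    def __init__(self, arity):
        self.arity = arity

current = None   # the current context: a (tag, values) pair

def context_switch(tag, *values):
    """Install a fresh context of shape `tag` (no check needed)."""
    global current
    assert len(values) == tag.arity
    current = (tag, list(values))

def context_get(tag, i):
    """Read field i, trapping if the current tag isn't the expected one."""
    t, vals = current
    if t is not tag:
        raise RuntimeError("context tag mismatch")   # a trap in wasm
    return vals[i]

def context_set(tag, i, v):
    """Write field i, with the same dynamic tag check."""
    t, vals = current
    if t is not tag:
        raise RuntimeError("context tag mismatch")
    vals[i] = v

tls = Tag(1)
context_switch(tls, 42)
```

The per-access cost is a single identity comparison of the expected tag against the current one, which is the check being argued about below.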

conrad-watt commented 3 weeks ago

If it's just a regular assignment, then it doesn't even need any additional feature, because it could be handled by encoding a regular finally as usual with try-catch_all-rethrow in user space, like elsewhere. If, on the other hand, it is proper dynamic scoping, then the engine would implement it internally. It is only with this weird block construct that it is a problem, I think.

That seems to only work if we add a full inter-block effect type system for this feature (with join annotations and so on, given our previous decisions). The switch to the new context and the later restoring of the old context need to be tightly paired (since they correspond to crossing an ABI/compilation unit boundary), so I still think a structured block or call-level instruction is a feasible solution and far less disruptive. Maybe the call-level instruction is less objectionable?

The only cost is one check per get/set (comparing the expected against the current tag), which is well within the bounds of what we have been willing to accept in other places.

This might be an acceptable solution, but it's obviously preferable to avoid checks if we can get away with it!

rossberg commented 3 weeks ago

Honestly, I think static checking is gonna be a rabbit hole and its complexity an order of magnitude too high for it to be justified for a corner-case feature like TLS. Especially without measurements, given that the dynamic check is quite optimisable: it can easily be hoisted and subsumed in straightline code. It's only necessary before the first context access in a function, after a suspend/resume, and after a call to a target that might have switched it (which in practice typically means only after calls to imports or indirect calls). So it should be really cheap.

conrad-watt commented 3 weeks ago

I've been thinking more about the dynamic tag idea. I'm coming around to it :)

At the very least it's something that seems quick to prototype and there are clear levers that can be pulled to determine how bad the overhead of the dynamic check is (similar to our cast benchmarking in GC). I think we would still want a context.call as well, to smooth over the experience of calling unshared functions from shared-suspendable code.

What do others think?

eqrion commented 3 weeks ago

@rossberg

Honestly, I think static checking is gonna be a rabbit hole and its complexity an order of magnitude too high for it to be justified for a corner-case feature like TLS. Especially without measurements, given that the dynamic check is quite optimisable: it can easily be hoisted and subsumed in straightline code. It's only necessary before the first context access in a function, after a suspend/resume, and after a call to a target that might have switched it (which in practice typically means only after calls to imports or indirect calls). So it should be really cheap.

No comment on whether the static checking is feasible or not, I'm thinking over the back and forth you had with Conrad on it.

The main motivation for the static checking is for the very hot 'shadow stack pointer' in linear memory languages. Mutable access to that happens very frequently, and the dynamic cast behavior would be really unfortunate if we could avoid it. For other thread/task/realm local things, I could believe that some amount of dynamic checking is acceptable.

The other major motivation for a 'context' feature in the VM was a solution to the problem of how to invoke unshared functions from a shared continuation (unshared continuations could use params to thread all unshared state). This worked by letting the context hold unshared values, but only accessed within some sort of barrier to prevent the unshared values from leaking, then inheriting a new context (with unshared values from that thread) when resumed on another thread.

For problem (1), it seems like we could split that off and ask toolchains to change their ABI and thread through their own 'context' linear memory pointer as their first parameter. That would point to the shadow stack pointer (and any other state they want). This would have a cost for the extra param, but if it subsumes the VM passing an implicit context param it might be net neutral. The static-typed context would be nicer, but doing it in user space could work.
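The user-space version of (1) might look like the following sketch, where the toolchain's own "context" record (holding the shadow stack pointer) is passed as the first parameter of every ABI function; the class name, field, and frame size are invented for illustration:

```python
class ToolchainCtx:
    """User-space 'context': first parameter of every ABI function."""
    def __init__(self, shadow_sp):
        self.shadow_sp = shadow_sp   # shadow stack pointer into linear memory

def callee(tc):
    tc.shadow_sp -= 16        # allocate a 16-byte shadow-stack frame
    try:
        return tc.shadow_sp   # ...body would spill address-taken locals here
    finally:
        tc.shadow_sp += 16    # pop the frame on every exit path

def caller(tc):
    return callee(tc)         # tc threaded explicitly, costing one param

tc = ToolchainCtx(shadow_sp=0x1000)
frame = caller(tc)
```

This is exactly the "extra param" cost mentioned above: every call site pays one additional argument, in exchange for needing no VM-level context at all.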

For problem (2), the dynamic context feature that Andreas sketched could solve this problem as well, and I'm less worried about the cost of dynamic checking in this case. If we're inherently doing a dispatch to some thread-local JS/host function, I don't think there is a way to avoid some dynamic checking cost.

So, maybe VM dynamic context + user-mode static context better solves this use-case?

lukewagner commented 2 weeks ago

Just commenting on Andreas's code example above, with the semantics I had in mind, the output would be 1,2,3,3 because there is just a single "current context" threaded through all call/suspend/resume/return control flow. I think that means we don't want to think of context.switch as block-structured or in terms of dynamic-scoping at all. From an "as low-level as possible but no lower" perspective, you could say that what we're exposing is the fact that a normal wasm engine has to thread some sort of "execution context" pointer through all machine code anyways, and so we're allowing guest wasm code to take advantage of this already-maintained pointer instead of having to maintain its own manually with i32 params/results.
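Under these semantics, Andreas's example can be simulated directly (Python generators standing in for one-shot continuations, and a module-level variable for the single current-context cell; this is an illustrative sketch, not proposal text):

```python
out = []
ctx = {}          # the single "current context"; never saved or restored

def f():
    out.append(ctx["x"])   # 1: sees the context installed by start
    ctx["x"] = 2           # context.switch (x := 2) as destructive update
    out.append(ctx["x"])   # 2
    yield                  # suspend: the context is NOT captured
    out.append(ctx["x"])   # 3: inherits the context at the resume point
    out.append(ctx["x"])   # 3: nothing is restored at the "end"

def start():
    global ctx
    ctx = {"x": 1}         # context.switch (x := 1)
    c = f()
    next(c)                # resume (cont.new f); runs until the suspend
    ctx = {"x": 3}         # context.switch (x := 3)
    for _ in c:            # resume c; runs to completion
        pass

start()
# out == [1, 2, 3, 3]
```

The divergence from the dynamically scoped reading (1,2,2,3) is entirely in the `yield`: nothing is saved at suspension, so the resumed code observes whatever context the resumer last installed.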

rossberg commented 2 weeks ago

@lukewagner, given that the instruction is supposed to be able to switch to a context of different shape (and back), that interpretation would be unsound. But I think we're on the same page that its block structure is inappropriate if the intended semantics and implementation is destructive update.

conrad-watt commented 2 weeks ago

I'm still surprised by the resistance to a block instruction (or alternatively, I should emphasise again, a callsite-scoped switch). When we were first considering the problem of non-nullable locals, our first-pass solution was let, a block-structured approach which was unviable only because of engine constraints on allocation. And even after this, we (eventually) fixed on a solution that didn't require block-level join annotations.

FWIW I would interpret the code above as printing 1,2,3,X (with a caveat about the X below). The required shape of the context for the continuation of f stored in c would be part of the suspend tag and therefore part of the type of the context, so even if the context.switch (x := 2) changed the shape of the context, the final resume c could only succeed if the current context of start is the right shape.

So for example:

   func f() (context x) =
      print x
      context.switch (y := 2)
         print y
         suspend // tag here must have (context y) or subtype
         print y
         (i32.const X) // the exit point of the switch restores the
                       // context to (context x), requiring an explicit value for x
      end
      print x

   func start() (context ) =
      context.switch (x := 1)
         c := resume (cont.new f)
      end
      context.switch (y := 3)
         resume c
      end

Is well-typed and prints 1,2,3,X, but the final resume would be a type error if the line immediately above was context.switch (x := 3) instead.

rossberg commented 2 weeks ago

@conrad-watt, I don't follow. How can the last x possibly produce 3 when x was never even set to that value? Is there a typo somewhere?

conrad-watt commented 2 weeks ago

Oh, sorry - I was focussing on the third print and just autopiloted everything else. I've edited my comment to correct my example and explicitly add the context type for f.

rossberg commented 2 weeks ago

How is that clearer than or in any other way preferable to:

   func f() (context x) =
      print x
      context.switch (y := 2)
      print y
      suspend // tag here must have (context y) or subtype
      print y
      context.switch (x := X)
      print x

   func start() (context ) =
      context.switch (x := 1)
      c := resume (cont.new f)
      context.switch (y := 3)
      resume c
conrad-watt commented 2 weeks ago

The latter looks simple because we're dealing with straight line code, but as soon as we involve blocks, loops, and exceptions, explicit join annotations for contexts are needed. We can't even get away with the "forgetful" intra-block semantics of non-nullable locals - the annotations have to appear everywhere for soundness. Since the only purpose of context switching is to facilitate a call to a module with an incompatible ABI, this seems overly disruptive to the language.

If everyone is on board with going all the way to join annotations, I guess I wouldn't be totally against the idea.

conrad-watt commented 2 weeks ago

Actually, I guess we could have an intra-block semantics, if the type system forces user code to restore the previous context before each block exit. Is this something you were considering?

rossberg commented 2 weeks ago

Well, an all-out (co-)effect system is already overly disruptive to the language, this would almost be a minor detail at that point. (Conceptually, those context annotations on blocks would even come for free, since block types are function types, which would already need to be enriched with contexts.)

Btw, in terms of static typing, how would validation know at the first resume that the continuation suspended in a different context than it was entered with? I don't think this can be made statically safe in general without introducing something like session types for functions (that track the context assumption at each suspend). I think the tag-based, dynamically checked variant I suggested above is the only realistic option to transport & verify that information. And then the whole question of block typing is moot anyway. Aw, strike that, I see that you assume that suspend tags are also annotated.

rossberg commented 2 weeks ago

Yes, the block nesting is something I considered. It's no more restrictive than the syntactic nesting of a block-like context.switch. But it also seems like an unnecessary restriction.

rossberg commented 2 weeks ago

me:

Aw, strike that, I see that you assume that suspend tags are also annotated.

Oh, but then we'd definitely need block annotations anyway to make the example expressible, otherwise the jump to the corresponding handler at the resume site could not be typed flexibly enough.

As I said, it's a rabbit hole.

conrad-watt commented 2 weeks ago

Oh, but then we'd definitely need block annotations anyway to make the example expressible, otherwise the jump to the corresponding handler at the resume site could not be typed flexibly enough.

Ah, that's a good point. I guess this means I agree the block switch is wrong, and we would want to go with a full inter-block effect system for the shape of the context (if we went the typed route).

conrad-watt commented 2 weeks ago

So I've been talking through the implications of the full effect system rabbit hole with @titzer, and it really does seem quite scary - in particular the interaction with "legacy" functions that don't have a context declared.

Think about typing a call from a function with a declared context, to a function without a declared context. How do you ensure that, when the called function returns, the context is in the right shape? It seems like you need to either interpret "no declared context" in some kind of ambitious polymorphic way, or alternatively prevent functions with no context from calling switch or any function with a declared context (miserable for compositionality).

This leads me back to the block switch approach. This works better with "no context declared" functions as it allows saving and restoring the parent context without needing to explicitly give it a type (which might need the polymorphism/type variables alluded to above). Its disadvantage, as alluded to here, is its extremely inflexible typing - especially in relation to handlers. Essentially the only way to make it work would be to enclose the whole resume + handler blocks in the switch block, which effectively enforces that the shape of the context at resume point and every suspend point must be identical. I could argue that this restriction is actually reasonable, since conceptually the context is meant to represent a fairly fixed ABI. It does feel totally uncompositional, though.

I think this leaves me hoping that we can find a way to reasonably implement thread-local globals and functions. @tlively mentioned that he would try to gather some V8 feedback, which I'm now crossing my fingers over. If this doesn't work out, I think my second inclination would be towards the dynamic approach @rossberg sketched here, although I'd be worried that engines might heavily lean on speculative optimisations to facilitate fast accesses/inlining.

EDIT: if anyone feels that the block switch approach is reasonable in light of the above, I'd also be interested in talking through this, but it does feel restrictive in comparison to design decisions we've made elsewhere.

lukewagner commented 2 weeks ago

I'm probably missing something but: if a function with a declared context calls a function without a declared context, isn't the latter equivalent to a function with a declared-empty context? Thus, before the call, the caller has to switch to an empty context and then on return, the context is known to be empty and so the caller has to switch back to their original context. Amending what I said earlier, I see how this needs a block instruction for switching but, importantly, I think the semantics of this block isn't dynamic scoping or algebraic effects at all; it's typing an implicit function parameter and result -- it's nothing you couldn't polyfill by threading a ref to an equivalent GC struct through all function params+results and continuation suspend+resumes.

You also mention "extremely inflexible typing"; how does this problem arise in practice? My assumption here has been that, because the toolchain ABI is going to mostly fix a single context for all functions, the only switches that need to happen will be at coarse-grained boundaries between code compiled with separate ABIs and thus we don't need to do any fancy polymorphic things like you might need to do with a source-language-level effect system. (That being said, if necessary, I could imagine that we could allow prefix-subtyping of contexts and then a dynamic block-scoped downcast with the same semantics as-if it were a GC struct ref.)

conrad-watt commented 2 weeks ago

Amending what I said earlier, I see how this needs a block instruction for switching but, importantly, I think the semantics of this block isn't dynamic scoping or algebraic effects at all; it's typing an implicit function parameter and result -- it's nothing you couldn't polyfill by threading a ref to an equivalent GC struct through all function params+results and continuation suspend+resumes.

Yes, with a block switch instruction (and therefore "implicit" storage of the parent context) you avoid a lot of the typing issues of the straight-line switch. I think the polyfill view you're sketching here fits this.

You also mention "extremely inflexible typing"; how does this problem arise in practice? My assumption here has been that, because the toolchain ABI is going to mostly fix a single context for all functions, the only switches that need to happen will be at coarse-grained boundaries between code compiled with separate ABIs and thus we don't need to do any fancy polymorphic things like you might need to do with a source-language-level effect system

I personally agree with this perspective, but it certainly feels like a lower level of compositionality than we've previously (collectively) accepted in Wasm features. As one concrete example, as a natural consequence of the typing rules you can't even br from code inside a switch block to a label outside if the contexts don't match - a similar compositionality restriction was one of the main criticisms of the first draft of funclets. Since switch is meant for coarse-grained ABI switches, maybe this is more ok - but I think the group's attitude would need to be carefully tested :)

conrad-watt commented 2 weeks ago

Actually I realise that I've failed to hold some of our previous discussion in this thread in my head.

First, the semantics for switch block that @eqrion proposed involved explicitly taking new values off stack at the end to repopulate the parent context, rather than implicitly saving the previous values. This avoids the problem of such implicitly saved values (which may be unshared) getting captured by a shared continuation. However, if you're in a "legacy" function where the context is "unknown", it seems hard to make this work, since you don't know the shape of the context to restore on exit. If you interpret a missing context annotation as an "empty" context, this is still problematic, because if you call such a function from a function which does have a context, and switch in the called function, you'll restore an empty context and thus mess up the caller's context.

On the other hand, if you assume a semantics where the parent context is implicitly saved, this saves you in the "legacy" case, but is problematic for shared continuations since you might capture an unshared thing.

rossberg commented 2 weeks ago

I agree that the typed approach is scary — but the reasons you considered are not even the real problem. It sounds like you expect that we could get away with using unannotated functions/blocks/tags/references/tables/etc in a polymorphic manner. I'm pretty sure that won't fly.

As I see it we'd certainly need to annotate everything everywhere, and that's not practical and has terrible composability. I'd make a bet that no attempt to partially hide away the annotations is gonna work properly, block switch or not. With regard to types it's all or nothing.

@conrad-watt, what you observe above is that the contravariant nature of explicit context restore would immediately destroy even the simplest form of subtype polymorphism on contexts. Not that subtype polymorphism is expressive enough in the first place, as we can learn from the practical failures of effect systems. And the problem of avoiding unwanted capture in the implicit case looks related to the problems the Scala folks ran into with their latest capability-based attempt of typing effects, which requires less annotation but then is too weak to prevent unwanted effect capture unless you start statically tracking "capture sets" everywhere.

lukewagner commented 2 weeks ago

Yes, I think it's reasonable to assume that, like shared, a single context type would permeate every function, and I'm not aware of why we'd need to do anything fancier to try to avoid this. This is how native ABIs work today with their "Thread Information Blocks" / "Thread Control Blocks" and, in our context, it lets the language's compiler control how higher-level (and varied) concepts like thread-local variables are implemented, instead of baking a fixed scheme into the runtime, which is where we're getting all this runtime magic and performance-unpredictability.

conrad-watt commented 2 weeks ago

@lukewagner we'd still have to work out how to deal with the "fork" I outlined above.

If switch block exit explicitly restores the parent context values, then interoperation between "context-annotated" and "legacy" (current) functions seems to break. I think you'd have to brutally spec that switch is simply not allowed in un-annotated functions, and therefore they can't call annotated ones.

If switch blocks instead implicitly save and restore the parent context, the obvious semantics for this is incompatible with shared continuations. I think you'd have to interpret switch blocks as implying a shared barrier. Also, this semantics uses implicit storage, which was one of the objections to the exnref-less exception handling proposal.

titzer commented 2 weeks ago

My two cents: I was intrigued by the notion that we could statically type contexts because it allows various engine optimizations that are basically impossible to do otherwise. For example, working back from the machine code that I'd like to get for certain patterns, statically-typed contexts allow defining what are essentially a set of callee-saved registers across the scope of functions that share a context. That is useful, e.g. to implement an interpreter or state machine that is spread over many functions (think dozens or hundreds) but nevertheless shares a large amount of common state (i.e. the interpreter or state-machine state). A statically-typed context could be register-allocated to a fixed set of registers without an inter-procedural analysis. Since contexts are effectively tied to stacks, they represent a new type of storage that is difficult to emulate efficiently in another way.

That said, working through some of the type system issues with @conrad-watt over the past few days, I am starting to agree with @rossberg that polymorphic effect typing might end up being a quagmire.

lukewagner commented 2 weeks ago

@conrad-watt If we stick with the intuition that contexts are threaded into and out of each call as-if by a ref param/result of the declared context type, and if we consider a legacy/unannotated function as having a declared-empty context, then I think that rules out switch-less prefix-subtyping of contexts, because if f has a non-empty context and calls g with an empty context, then when g returns, f must assume the returned context ref is empty (b/c it's subtyping, not polymorphism). Thus, "context-annotated" functions must always explicitly switch contexts to call "legacy" functions (and switch back at the end of the switch block). ("Legacy" (= empty context) functions could of course symmetrically switch to non-empty-context functions, though.) Also, I think this implies that the explicit switch-at-end-of-block semantics is what we need.

rossberg commented 2 weeks ago

All typing issues aside, what's missing in this discussion is somebody actually working through a formulation of a scheduler implementing green threads and TLS, based on (some variant of) stack switching combined with this feature. I suspect that won't be possible with some of the restrictions that have been suggested.

I also suspect that any block-based or otherwise well-bracketed context semantics is inherently incompatible with the use of a symmetric direct stack switch for implementing scheduler-less green threads, like some folks envision it.

Moreover, my guess is that such green threads will want some kind of contextref to be able to switch between threads with minimal cost. Only if contexts have proper dynamic scope (and I believe only then) would this be unnecessary.

conrad-watt commented 2 weeks ago

@lukewagner ok, I think interpreting existing functions as having an empty context, with no fancy subtyping or polymorphism, works. I think you would still need explicit context annotations on all blocks and tags (to ensure things like inappropriate br out of a switch block would be a type error).

@rossberg I think if we went all the way down the route of switch block (with annotations on blocks, functions, and tags), the whole toolchain would most often just fix a single context across all the functions it knows about, like the ABI @eqrion sketched above. Anything much more complex would likely fail to type (especially in the presence of exceptions/continuations). The boundary between toolchains/ABIs can be crossed with an explicit switch block, but the type system would severely restrict how control flow can go across this boundary. So in your example the scheduler and all the threads scheduled on it would need to agree on a context shape (or at least a very coarse-grained switching discipline) ahead of time as part of a toolchain/linking step - it wouldn't be possible to express a fully generic scheduler (over possible contexts) in pure Wasm (is this a goal?).

rossberg commented 2 weeks ago

@conrad-watt, I do indeed think it should ultimately be possible to write a context-generic scheduler, but even with only a single shape you somehow need to save & restore the contexts when switching threads. I can see how that should work with an indirection through a central scheduler (though I suspect you'd still want contextref to make that cheap), but the devil might be in the details. It's less obvious with a direct stack switch, given a block-like context switch. In both cases satisfying the structure and/or typing may necessitate redundant context switches, which also seems undesirable.

And of course the real tough nut is how to make these solutions scale to multiplexing green threads across multiple hardware threads, i.e., work stealing.

conrad-watt commented 2 weeks ago

It's less obvious with a direct stack switch, given a block-like context switch.

(the below assumes the "explicit restore on exit" interpretation of block context switch, to avoid the issue of capturing the parent context inappropriately)

If you have context types in the tags/continuation types, is there a problem? The typing rule would say that the continuation you want to switch to must have a context type matching your current context.

With a delimited handler, you'd likely need to ensure that the current context in-scope at the handler matches the declared context of all the tags it's handling, which is a brutal type system restriction, but not problematic for the "one shape" case.
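A sketch of what such annotated tags and continuation types might look like (syntax invented for illustration): each records the context shape it expects, so a direct switch only validates where the current context matches.

```wat
;; Hypothetical syntax: context annotations on tags and continuation types.
(tag $yield (context $sched-ctx) (param i32))
(type $thread (cont (context $sched-ctx) (param i32)))
;; A switch to a (ref $thread) would only type-check at program points
;; where the current context is $sched-ctx.
```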

rossberg commented 2 weeks ago

@conrad-watt, consider scheduler-less thread switching. How would a block-scoped context switch work? The switched-to thread does not return the way it's entered. It is sort of like a tail-switch. Hence, there is no obvious extent for a block-like construct. There is no restore point at all. You keep switching to the next context in a round robin fashion.

Even with a scheduler, you probably don't want to restore the previous context redundantly but just switch to the next one right away.

This is not a typing problem, but a basic mismatch with the way context switching would be structured. Types only make it more obvious that something is at odds and that this approach is too narrow.

Semantically, there is no question that this problem is equivalent to establishing proper dynamic scoping. The threaded-context idea could be good as a lower-level primitive, but only as long as it is actually able to emulate dynamic scoping correctly and efficiently — that should be the litmus test.

lukewagner commented 5 days ago

Thinking about it a bunch more, I think a "context" is a linear struct value that:

Because of the linearity, this basically matches the engine-internal context structures that exist today (under various names) that are allocated on entry to wasm (e.g., in the stack frame of the entry trampoline) and are propagated through all calls by pointer. The new thing is allowing guest code to store guest fields inside this context structure.

This understanding suggests some tweaks to the proposal as presented above:

(1) Instead of having a block-like instruction for changing the current context type, we instead track the current context type as part of (or paired with) the current operand-stack type such that it can be strongly updated by any instruction. Thus, I think we can have a simple non-compound context.update $newcx instruction that updates the context type to be $newcx at the next instruction. After a context.update, normal block typing rules would end up forcing context.updates to switch back to the declared end-of-block or end-of-function context type. This also means there is no problem allowing context subtyping when one function calls another; the current context type after the call will just be determined by the callee and thus, if it doesn't match the end-of-block context type, a context.update will be necessary.
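A minimal sketch of this tweak, with invented syntax: the context type is threaded through validation alongside the operand-stack type, and context.update strongly updates it.

```wat
;; Hypothetical syntax; context.update is the instruction sketched above.
(func $g (context $cx1) (result i32)
  context.update $cx2  ;; current context type is now $cx2
  ;; ... run under $cx2 ...
  context.update $cx1  ;; normal block typing forces a switch back to the
  i32.const 0)         ;; declared end-of-function context type, $cx1
```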

(2) Function/effect types should be able to declare two context types: one incoming (as part of parameters) and one outgoing (as part of results). If you don't declare one, it's equivalent to declaring an empty context. Thus a function can declare different incoming and outgoing context types (and I'm even aware of a meaningful use case for having them be different, viz. around context initialization).
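For example (again with invented notation), a context-initialization function could declare an empty incoming context and a populated outgoing one:

```wat
;; Hypothetical notation for separate incoming/outgoing context types;
;; $abi-ctx stands for whatever context shape the toolchain ABI defines.
(func $init-context (context-in) (context-out $abi-ctx)
  ;; compute the thread id, allocate TLS storage, etc., then:
  context.update $abi-ctx)  ;; establishes the declared outgoing context
```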

(Together, I think the above 2 tweaks address some of the incongruencies Andreas was pointing out above.)

(3) Because the expected implementation technique wants to allocate a single fixed-size context structure in the activation record of the entry trampoline (and thread a pointer to it through all transitive calls), we need a low implementation limit on the size of a context type so that context.updates can always store into this fixed-size memory region on the stack (which was allocated without knowledge of all the different context types that it would be context.updated to in the future).

(4) Due to the above-mentioned low upper bound on context size, the right way to achieve our primary goal of shared wasm being able to call thread-local JS functions is to leverage the opaquely-propagated hostcx mentioned above. In particular, the context_bind method that Conrad mentions above would configure JS API state that goes into the hostcx, and then we could add JS built-ins (importable and callable via shared wasm function type) for calling the thread-local JS functions stored in the hostcx. I won't go into details, but I think it should be quite possible to give the JS engine all the upfront invariants it needs to efficiently compile wasm-to-JS calls.

(5) As a consequence of the above, the context type won't need to contain all the thread-local functions as nonshared funcrefs and thus there shouldn't be a need for a context.call. This also avoids situations where I expect folks were thinking we'd need some form of polymorphism to allow parts of the code (e.g., libc) to be compiled without knowledge of the complete list of JS callees. Instead, a small fixed context type should be definable by the toolchain ABI with only a few fields (e.g. shadow-stack pointer, thread-id, TLS array pointer) and used almost everywhere by all code compiled by the same toolchain. Even when threads aren't involved, when there's multi-module linking, the context is a more-efficient place to store the shadow-stack pointer than an aliased (imported/exported) global (which requires an extra indirection). And in the case of single-threaded, single-module code, the context should be no less efficient than a non-aliased global. Thus, the toolchain should be able to have a single fixed context type defined by the ABI for all code, regardless of threading model, without penalizing (and sometimes improving) performance.
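Under these assumptions, the single toolchain-defined ABI context might look something like the following (field set taken from the examples in the paragraph above; the syntax is invented):

```wat
;; Hypothetical definition of a small fixed ABI context type.
(context $abi-ctx
  (field $shadow-stack-ptr i32)  ;; shadow-stack pointer into linear memory
  (field $thread-id i32)         ;; dense per-thread identifier
  (field $tls-base i32))         ;; base address of this thread's TLS area
```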