WebAssembly / shared-everything-threads

A draft proposal for spawning threads in WebAssembly

Making thread-local functions viable #46

Open conrad-watt opened 7 months ago

conrad-watt commented 7 months ago

The state of things

There are two key problems that we absolutely must solve to express any reasonable compilation scheme using this proposal.

  1. How do we support thread-local storage in a compilation scheme where all Wasm functions are marked shared?
  2. How can we call JS functions from shared Wasm functions?

Discussion recap

The initial draft of this proposal aimed to solve (1) with thread-local globals, and (2) with thread-local functions. However, concerns about implementation feasibility were raised for both of these (https://github.com/WebAssembly/shared-everything-threads/discussions/34, https://github.com/WebAssembly/shared-everything-threads/issues/42).

In https://github.com/WebAssembly/shared-everything-threads/issues/42 @eqrion proposed a different approach, which (interpreted minimally) involves using function parameters/contexts to pass around a thread ID (allowing (1) to be handled programmatically) and a record of important JS functions in the current thread (allowing (2) to be handled through call_ref). The latter requires us to relax our interpretation of the shared annotation to allow nonshared reference parameters.
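A minimal sketch of the parameter-passing idea, simulated in plain TypeScript rather than Wasm. All names here (`ThreadContext`, `JsFunctions`, `sharedWork`, `tlsCounter`) are hypothetical and do not come from the proposal; the point is only that a "shared" function can reach thread-specific state and thread-specific JS functions through an explicit context argument.

```typescript
// Hypothetical names throughout: this simulates the calling convention,
// not any real Wasm or JS-API surface.

// Per-thread record of JS functions, standing in for nonshared references.
interface JsFunctions {
  log: (msg: string) => void;
}

// Context passed to every "shared" function as an extra parameter.
interface ThreadContext {
  threadId: number;
  js: JsFunctions;
}

// Thread-local storage handled programmatically: one slot per thread ID.
const tlsCounter: number[] = [0, 0];

// A "shared" function: everything thread-specific arrives via the context.
function sharedWork(ctx: ThreadContext, amount: number): number {
  tlsCounter[ctx.threadId] += amount; // problem (1): TLS indexed by thread ID
  ctx.js.log("thread " + ctx.threadId + " added " + amount); // problem (2)
  return tlsCounter[ctx.threadId];
}

// Two simulated threads, each with its own log sink.
const perThreadLogs: string[][] = [[], []];
function makeContext(threadId: number): ThreadContext {
  return {
    threadId: threadId,
    js: { log: (msg: string) => { perThreadLogs[threadId].push(msg); } },
  };
}

const result0 = sharedWork(makeContext(0), 5);
const result1 = sharedWork(makeContext(1), 7);
```

The `js` field is the part that requires relaxing the shared annotation: in real Wasm it would be a nonshared reference parameter of a shared function.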

This alternate approach has implications for future shared continuations (work stealing) which I talk about here (https://github.com/WebAssembly/shared-everything-threads/issues/44). If we believe shared continuations will exist in the future, we will need both versions of shared at once - shared-fixed (allows non-shared params) and shared-suspendable (disallows as in our current semantics). The latter will need its own mechanism for solving (2), with all of the same constraints we're trying to avoid now. Moreover, we will need to work through the design for composing shared-fixed and shared-suspendable function calls, potentially requiring extra language features to facilitate this (such as the shared-barrier discussed in the issue).

Thesis

The more I work through the details of the above, the more I'm struck by the amount of future core language complication, and standardisation effort, that we could avoid if we can find an acceptable solution for (2) now that also works with the shared-suspendable semantics.

Therefore, I believe we should redouble our efforts towards this end. If we find a solution, we avoid a lot of future mess. For now, I'm happy to consider (1) "minimally solved" through the thread ID-passing strategy.

Some possible ways forward

These are to spark discussion. I hope people can brainstorm variations of these, or other fresh ideas.

Eval in current realm

Introduce into the JS-API a new function, which I'll call eval_realm for simplicity, importable as shared. This takes a shared externref, interpreted as a string, and calls the current JS realm's eval function on this string. As discussed with @syg, shared Wasm functions will need a per-realm prototype, so I think this per-realm dispatch can be semantically justified.

Through compilation scheme wizardry (such as creating an initial table of meaningful strings), arbitrary JS access can be bootstrapped from this function, although it would be quite slow! To make key functions faster, this could be combined with the below strategy, with eval_realm used as a fall-back to invoke arbitrary JS.
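To make the bootstrapping idea concrete, here is a rough sketch in TypeScript. `evalRealm` is a stand-in built on the ordinary global `eval`, not the proposed JS-API function, and `sourceTable` and `callViaEval` are hypothetical names. It assumes arguments are serialized into the source text and that only scalar results come back.

```typescript
// Stand-in for the proposed eval_realm import (an assumption, not a real API):
// indirect eval, so the source runs in the realm's global scope.
const evalRealm = (source: string): unknown => (0, eval)(source);

// Initial table of meaningful strings, as a compilation scheme might emit.
const sourceTable: string[] = [
  "(function (a, b) { return Math.max(a, b); })",
  "(function (s) { return s.length; })",
];

// Scalar-only convention: arguments are serialized into the source text and
// only numbers are accepted as results.
function callViaEval(tableIndex: number, args: (number | string)[]): number {
  const argText = args.map((a) => JSON.stringify(a)).join(", ");
  const value = evalRealm(sourceTable[tableIndex] + "(" + argText + ")");
  if (typeof value !== "number") {
    throw new Error("only scalar results are supported in this sketch");
  }
  return value;
}

const largerValue = callViaEval(0, [3, 9]);
const textLength = callViaEval(1, ["shared"]);
```

Every call re-parses source text, which is one reason the approach would be slow and why a faster path for key functions is attractive.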

Make more JS builtins shared

In the spirit of @eqrion's string-builtins, expose a larger list of functions (such as Math.*) that are importable as shared. Possibly come up with some lightweight standards process to add additional functions. This has the bonus of clearly supporting inlining optimisations, but doesn't allow the execution of arbitrary user-defined JS, unless combined with the above eval_realm strategy.

Revisit thread-local functions

I still think a weak form (ref "flavor 2" here) of thread-local function could potentially be viable in the short-term. The semantics I envisage: the thread-local function's ephemeron in a thread is guaranteed not to be collected only so long as the thread-local function is rooted through purely non-shared references in the same thread. If the above ever becomes untrue, future calls to the function in this thread may non-deterministically trap.

There are two objections:

  1. this exposes implementation-specific behaviour due to the non-deterministic trapping if the function was ever eligible to be collected;
  2. this still requires some GC engineering, to ensure that the thread-local GC knows how to follow non-shared references up to the shared thread-local function, and handle the keeping-alive of the underlying ephemeron correctly in this case. My semantics above allows some simplifying assumptions:
    1. there's no need to transitively walk the shared heap, only immediate peeks into it to find thread-local functions are needed;
    2. it's safe for the thread-local GC to walk into the thread-local function, since scary changes to the shared heap can only happen during a stop-the-world, which can't be carried out while the thread-local GC is taking place.

I believe the objections above could possibly be overcome if our alternative solutions are unattractive or require greater implementation effort, and through comparison to the implementation-specific behaviours already exposed through WeakRef and FinalizationRegistry.

If there are orthogonal concerns about the implementation complexity of thread-local function bind, it's possible to remove this function, creating a restricted form of thread-local function that can only be called in the thread in which it is first created/bound. Note the similarities to @syg's sketch for thread-bound JS objects. To facilitate "cross-thread" calls of (e.g.) console.log, the compilation scheme would need to create a thread-bound-wrapped console.log in each thread, and use the thread ID to dispatch to the correct one (so each call site for console.log in Wasm becomes a lookup in some table of thread-bound functions, based on the thread ID).
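The per-thread dispatch described above might look roughly like the following TypeScript simulation. `ThreadBound`, `bindToCurrentThread`, `logTable`, and `sharedLog` are hypothetical names, threads are simulated with a `currentThread` variable, and a string array stands in for the console.

```typescript
// A thread-bound function may only be called on the thread that created it.
interface ThreadBound {
  ownerThread: number;
  call: (msg: string) => void;
}

let currentThread = 0; // simulated "which thread is running"
const logOutput: string[] = []; // stands in for the console

function bindToCurrentThread(fn: (msg: string) => void): ThreadBound {
  const owner = currentThread;
  return {
    ownerThread: owner,
    call: (msg: string) => {
      if (currentThread !== owner) throw new Error("called on wrong thread");
      fn(msg);
    },
  };
}

// One wrapped logger per thread, indexed by thread ID.
const logTable: ThreadBound[] = [];
for (let t = 0; t < 3; t++) {
  currentThread = t;
  logTable[t] = bindToCurrentThread((msg: string) => {
    logOutput.push("[t" + t + "] " + msg);
  });
}

// What each Wasm call site for console.log becomes: a lookup by thread ID.
function sharedLog(threadId: number, msg: string): void {
  logTable[threadId].call(msg);
}

currentThread = 1;
sharedLog(1, "hello");
```

Passing the wrong thread ID fails loudly in this sketch; what a real engine would do in that case (trap, throw, or something else) is not settled by the discussion.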

Thread-bound and weak thread-local functions still give us a forward-compatible path towards the "ideal" semantics. The former can be accomplished by re-introducing bind. The latter, through interpreting the strong semantics as turning the non-deterministic successes/failures of the weak semantics into deterministic successes.

eqrion commented 7 months ago

This alternate approach has implications for future shared continuations (work stealing) which I talk about here (https://github.com/WebAssembly/shared-everything-threads/issues/44). If we believe shared continuations will exist in the future, we will need both versions of shared at once - shared-fixed (allows non-shared params) and shared-suspendable (disallows as in our current semantics). The latter will need its own mechanism for solving (2), with all of the same constraints we're trying to avoid now.

In case it was missed, #42 did include a sketch for supporting shared continuations calling into local JS using the shared-barrier and context locals.

Eval in current realm

How is eval_realm expected to return results back to wasm? If it's through an 'unshared' externref, then this seems like it gets you back to splitting the function types again, at which point using params/context seems better.

But maybe you could restrict to shared results? Still seems slow though.

Make more JS builtins shared

For builtins like Math, this makes sense to me and I'd support regardless. When it comes to more complex things like console.log, fetch, DOM APIs, etc., I believe this would require per-standard work to define what it means for each thing to be shared, and to remove references to realm-local things. I don't have a sense of how much work this would be, or how many would need to be done to be considered viable for toolchains to target this.

Revisit thread-local functions

wrt. the 'weak form' here, I am still skeptical that this is viable. This means that toolchains must manually 'root' all their thread-local functions so that they're always reachable through unshared references. We also need to come up with some definition of what refs are shared/unshared (do instance imports count?). And it has also sounded like some engines wish to move to the strong semantics (which would be allowed by the weak semantics), at which point engines that only implement the weak semantics are at high compat risk of breakage.

When it comes to 'thread bound' references, are the 'weak' semantics acceptable for the use-case in #37 (holding a DOM node alive by a shared object)? That would require rooting every DOM node referenced in a ThreadBoundBox so that they don't get collected. That, again, seems fragile, especially in a world where some engines may implement the strong semantics.

tlively commented 7 months ago

@eqrion, would you be ok with the strong shared-to-unshared GC semantics with no guarantee of cycle collection? (Or perhaps a promise of no cycle collection for compatibility?) This would be equivalent to the semantics of supporting shared-to-unshared edges in FinalizationRegistry but not WeakMap, as we discovered with @lukewagner in the last meeting.

I would strongly prefer that semantics to the weak semantics.

That would be enough to let us have thread-local functions callable as normal (shared) imports and thread-bound data to let shared functions pass arbitrary JS objects around as (shared externref).

lukewagner commented 7 months ago

Thinking about the difference between FinalizationRegistry and WeakMap a bit more after the last meeting, I realized that the fundamental difference is that the FinalizationRegistry simply isn't creating an edge from the key to the postmortem callback at all: it's just rooting the postmortem callback and that is it, as far as the GC is concerned. Thus, I don't think we should consider a (key, callback) registration in a FinalizationRegistry to be a shared-to-unshared edge at all.

I haven't had time to consider it fully, but I also liked the sound of what @eqrion suggested in #42 for solving problem (2) because, while (2) may be showing up acutely in a JS context right now, it seems like the problem is not limited to JS and could come up in a pure wasm setting. E.g., let's say I create N unshared module instances in N Web Workers, and I have them all import 1 shared module instance, and I want the shared module instance to be able to call back into the calling unshared module instance (which is really just restating problem (2) in pure-wasm terms). It's hard to say how valuable this pure-wasm scenario is, but giving wasm expressive parity does seem generally good if it's not insanely complicated.

But lastly, if we're exploring alternative solutions to (2) that work at the JS API level to avoid adding two kinds of shared at the core wasm level: could we also consider the idea (2) I proposed in this comment (https://github.com/WebAssembly/shared-everything-threads/discussions/34#discussioncomment-8237794), which coincidentally also broke down the problem into the same two subproblems (1) and (2)? In particular, it has no special GC interactions and seems sufficiently optimizable. I don't know if it was unclear and I should explain it in more detail, or if there was a problem with it or requirement it wasn't meeting, though.

conrad-watt commented 7 months ago

In case it was missed, https://github.com/WebAssembly/shared-everything-threads/issues/42 did include a sketch for supporting shared continuations calling into local JS using the shared-barrier and context locals.

I think I'd mentally lumped this into a "really add a context local feature" bucket which I'm still not sure about. I do like the idea that at each resume point you just naturally have to reconstruct the current environment/thread's context. One could almost think of the context locals as the only bit of state that isn't captured by a continuation. This interpretation makes them meaningfully distinct from function arguments and regular locals.

Taking this last point further, instead of saying that resuming the shared continuation requires the reconstruction of the unshared state, could we say that resuming any continuation must inherit the current context at the resumption point wholesale? The context shape would be part of the type of the function/continuation, just like regular function arguments, with a static error if the context at the resumption point doesn't have the right matching shape. This is like fixing an ABI that must be respected in order to interact with captured continuations. We could make switching contexts imply a suspend barrier to facilitate this.

With this, there's a minimal design that gets us JS interaction that assumes all shared functions have shared-suspendable semantics, so no need for shared-fixed:

Add context locals with just a context_get instruction (mainly used to get the thread ID; not allowed on nonshared refs while in a shared function) and a context_call instruction (allowed even on nonshared funcrefs while in a shared function; interpret as morally shared-barrier + context_get + call_ref). To be friendly, we could add explicit instructions like shared-barrier (potentially obviating the need for a separate context_call instruction), context_set, context_switch, call_ref_with_explicit_context, and resume_with_explicit_context, but they wouldn't be necessary for an MVP. For all calls and resumptions, the context types/shapes of the caller and callee must match (modulo any explicit context switching done through JS and/or context_switch, both of which would imply suspend barriers).

How is eval_realm expected to return results back to wasm?

I think we'd have to put our Wasm 1.0 hats on, and assume that any functions called in this way can't meaningfully pass references back into shared Wasm (except through a scalar indirection) until JS itself has shared objects.


Thus, I don't think we should consider a (key, callback) registration in a FinalizationRegistry to be a shared-to-unshared edge at all.

I'm gesturing at FinalizationRegistry in a different way this time, just to point out that it reveals implementation-specific GC behaviour - so a weak semantics for refs that reveals implementation-specific GC behaviour shouldn't be totally off the table.

The reason I think FinalizationRegistry is connected to shared-to-unshared edges is because if the implementation can consistently trigger the callback when a shared key is fully collected cross-thread, this capability also allows the implementation of shared-to-unshared ephemerons (through the "strong table cleared by callback" strategy I sketched in the meeting). Therefore if we think a well-behaved FinalizationRegistry with shared keys is viable, we should also consider viable all the features that assume shared-to-unshared ephemerons are ok.

if we're exploring alternative solutions to (2) that work at the JS API level to avoid adding two kinds of shared at the core wasm level: could we also consider the idea (2) I proposed in https://github.com/WebAssembly/shared-everything-threads/discussions/34#discussioncomment-8237794

I think I don't have a clear picture of how this would work. If the outer JS function wrapper appears simply as unshared, how does it get called from shared Wasm? If the intention is that the underlying "wrapped" shared function is still callable, what is the semantics if it's called in a thread where the unshared wrapper was collected? Does this end up looking similar to the weak semantics for thread-local functions? (edit: although if so, one advantage of your approach is that much less concrete GC engineering is needed, compared to my weak thread-local functions sketch in the OP).

lukewagner commented 7 months ago

One could almost think of the context locals as the only bit of state that isn't captured by a continuation. This interpretation makes them meaningfully distinct from function arguments and regular locals.

I could be wrong here and @eqrion please tell me if so, but my understanding of context-local storage in #42 (which I understood as an excellent type-y generalization of the stack-local storage I suggested in #34) is that context-local storage is mostly like function arguments and locals in that the storage does follow the stack around, just like params/locals. It's only the special case of nonshared references which must be reset to default values upon stack-switch. But, for example, if you store the shadow-stack pointer (an i32) in context-local storage, it will follow the stack around from thread to thread (which is what you want).

if the implementation can consistently trigger the callback when a shared key is fully collected cross-thread, this capability also allows the implementation of shared-to-unshared ephemerons (through the "strong table cleared by callback" strategy I sketched in the meeting). Therefore if we think a well-behaved FinalizationRegistry with shared keys is viable, we should also consider viable all the features that assume shared-to-unshared ephemerons are ok.

Maybe I'm missing the nuance you're getting at, but I thought that the key difference we discussed in the meeting was, when you attempt to use a FinalizationRegistry in this way, it prevents cycles from being collected that a true shared-to-unshared GC edge (say, of the kind you would get from a shared-to-unshared WeakMap) would allow to be collected. Hence, they're not edges (nor, as best as I can tell, ephemerons, which also don't mark the value unless the key is reachable).

I think I don't have a clear picture of how this would work.

Given that you're already familiar with them, I think the best way to understand the JS API approach I was suggesting in #34 is in terms of algebraic effects: calling from JS into wasm would install a "call-unshared" effect handler, then we introduce a JS built-in that can be imported by wasm as a shared function and, when you call it, it performs the "call-unshared" effect. The type of the "call-unshared" effect would be [i32, T*] -> [U*] (polymorphic, determined by the core wasm import type). The "call-unshared" handler then simply performs a call_indirect into a bound (to the JS-to-wasm export wrapper function) WebAssembly.Table of nonshared funcrefs (which could be JS functions wrapped via new WA.Function()) and resumes the continuation with the return value. To be clear, the actual implementation wouldn't need to suspend/resume -- it could just do a synchronous indirect function call with a cheap guard, but hopefully this illustrates the idea of how to call from a shared function into the dynamic enclosing nonshared context without any weak or shared-to-unshared edges.
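As a rough illustration of the synchronous version of this idea, the sketch below simulates it in TypeScript. `enterWasm`, `callUnshared`, and `handlerStack` are hypothetical names, the polymorphic type [i32, T*] -> [U*] is narrowed to numbers for brevity, and a plain array stands in for the bound WebAssembly.Table.

```typescript
type UnsharedFn = (...args: number[]) => number;

// Installed "call-unshared" handlers: one bound table per active
// JS-to-wasm entry on this thread, innermost last.
const handlerStack: UnsharedFn[][] = [];

// JS-to-wasm export wrapper: installs its bound table for the dynamic
// extent of the call and removes it afterwards.
function enterWasm<R>(boundTable: UnsharedFn[], body: () => R): R {
  handlerStack.push(boundTable);
  try {
    return body();
  } finally {
    handlerStack.pop();
  }
}

// The builtin importable as shared: performs the "call-unshared" effect as a
// guarded, synchronous indirect call into the current table.
function callUnshared(index: number, ...args: number[]): number {
  if (handlerStack.length === 0) {
    throw new Error("no call-unshared handler installed");
  }
  const table = handlerStack[handlerStack.length - 1];
  const target = table[index];
  if (target === undefined) throw new Error("table index out of bounds");
  return target(...args);
}

// A "shared" function body: the same code, whichever thread entered it.
function sharedBody(): number {
  return callUnshared(0, 20, 22);
}

// Two different entries bind different nonshared functions at index 0.
const sumResult = enterWasm([(a, b) => a + b], sharedBody);
const productResult = enterWasm([(a, b) => a * b], sharedBody);
```

The shared code never holds a reference to the nonshared functions; it only names them by index, so no shared-to-unshared edge exists for the GC to trace.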

(Now that I talk through this, though, it does really seem like something we could just as well do in core wasm; there's nothing JS-y about it.)

conrad-watt commented 7 months ago

context-local storage is mostly like function arguments and locals in that the storage does follow the stack around, just like params/locals. It's only the special case of nonshared references which must be reset to default values upon stack-switch. But, for example, if you store the shadow-stack pointer (an i32) in context-local storage, it will follow the stack around from thread to thread (which is what you want).

I think this fits the sketches from previous issues. What I'm proposing is essentially an alternative design that runs with this "special case" - quote:

instead of saying that resuming the shared continuation requires the reconstruction of the unshared state, could we say that resuming any continuation must inherit the current context at the resumption point wholesale?

This means that we no longer need to reason about context locals being captured (edit: and thus a shared-fixed/suspendable distinction is no longer necessary) - they really are just a record hanging off a pinned register, and if you want to resume a continuation, you need to make sure that the shape of the record fits what that continuation expects.

In practice, I think toolchain ABIs would use a fixed list of JS functions for their context in all shared Wasm functions, so continuations will "just work". If this is a long list, maybe it makes sense to go with shared-barrier rather than a special-cased context_call, so that the JS functions can still be called if boxed inside an array/struct.

EDIT: and moving between ABIs is still possible through some context_switch block, but this would imply a suspend barrier so it's still not possible to capture any context locals - in a suspend handler, it's therefore known that the suspended code must have the same context locals that the handler has

conrad-watt commented 7 months ago

Splitting this edit out into a separate comment as I think it's an important point:

The behaviour I'm proposing is also more consistent with an interpretation of context locals as a version of thread-local storage that is only scoped to a Wasm call stack (i.e. allocated/determined at the JS->Wasm transition point). If you resume a continuation in another thread, you get the context locals for that thread's current call stack, instead of attempting to capture the suspendee's original "call-stack-local storage" which conceptually lives at the base of the stack and not in the portion that was suspended.
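One way to picture this semantics is the TypeScript simulation below. It models a continuation as a closure and the context as a single variable set at each JS->Wasm entry; `CallStackLocals`, `enter`, `contextGet`, and `computeThenSuspend` are hypothetical names, and real stack switching is not involved.

```typescript
interface CallStackLocals {
  threadId: number;
  log: (msg: string) => void;
}

// The "pinned register": set at each JS->Wasm entry, never captured.
let pinnedLocals: CallStackLocals | null = null;

function contextGet(): CallStackLocals {
  if (pinnedLocals === null) throw new Error("not inside a Wasm call stack");
  return pinnedLocals;
}

// JS->Wasm transition point: allocates/determines the call-stack-locals.
function enter<R>(locals: CallStackLocals, body: () => R): R {
  const saved = pinnedLocals;
  pinnedLocals = locals;
  try {
    return body();
  } finally {
    pinnedLocals = saved;
  }
}

type Continuation = () => number;

// Shared code that suspends. The closure captures ordinary locals (x) but
// reads the context again at resumption instead of capturing it.
function computeThenSuspend(): Continuation {
  const x = 10;
  contextGet().log("suspending");
  return () => {
    const ctx = contextGet(); // the resuming call stack's locals
    ctx.log("resumed");
    return x + ctx.threadId;
  };
}

const threadLogs: string[][] = [[], []];
function localsFor(threadId: number): CallStackLocals {
  return {
    threadId: threadId,
    log: (msg: string) => { threadLogs[threadId].push(msg); },
  };
}

// Suspend on simulated thread 0, resume on simulated thread 1.
const pendingContinuation = enter(localsFor(0), computeThenSuspend);
const resumedValue = enter(localsFor(1), pendingContinuation);
```

After resumption the code logs through thread 1's function and sees thread 1's ID, while the ordinary local `x` travels with the continuation.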

lukewagner commented 7 months ago

Ah hah, I see it now, and I really like it. So, iiuc, your proposal implies a simple GC stack root and thus avoids all the GC complexities. It also seems efficient to implement (no extra anything on indirect or import calls). If that's right, it would seem to check all the boxes. Would you like to name your new proposal (to distinguish it from the last N iterations)?

conrad-watt commented 7 months ago

We could call this semantics "call-stack-locals"? I actually like the "context locals" name more (if we have the switch safety valve) but I agree it's valuable to disambiguate from previous versions (at least for now). "Call-stack context locals" might also work but is a mouthful.

EDIT: let's just call them "call-stack-locals" so long as they're one of several options

lukewagner commented 7 months ago

Yes, "context locals" does sound like the right name; perhaps we could call your proposal "context locals variant 2" in the short-term and drop the "variant 2" if we go with it once the dust settles.

conrad-watt commented 7 months ago

It's worth noting that your point here

But, for example, if you store the shadow-stack pointer (an i32) in context-local storage, it will follow the stack around from thread to thread (which is what you want).

is fair. I think this is manageable with the trick @tlively laid out previously - if you want such a value to be captured by the continuation, just read it into a stack slot or local variable before suspending, and then "restore" it upon resumption. In general I think the ABI will need to coordinate carefully on how the shadow-stack pointer is managed in the presence of shared continuations no matter what specific semantics we pick.
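A tiny sketch of that save-and-restore trick, again simulating a continuation as a TypeScript closure. `sspRegister` and `suspendPreservingSsp` are hypothetical names, and the numeric values are arbitrary.

```typescript
// Mutable scalar context local holding the shadow-stack pointer (simulated).
let sspRegister = 0;

type ResumeFn = () => number;

function suspendPreservingSsp(): ResumeFn {
  const savedSsp = sspRegister; // read into an ordinary local before suspending
  return () => {
    sspRegister = savedSsp; // "restore" upon resumption
    return sspRegister;
  };
}

sspRegister = 4096; // value on the suspending call stack
const resumeLater = suspendPreservingSsp();
sspRegister = 65536; // the resuming call stack has its own value
const sspAfterResume = resumeLater();
```

The restore overwrites whatever the resuming call stack had, so an ABI would also need to decide who puts that value back afterwards.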

eqrion commented 7 months ago

@tlively

@eqrion, would you be ok with the strong shared-to-unshared GC semantics with no guarantee of cycle collection? (Or perhaps a promise of no cycle collection for compatibility?) This would be equivalent to the semantics of supporting shared-to-unshared edges in FinalizationRegistry but not WeakMap, as we discovered with @lukewagner in the last meeting.

I would strongly prefer that semantics to the weak semantics.

That would be enough to let us have thread-local functions callable as normal (shared) imports and thread-bound data to let shared functions pass arbitrary JS objects around as (shared externref).

I have to think about that a bit. I'm not sure how that would be specified or communicated to users. As Luke said above, the FinalizationRegistry isn't really an edge, just a callback, so its inability to collect cycles makes sense (and users are likely able to reason through that). Whereas here, we'd have something that is an edge in the graph, but we'd just be saying that cycles through it are not collectible.

@lukewagner @conrad-watt

Re: call-stack-locals, that seems like another interesting way to do it. It may be cleaner than specifying some ad-hoc 'shared context locals gets defaulted on suspend'.

@conrad-watt

is fair. I think this is manageable with the trick @tlively laid out previously - if you want such a value to be captured by the continuation, just read it into a stack slot or local variable before suspending, and then "restore" it upon resumption. In general I think the ABI will need to coordinate carefully on how the shadow-stack pointer is managed in the presence of shared continuations no matter what specific semantics we pick.

I wrote/said this somewhere else, but don't want it lost. One engine difficulty for TLS globals is that if the engine is pinning the TLS block to a register (as expected for performance), when resuming a shared continuation we need to iterate over all the stack frames we're resuming to look up the new TLS blocks for the new thread. I don't think this will be cheap; it's at least a linear cost. The other alternative would be to re-lookup TLS blocks whenever returning across a module boundary (in addition to when entering a new module), but that adds cost even if you're not suspending.

But then, if it's expected that toolchains will just undo all of this by setting the SSP to the value it had on the old thread, all of that work is wasted.

conrad-watt commented 7 months ago

Re: call-stack-locals, that seems like another interesting way to do it. It may be cleaner than specifying some ad-hoc 'shared context locals gets defaulted on suspend'.

The thing that gives me most hope is that it appears this design variant actually lets us stick with just the shared-suspendable semantics (maybe supplemented with shared-barrier, although it seems we can get surprisingly far without it).

Iterating a little further - it probably helps V8's desired inlining optimisations if at least some context locals can be marked immutable. This would mean that code could be speculatively optimised to assume a particular context (and hence particular JS context_calls could be inlined), and slow-path tests (in the case of a different context) would only be needed at JS->Wasm boundaries and calls within context_switch blocks.

EDIT: and maybe on the JS side you'd want to eagerly bind (at least the immutable parts of) a context to a function during thread setup in order to make this test particularly fast

conrad-watt commented 7 months ago

One thing I'm trying to work through mentally WRT context locals is reentrancy. Is it ever expected that one might want to have a call graph that looks like Wasm(1)->JS->Wasm(2) and have Wasm(2) know where the stack pointer is from Wasm(1)'s perspective? If so, some care is needed to thread the stack pointer (held in Wasm(1)'s context local) through the JS frame so that it can be put in Wasm(2)'s context local. I think it's doable with careful ABI arrangement, and in any case the same concerns would occur in the naive "stack pointer in function argument" solution.
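The careful ABI arrangement could look something like this TypeScript simulation, in which the JS frame receives the stack pointer as an ordinary argument and hands it to the nested entry. `spLocal`, `enterWithSp`, `jsGlue`, `wasm1`, and `wasm2` are hypothetical names, and the 16-byte frame size is arbitrary.

```typescript
// Context local holding the stack pointer; -1 means "outside Wasm".
let spLocal = -1;

// JS->Wasm entry: installs an initial stack pointer for this activation.
function enterWithSp<R>(initialSp: number, body: () => R): R {
  const saved = spLocal;
  spLocal = initialSp;
  try {
    return body();
  } finally {
    spLocal = saved;
  }
}

// Wasm(2): reports the stack pointer it was entered with.
function wasm2(): number {
  return spLocal;
}

// The JS frame in the middle: forwards Wasm(1)'s stack pointer explicitly.
function jsGlue(spFromCaller: number): number {
  return enterWithSp(spFromCaller, wasm2);
}

// Wasm(1): allocates a shadow-stack frame, then calls out to JS.
function wasm1(): number {
  spLocal -= 16;
  const seenByInner = jsGlue(spLocal);
  spLocal += 16;
  return seenByInner;
}

const spSeenByInner = enterWithSp(1024, wasm1);
```

Here Wasm(2) starts from 1008, just below Wasm(1)'s frame, so the two activations do not overlap on the shadow stack.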

tlively commented 7 months ago

Is it ever expected that one might want to have a call graph that looks like Wasm(1)->JS->Wasm(2) and have Wasm(2) know where the stack pointer is from Wasm(1)'s perspective?

Yes, this pattern is extremely common in Emscripten.

eqrion commented 7 months ago

@eqrion, would you be ok with the strong shared-to-unshared GC semantics with no guarantee of cycle collection? (Or perhaps a promise of no cycle collection for compatibility?) This would be equivalent to the semantics of supporting shared-to-unshared edges in FinalizationRegistry but not WeakMap, as we discovered with @lukewagner in the last meeting.

I would strongly prefer that semantics to the weak semantics.

That would be enough to let us have thread-local functions callable as normal (shared) imports and thread-bound data to let shared functions pass arbitrary JS objects around as (shared externref).

I've thought about this a bit more and here's some more thinking on this.

I think some form of 'strong' semantics (a shared-to-unshared edge can keep an unshared thing alive even if the shared thing is only reachable from a different thread) is necessary for the ThreadBoundBox idea to be useful.

With 'weak' semantics, toolchains would have to have the unshared thread-bound value be rooted in some unshared JS value in addition to the shared data structure, which would then be equivalent (from a lifetime perspective) to just storing a handle in the shared value that references the rooted unshared value and having manual memory management. That comes at the cost of GC engine support that I don't think would give you much ergonomic advantage over using handles.
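For comparison, the handle-based alternative mentioned here can be sketched as a small per-thread table in TypeScript. `HandleTable` and the surrounding names are hypothetical, and a plain object stands in for a DOM node.

```typescript
// Per-thread table that roots unshared values and hands out integer handles.
class HandleTable<T> {
  private slots: (T | undefined)[] = [];
  private freeList: number[] = [];

  // Roots the value and returns a scalar handle safe to store in shared data.
  alloc(value: T): number {
    const reused = this.freeList.pop();
    const handle = reused !== undefined ? reused : this.slots.length;
    this.slots[handle] = value;
    return handle;
  }

  get(handle: number): T {
    const value = this.slots[handle];
    if (value === undefined) throw new Error("stale or invalid handle");
    return value;
  }

  // Manual memory management: the value stays alive until this is called.
  free(handle: number): void {
    if (this.slots[handle] === undefined) throw new Error("double free");
    this.slots[handle] = undefined;
    this.freeList.push(handle);
  }
}

const nodeTable = new HandleTable<{ tag: string }>();
const nodeHandle = nodeTable.alloc({ tag: "div" });

// The shared data structure stores only the scalar handle.
const sharedRecord = { nodeHandle: nodeHandle };
const tagSeen = nodeTable.get(sharedRecord.nodeHandle).tag;

nodeTable.free(nodeHandle);
```

Forgetting `free` leaks the value, and using a handle after `free` fails or, once the slot is reused, silently refers to a different value; these are the usual hazards of manual memory management.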

With the 'strong' semantics that do not have a guarantee of cycle collection (or promise no cycle collection), my concern is about how this appears to users. If we require developers to use FinalizationRegistry and post-mortem callbacks, it seems clear to users that they are doing some manual memory management and should be careful about leaks/cycles. If we're creating a new kind of GC edge that doesn't collect cycles, that expectation is not present, as it just looks like every other kind of reference.

The above is mostly in reference to ThreadBoundBox but it seems like it should also apply to thread-local function unless there is some argument for why it should be an exception.

tlively commented 7 months ago

I agree that weak semantics are no better than storing rooted handles like linear memory languages must do today. I don't think that is sufficient to meet the needs of languages targeting WasmGC.

That leaves us with strong semantics with or without cycle collection. I agree that there are risks of surprising memory leaks if we don't support cycle collection, so I would prefer to support cycle collection if we can get away with it technically.