Proposal: Linear-Memory GC-Root Marking

A frequent request has been to provide better support for implementing garbage collection in linear memory. At present, the main obstacle is the inability to scan the WebAssembly stack for GC roots, as is standard in GC implementations. This proposal outlines a way to efficiently, safely, and compactly support such functionality.

Illustrative Example

The following is the outline of a module using the new proposal to scan the stack for linear-memory GC roots.

(module
  (memory ...)
  (local-mark $gc_root i32)
  ...
  (func $example_method_implementation
    (param $this_pointer i32)
    (local $array_index i32)
    (local $array_pointer i32)
    (marked-locals $gc_root $this_pointer $array_pointer) ; not $array_index
    ... ;; instructions implementing method body
  )
  ...
  (func $mark_gc_root
    (param $gc_root i32)
    ... ;; instructions for the GC's gc-root-marking process
  )
  ...
  (func $scan_for_gc_roots
    (enumerate-marked-locals $gc_root $repeat_with_next_marked_local
      (call $mark_gc_root) ;; the value of the marked local is on the stack
      (br $repeat_with_next_marked_local)
    end)
  )
  ...
)

In this module, there are three new constructs:

local-mark: part of the tag section; declares a new mark (whose index is referenced by $gc_root) that can be attributed to locals of type i32
marked-locals: optionally immediately follows the declaration of locals, specifying a mark (in this case $gc_root) and a list of locals the mark should be applied to (in this case $this_pointer and $array_pointer)
enumerate_marked_locals: an instruction block whose body is executed with the value of the most immediately enclosing local marked with the specified mark (in this case $gc_root); the instruction introduces a label (referenced by $repeat_with_next_marked_local) that is branched to in order to repeat the body with the next-most enclosing $gc_root-marked local (exiting the block if there isn't one)

Taken all together, the effect of these constructs is that, if $scan_for_gc_roots is called during the execution of $example_method_implementation, then eventually the values of $this_pointer and $array_pointer will each be passed as the argument to calls to $mark_gc_root.

Design Overview

In order to make the feature efficient and compact, the key insight is that locals in a function conceptually describe the stack frame of the function. There are safety and performance issues with letting one examine this stack frame arbitrarily, but we can address those issues by explicitly marking which locals can be examined.

The tag section is extended to include local marks (local-mark $mark (type mutable?)), which each specify the type of local the newly defined mark can apply to and can optionally indicate that the mark is mutable.

A function header, then, is extended to optionally specify a local-marking list: (mark-locals $mark local_idx+)? (where each local_idx list is strictly ascending). This list indicates which locals in the frame have been marked and with which mark. The requirement is that the type of each specified local must match the associated type of the mark (a subtype if the mark is not mutable, and exactly if the mark is mutable).

Lastly, the instruction set is extended with the instruction enumerate_marked_locals $mark instr* end : [] -> []. If t is the associated type of $mark, then the body instructions instr* must have type [t] -> []. The body instructions are also given access to a label of type [] and, if $mark is mutable, a label of type [t]; in the example, the former label is referenced by $repeat_with_next_marked_local. enumerate_marked_locals searches the stack for the first enclosing label tagged with $mark and hands its current value to instr*. If instr* branches to the provided label of type [], then the stack is searched for the next appropriate local and instr* is executed with its current value. If instr* branches to the provided label of type [t], the value of the current local at hand is updated and then the stack is searched for the next appropriate local and instr* is executed with with its current value. If at any point there are no more appropriate locals, or the end of instr* is reached, then control jumps to end, thereby exiting the loop.

Application: Garbage Collection

This feature is focused on garbage collection, for which language runtimes implemented in WebAssembly generally want root-scanning to be very efficient. The feature is designed so that no function calls or non-local control transfers are required during the scan. The feature is also designed so that it is easy for engines to easily and compactly decorate the stack with the information necessary to determine where the roots are in its representation of the stack frame, and so that engines can freely optimize unmarked locals as before. The example above illustrates how one can use this feature to implement a 32-bit garbage collector. In the case of a moving GC, one uses a mutable mark, and $scan_for_gc_roots updates the local to the new address of the data. Note that the language runtime implemented in WebAssembly is responsible for ensuring its garbage collector is run sufficiently often.

Extension: Concurrent Garbage Collection

The stack-switching proposal adds the possibility of concurrent garbage collection. To support this, this proposal would furthermore add enumerate_marked_locals_in_task/stack/fibre/continuation $gc_root instr* end : [taskref/stackref/fibreref/contref] -> [], which enumerates all the locals on a suspended stack (which is locked until the loop is exited).

Proposal Process: Phase 0

Do not take this as a final design; there are many variations on and questions for this design. This write-up is meant as a starting point should the community decide there is enough interest in the overall direction to advance the proposal to Phase 1. (And thanks to those who already provided suggestions on how to improve this write-up!) The primary purpose of this thread is to gauge interest as well as to collect high-level concerns that should be addressed or kept in consideration.

Thanks for proposing this! The design looks very interesting. Unity3D engine utilizes C# as its main language, with the Boehm GC, and there are two challenges we are facing in that arena:

since the managed pointers are wasm locals, scanning the LLVM/Emscripten stack does not do any good since there are no managed pointers there, and we can run the Boehm GC only after all managed C# functions have been returned from in the callstack (e.g. right before a requestAnimationFrame() callback is about to exit). This means GC can only work after each rendered 3D application frame, and if C# code would temporarily trash through more C# objects than there is available amount of wasm heap exists, it will OOM, even though one could have succeeded if one had been able to GC in the middle of a frame (or at an OOM event handler).
we are unable to deliver any multithreading support for the C# compiled code, because multithreaded garbage collection is not today feasible, as it would require being able to synchronously pause all the other running Web Workers that potentially could have C# code in their stack frames.

In the above issue that @kripken linked, I highlighted a manual annotation mechanism that I had observed we could play around with, which I was hoping that would be modifiable to something that would not make LLVM codegen quite so pessimistic.

That manual annotation "hack" and the proposal in this issue would both solve 1. , though I think it is critical that a solution that would scale for the multithreading problem 2. is also found. There the requirement for synchronous pausing of Web Workers is difficult.

I wonder how feasible it would be to spec that enumerate-marked-locals would traverse through marks on all wasm threads, and the execution of all other threads that do have any marks in them would then synchronously pause? Or I wonder if there might be some other mechanism that could work for the multithreaded GC scenario?

Oh, duh, I had meant to discuss the widely used Boehm-Demers-Weiser conservative garbage collector, but it slipped my mind when hastily putting together the talk.

I agree with your assessment of 1 versus 2. However, the issues for 2 are major challenges of parallel garbage collection in general. That's not to say this proposal doesn't help, but it has to be one part of a bigger solution.

For example, a very conservative approach would be to have a shared flag that gets set whenever it's time to collect garbage. Every now and then every thread checks this flag and pauses itself when it's set. Once all threads have done so, then you run the "scan and mark roots" function on each of them, and then you run the garbage collector, and then you resume the threads.

In reality, you can likely do much of that in parallel. I'm just illustrating how to deal with a worst-case scenario. In better scenarios, when a thread sees the flag it simply runs the "scan and mark roots" function, reports that it has done so, and then keeps going; the GC then runs in parallel after all threads have reported in. How much parallelism is possible all depends on the details of the garbage collector. I highly suspect the Boehm-Demers-Weiser garbage collector supports much more parallelism, but I'd have to dig into its documentation to determine the specifics.

Does that make sense, @juj?

a very conservative approach would be to have a shared flag that gets set whenever it's time to collect garbage. Every now and then every thread checks this flag and pauses itself when it's set.

This scheme is something that I think is called cooperative signaling. (or effectively the same kind of feature)

For posterity, in comparison, native platforms support pre-emptive signaling, i.e. sending a signal from one thread to another thread (via registering a signal handler using the POSIX sigaction() function, and raising signals via pthread_kill() function).

I think for most platforms, Boehm uses POSIX signals in its "stop-the-world" implementation for multithreaded collection.

I authored the original implementation of pthreads to Emscripten, along with the cancellation points feature, which uses a similar cooperative polling model. Though the vocabulary in that standard allows one to be much more coarse about where such cooperative cancellation point checks should be placed, since the pthreads library was standardized without an expectation of pre-emptive signaling to be supported in hardware.

From that background experience, I have been considering implementing a full cooperative signaling support for Emscripten, with the idea that codegen would emit cooperative checks for these kinds of flags "every once in a while". (whether there is one global process wide "GC now" flag vs thread-local "do-I-have-a-signal" flags to pause each does not complicate the story much)

However the realization there is that while such cooperative signaling would definitely work already today (with shadow stack management), it is much more demanding to implement in terms of the bloat it adds to generated wasm code size, and the degradation in runtime performance that results.

This is because we want to utilize this type of signaling feature also in real-time 3D rendering contexts, so the aspect of properly timing every now and then becomes critical. When we want to pause the other threads, all threads should respond well within a few milliseconds to do so, so that the overall application rendering will not stutter due to dropped rendered frames. (A typical frame takes about 8ms-16ms to render, so stalls longer than that due to "reining in" all the background threads would be detrimental).

To make such requirements possible to meet, it is likely that all loop back edges and entry points of large enough functions should emit a check for the cooperative signaling, so that it is not possible for any single thread to temporarily "get lost", and the more threads there are, the slower "gathering the herd" for safe garbage collection will run.

In better scenarios, when a thread sees the flag it simply runs the "scan and mark roots" function, reports that it has done so, and then keeps going; the GC then runs in parallel after all threads have reported in

I am not aware that Boehm would allow background threads to continue running immediately after they have reported in, even if other threads haven't. Though even if that was the case, the original issue remains that the initial stopping the world would be most preferred to happen quickly for the collecting thread (which is typically the main thread).

Overall, large wasm applications are moving to multithreaded Wasm in the future, so if there is a feature baked in to wasm for helping GC, it would be great to see something that would have a path towards enabling multithreaded Boehm as well, in addition to helping singlethreaded GC?

Fwiw, after reading the meeting notes, as a potential data point to inform a potential proposal, with Wasm GC still in the pipeline AS currently bundles a mark and sweep GC with two possible modes:

Incremental: With a shadow stack in linear memory, so the GC can step incrementally while the program is executing
Minimal: Without a shadow stack, with manually invoked collection when the stack is know to be unwound

The GC is precise, only putting values of locals containing actual references onto the shadow stack. Marking visits all globals containing references, plus the values of locals containing references. The former uses a generated function aware of which globals contain references, the latter iterates over the currently relevant region of the shadow stack. From there on, there is one generated function per unique class aware of which memory addresses (containing references) given a pointer to an instance to visit. Maintaining the shadow stack requires instrumenting any function that uses a local containing a reference, when entering decrementing the stack pointer by a pre-computed frame size (and zeroing the respective frame in memory since the GC is precise), when returning, throwing or leaving incrementing it again, plus spilling the values of the respective locals. One particular annoying aspect of implementing this manually is in combination with exceptions, which might roundtrip to JS so, in the best case, require a bunch of try-catch_alls around externally callable functions to fix-up the stack pointer when a function is left early (because the landing pad, which would otherwise fix-up the stack pointer, might not be under the generator's control). Mentioning in case such an existing implementation can help to inform the design, but I should note that AS is ultimately eying the GC proposal instead.

Thanks, @juj, for the great research, explanation, and analysis! And sorry for my delayed response.

I agree with your analysis of cooperative signaling. Unfortunately, with the constrains wasm has to work under, I can't think of a great way to make preemptive signaling work for this (even if wasm had support for preemptive signaling). The issue is instruction sequences such as local.set $local_reference (call $return_some_reference). If the preemption happens between these instructions, the $scan_for_gc_roots operation will miss the reference returned by $return_some_reference as it's no longer in the frame for $return_some_reference and it hasn't yet been assigned to $local_reference. This isn't just an instruction-design issue, it's an issue at the assembly level too.

Conceptually, if this happens, the returned reference can be in some register when the preemption happens. With native signals, the signal handler running the root scanner can examine every register and every value on the stack. With WebAssembly, we can't do that for a variety of reasons. And, because the process suspending the thread in order to run the signal doesn't know which registers are "marked", there's not a great way to examine just the marked registers (or make sure marked registers get put into marked stack slots when suspending the thread). I guess it's possible to maintain a code-address map that says "if the pc was here then these registers are marked", but that would likely have to be a very fine-grained map, taking a lot of space (and creating a lot of burden to generate).

So, while I wouldn't say it's impossible to make preemptive root scanning work, given what exists now in wasm (e.g. no preemptive signals) and the effort this would require, I suspect we're stuck with cooperative signaling for root scanning (which, as you say, does seem to work for this proposal as is, albeit by requiring many frequent check-in points). But I'm definitely interested in hearing suggestions if you have them!

Thanks, @dcodeIO, for the detailed explanation of what AS is doing right now! I hadn't thought about the issue with unwinding, which does seem to add even more baggage to shadow stacks.

I can't think of a great way to make preemptive signaling work for this

I suspect we're stuck with cooperative signaling for root scanning (which, as you say, does seem to work for this proposal as is, albeit by requiring many frequent check-in points)

Both the manual stack marking (as per https://github.com/emscripten-core/emscripten/issues/17131 ) and cooperative signaling do "work" today without any changes to the wasm spec, though I still think that both will be disk size and performance prohibitive to be deployed on live web sites (manual stack marking less so than cooperative signaling).

We are having conversations at Unity about spinning up a project to prototype what the associated costs here are exactly, to back the results with numbers - though implementing these into Unity will not be an easy undertaking.

Solving multithreaded stack scanning will be a completely different scale of problem than enabling just singlethreaded stack scanning. Whatever new language features would be added, would be good to be designed with (eventual) possibility of working in multithreaded programs as well.

We are having conversations at Unity about spinning up a project to prototype what the associated costs here are exactly, to back the results with numbers - though implementing these into Unity will not be an easy undertaking.

I certainly understand this would be a lot of effort. That said, these kind of numbers have previously had a big effect on getting features advanced and implemented.

I raised the issue about preemption at the in-person meeting. No one had any ideas to offer, but I came up with one recently. Let me run it by you.

At least V8, and I presume other engines, has a means to deal with interrupts. In particular, V8 converts an interrupt into a cooperative signal by setting a flag. That flag is checked at every function call and at every loop iteration. So my suggestion would be to propose a way to piggy-back that existing check. For example, a block instruction call_on_raised $flag instr* end could specify that $handler : [] -> [] should be called immediately prior to certain instructions (e.g. calls and the bodies of loops) if the shared global $flag : i32 is non-zero. Then, whenever an engine checks its flag from a place within a call_on_raised block, after doing its own work it can check $flag and call $handler if it is set.

Because the semantics would be to call $handler at certain instructions, someone using this proposal would be able to ensure values are in the appropriate marked locals at those cites, addressing the aforementioned issues with true preemption.

@juj Do you think something like that would address your needs?

the gc linear memory root marking is interesting for wasm i already run net wide memory grids in the browser based on concept of persistent loaded memory and throwing away like gc what is not loaded i designed a new kind of ffi to make that hapen where i do not do ffi i throw in a empty context and use it like shared memory let the function do its thing on the context directly so like working with handels and do not transfer data that leads to a net wide in memory grid program able via import export operators in ECMAScript i call that web 4.0 as it does not need html

the linked isolates and context elements like document implementations do supply me with independent context objects where i can map

on the v8 stack every context builds a simple stack that gets a context security id see the mksnapshot format

if you use only streams they solve well the problems of locking and mutx

so to map the shared memory of two wasm modules it looks like having 4 files 2 modules that you import and the 2 modules do each export a wasm instance then you can link them in memory as they would be already aligned in a 3rd context where you import that 2 modules.

i did a lot of compiler research and audits of the binary snapshot results of the engine and in general we are already there. as we work with c pointers anyway on the low level

I have now had time to implement an Emscripten/WebAssembly-specific multithreaded stop-the-world mark-and-sweep GC as a test project: Emgc

It is rudimentary, and only now tested in small synthetic tests, though there are some observations that I've gained from implementing this.

1. Prohibitive disk size of pointer spilling?

Currently Emgc relies on custom pointer spilling pass in Binaryen Wasm optimizer, that spills all wasm i32 locals on stack right before any function call. The pass is overconservative, and there are some optimizations that could be done, e.g. by transitively analyzing which function calls can lead to garbage collection. Although I am getting a feeling that even with optimizations, emitting manual i32 spilling in user code will be prohibitive on .wasm disk size. Conclusion here is that this Linear-Memory GC-Root Marking proposal from @RossTate could really help avoid all of it.

2. Cost of `marked-local`s?

Then questions about this Linear-Memory GC-Root Marking proposal come to mind: what would the expected costs be for a local to be a marked-local in a VM that implements this proposal?

Note that currently, the Binaryen pointer spilling pass is not able to distinguish any managed pointers from native pointers, or even any pointers from integers, leading to lavishly exaggerated over-spilling of all i32s over to the LLVM stack.

Would the disk size cost of a marked-local annotation under this proposal be negligible? How much would marked-local flagging affect the .wasm disk size if the producer was very over-conservative and flagged *all* i32s in *all* functions?
What happens at runtime in a Wasm VM implementation for marked-locals? Would VMs be looking to implement those under the hood as a similar (eager) shadow stack spilling? (so each marked-local has a perf cost even when not GCing)? Or is there a way to implement this proposal as a no-cost thing, and all the work would be concentrated only in the marked-local scanning function?

The observation I find here is that no matter which solution we'd go with: user code stack spilling or this proposal, both would benefit from accurate pointer vs integer identification in LLVM/Binaryen toolchains. Or rewording in another way: this proposal would not be able to remove the need to develop pointer vs integer identification into the compiler toolchain?

3. Gathering-the-herd overhead

I am implementing a naive cooperative GC checkpoint in all loops into a Binaryen. This pass emits calls to a multithreaded sync checkpoint that starts a GC mark-and-sweep in all managed worker threads when needed.

This works out well, although the observations are quite similar as with user code pointer spilling: the code disk size costs are noticeable. I don't have exact data yet on large codebases, I am working on that. Though the expectation is that the code disk size impact will be to a detriment.

A cheap(?) way to mitigate this kind of code size problem might be to spec a boolean identifier to attach to specific wasm functions, that would tell the wasm VM to codegen the functions with this kind of a GC check inside the loops on-the-fly. Then the disk size impact would be minimized. The details would need to be expanded; this is assuming that this kind of cooperative polling type of gathering-the-managed-thread-herd would be judged to be good enough to run at performance in real world scenarios.

4. The Sleep Slicing problem

Ignoring the code size impact of gathering the herd for multithreaded GC applications, there is another somewhat nasty problem that I find associated with it, and that is futexes&sleeping.

The problem here is that if a managed thread has GC pointers on its stack and it would like to go to sleep to wait for a futex, e.g. to receive a task from a task queue to work on; if implemented in a straightforward fashion, it might never wake up on demand to join the GC herd.

To avoid this problem, all such futex/sleep code will need to be time sliced to small quantas. But this is really gross, and becomes a hard problem of choosing what the right time slice quantum should be, a tradeoff between wasting joules of energy vs chasing responsiveness and performance - an especially tricky situation for games on mobile devices.

Modern game engines have lots of mostly-dormant background worker threads in them for specific tasks, and then there's that heavy main pool of actively running worker threads for the game code. (pool the size of # of logical cores on the system). In short, there are way more threads in a program than logical cores. It would probably be rough to have all those background worker threads to be constantly pulsing to check if they need to gather into the herd, since that would mean that they would take away logical core residency performance from the thread pool that is doing actual work, even when no GCs are running.

In the emulated stack spilling code, I realize that I could avoid this kind of sleep slicing by decorating futex waits with a "release my stack to someone else to mark, I'm going to sleep" type of construct. I.e. before waiting on a futex or sleeping, the code would hand off its own stack as a root into some global data structure that some other GC thread will then mark. The worker thread could then futex/sleep for the full non-timesliced duration, and after that pull their own stack back from that global structure and resume.

That would enable the thread to futex/sleep for the full duration, without needing to slice the wait up and burn joules, very nice for battery.

However, such a solution would not be compatible with this Linear-Memory GC-Root Marking proposal. Under this proposal, every worker can only scan their own stacks, other workers can't scan the stacks on other workers' behalf.

So under this proposal, it would require threads to either slice up futex/sleep time periods so they'll be responsive to gather in the herd, or they would need to eagerly scan their stacks before going to sleep - which also seems like a hard ask.

I wonder if this proposal would be somehow compatible with a "hand-my wasm locals off to another Worker to traverse" type of idea? Then some other Worker could run enumerate-marked-locals on the sleeping Worker's behalf? Or would there exist some other construct that would essentially achieve the same? The key here would be to have a solution that would let worker threads futex/sleep for their full durations instead of pulsing?

This is awesome stuff, @juj! Thanks for the massive effort!

However, such a solution would not be compatible with this Linear-Memory GC-Root Marking proposal. Under this proposal, every worker can only scan their own stacks, other workers can't scan the stacks on other workers' behalf. ... Then some other Worker could run enumerate-marked-locals on the sleeping Worker's behalf?

Yes, I was worried about this too. Part of why I haven't pushed hard on this is because I saw the further-ahead stack-switching proposal would need to be designed to be amenable to this, so I shifted my priorities there for the time being. Without that infrastructure in place, this proposal will be stuck with having to wake up a worker in order to traverse it, which seems likely to be impractically inefficient.

Another issue that has been raised with this design is that it forces the inspectable-stack-frame size to stay the worst-case size across all calls. So if a lot of inspectable locals need to be live at one call, they have to stay inspectable across all other calls as well. This may not be an issue for C/C++ programs, which tend to have shallow stacks, but languages that encourage lots of recursion need to be conscious of their stack-frame size.

WebAssembly / design