WebAssembly / gc

Branch of the spec repo scoped to discussion of GC integration in WebAssembly
https://webassembly.github.io/gc/
Other
1k stars 72 forks source link

Sketch: Linear memory GC with IT #78

Closed aardappel closed 2 years ago

aardappel commented 4 years ago

This sketches an alternative way in which GC could in theory work in Wasm. It is probably crazy, but bear with me, it may have some merit.

NOTE: this is NOT meant to be a "proposal", it is merely a sketch of an idea with most details left open, that puts emphasis on different advantages compared to the existing proposal(s), intended for discussion. If some of these aspects are deemed important, then maybe they can inform future GC features.

I will focus on these aspects of what would make for a good GC provided out of the box by Wasm:

  1. Allow modules to have access to an industrial strength GC at zero code size cost.
  2. Allow different languages (and the host) to exchange data easily.
  3. Keep cost of GC as local and scalable as possible.
  4. Allow existing language runtimes/codegen to interact with the Wasm GC.

This proposal keeps (1), arrives at (2) thru different means, and improves upon (3) & (4) compared to the existing GC proposals.

The basics of the idea are simple: A linear memory program/runtime assigns certain 64KB pages to the Wasm GC. The Wasm GC manages this memory as it sees fit. The linear program can address this memory, but only has minimal guarantees on what layout to expect. Inter module communication uses Interface Types, not anyref. A special kind of i32 provides exact roots.

Now, the details. This trivially retains (1), so let's tackle (2) & (3) first:

Inter-module data exchange.

The existing GC proposal(s) assume it is desirable that different modules can share references to GC objects freely, without incurring the cost of copying, and with the benefit of being able to collect inter-module cycles trivially. There are however issues with this approach:

So instead, we'll use Interface Types. The potential for Interface Types (with Shared-Nothing-Linking) to revolutionize how we build software out of parts, and how we can work with different languages is huge. If most interfacing with GC language modules is going to go over IT, we might as well go all the way, and get some of the benefits this would bring:

Now for (4):

Runtime interaction.

The vast majority of current language implementation come with some form of "runtime", usually written in C or C++. Besides the GC algorithm (which we're trying to save them from having to ship), this can include other support functions that implement the runtime semantics of the language, implementations of the built-in functionality, and general standard library implementations. Typically all that code will access (and create) GC objects directly. Under the existing GC proposals all this code would need an almost full rewrite from C/C++ into a language that is Wasm-GC aware (i.e. can emit Wasm-GC object access/creation opcodes directly from the language, to be efficient). There may be ways for this to retrofitted on to C/C++ (using intrinsics?) but how this would flow all the way thru the LLVM pipeline is an open question. Even if we had this functionality, this would still require a significant rewrite of a language runtime.

Will language implementers go thru such a rewrite? Or will they throw in the towel, and port their existing GC to linear memory, not using a standard Wasm GC at all?

There is generally the question of how well a standard generic Wasm GC will cater for all languages. It is well known that many languages have intricate GC designs, whose functionality intertwines with the language's promised semantics. Arguably, for the runtime to be able to interact with a Wasm GC directly in linear memory would expand the possible cases where such a GC can be used, in favor of porting an existing GC. Given that the GC is per-module, it can possibly be parametrized to fit the language, rather than collecting across languages generically, which means it cannot take anything about the language into account.

How would this GC/runtime interaction go? First, the linear memory program needs to assign pages to the GC, which it could do either manually, or we could also imagine a more managed system where the upper part of memory is always for the GC (such that the GC could grow it independently of the linear program, though possibly making memory.grow more expensive if the GC memory needs to be moved). I am sure there are many possible ways to do this, for now assume there exist linear memory addressable GC pages.

The GC would need to know the roots outside of its GC space, which I imagine would come in two forms: references on the Wasm stack, and optionally from a table managed by the linear program of additional roots. These roots can be used by the linear program as naked pointers (i32), so wouldn't be an anyref. To help the GC be exact (non-conservative), we'd introduce a new (sub)type of i32 that, when present on the Wasm stack or a table, would be a root (let's say i32r). This type would degrade to i32 when stored in (unmanaged) linear memory, and would then not guarantee the pointee to be retained (it is a "untraced ref" that can only be used if the linear program guarantees there is at least one active i32r on the stack/tables or i32 in the GC pages).

Existing load/store ops can be used to access these objects, using either i32 from linear memory or i32r from the stack. We'd define what a GC struct and an array must look like in the GC pages of linear memory, which given their low level nature, doesn't sound too much of a restriction to GC implementation (any additional data a GC wishes to store can be at negative offsets).

The GC makes NO promises as to the state of memory outside of well defined parts of structs and arrays, much like it is already undefined behavior to even access memory outside of a malloc returned block in C/C++.

i32r values can possibly be rewritten by the GC to point elsewhere, invalidating any "untraced ref" copies in linear memory. For this to function there needs to be a minimal level of cooperation between the linear memory program and the GC as to when these changes may happen. How, TBD.

There is of course the ability for the linear program to access GC memory, and even corrupt it. This does not seem like a problem from a language point of view: almost all programming languages "trust" their runtime code already, and all have the potential for bugs in said code to disrupt the otherwise safe execution of the language. Given that this is now isolated to a single module, the impact of this is smaller than if there was a bug in a Wasm runtime, affecting all languages within.

It does have the downside of requiring a slightly more conservative Wasm GC implementation, i.e. i32s loaded from linear memory will need to be bounds-checked, whereas the existing GC proposals can directly treat anyref as a pointer. I think this is a small cost compared to the potential for optimisation this new system otherwise brings (see above). It generally goes along with the existing Wasm philosophy of intra-module trusted, inter-module untrusted which IT builds on, and which existing GC proposals complicate.

A final problem may be non-determinism (but only for "buggy" programs): access to GC memory outside of designated objects could result in different data under different Wasm implementations. We'd have to specify that this is undefined behavior?

kripken commented 4 years ago

I think this is very interesting! I like how it leaves cross-language stuff to Interface Types, and how it might help port VMs.

To make sure I fully understand: is the idea that the compiled VM, when it allocates GC objects, allocates them in these special GC pages. Then when the wasm VM does a GC, it traces from the roots through objects in those pages, using a layout that is agreed upon (I assume that's what you mean in "We'd define what a GC struct and an array must look like in the GC pages of linear memory"). Then it would somehow inform the compiled VM of which objects can be freed, but not touch memory itself?

If that's generally correct, it sounds good to me!

Some questions:

RossTate commented 4 years ago

This is definitely an interesting idea, but after playing with it for a while it needs some fixing up.

First of all, it's not enough to just know the roots. You need to be able to walk the gc-allocated objects and determine what is transitively reachable from the roots. But to do that you need to be able to distinguish references from integers. So the first change is that you need all of this gc memory to be i31rs so that the gc can reserve a bit to distinguish pointers and data. But even that's not enough, since you're relying on adding integer offsets to references. So the engine needs to know, given an i31r, what range of addresses it points to. The simplest way to do this is to allocate arrays of i31rs, rather than preset page sizes, and have an i31r always reference the head of the array. Now you have something that's precisely garbage collectable. Furthermore, it no longer needs to be the case that a reference is an integer, so its semantics can be deterministic.

Of course, there's a big downside, which is that i32/f32 fields would have to be split across two i31r fields, and similarly i64/f64 would have to be split across three i31r fields, though one could consolidate the overflow bits of these fields into one i31r field via some bit arithmetic. A smaller downside is that it's no longer possible to support efficient CAS on i32 fields, which a number of concurrent data structures rely on (if ever we get to that point).

Given that, an alternative design is to, at each allocation, specify which indices in the array are references and which are i32s. In addition to storing the size of the array in some header info, the GC could store a bitset tracking this invariant. Then whenever we access/mutate an index in a mixedref, we specify whether it is supposed to be an i32 or a mixedref index, and the GC dynamically checks this expectation with the bitset in the header.

In both these designs, we're stilling using integers to index into things like the funcref table or the "external reference" table, so it will still be impossible to detect cycles across these different kinds of references. Similarly, we'll have to add an i31r/mixedref table for anchoring references that linear memory has a handle on (via an index into this i31r/mixedref table).

How do those patched designs sound to you @aardappel?

skuzmich commented 4 years ago

Inter-module data exchange.

Couldn't all this (local GC scope, API stability, leakage prevention) be replicated in current proposal by just not sharing references across modules? I feel like you can have all the benefits but without field access bound checks, and with ability to not use this restricted mode and share objects away.

Runtime interaction.

On the flip side, there are a lot of language implementations that don't compile to machine code ahead of time. They use built-in mechanisms of virtual machines like JS or JVM to allocate memory for them. They would have to start maintaining C/C++ runtime if you require them to deal with linear memory.

lars-t-hansen commented 4 years ago

I've been thinking about this problem too and I'm glad to see I'm not alone :-) Speaking narrowly about the GC implementation bits, some attempts to summarize and comment:

(The use of the wording "weak ref" is also unfortunate; "untraced ref" might be better.)

Given that we store i32r as i32 there will need to be some kind of downcast functionality for i32 -> i32r (whether a cast instruction or something embedded in a load instruction; it amounts to the same thing). Thus it is possible for some references to live as non-references in locals / on the eval stack for stretches of the program; this ties into the need for a robust safepoint mechanism.

RossTate commented 4 years ago

Will language implementers go thru such a rewrite? Or will they throw in the towel, and port their existing GC to linear memory, not using a standard Wasm GC at all?

To bolster @skuzmich's point above, it is unrealistic to expect this code to work in wasm. Much of the complexities of run-time systems have to do with features that wasm either doesn't support or takes over, such as JITing or dynamic loading of code. Even your lower-level proposal here doesn't support the functionality I've used for my toy research languages, so I'd have to make a separate wasm implementation for them anyways rather than our current implementation using LLVM.

GC being a per-module thing may have great performance benefits.

From what I understand, y'all want wasm modules to be able to have references to JS/DOM values. Once this is possible then you now need a multi-module GC. Similarly, if there is any way to import/export function pointers that are closures, then again you need multi-module GC.

Each module can have a lot of freedom in designing (and changing) their data layout, since there are no constraints that we'd try to match something other languages are doing for exchange reasons, or having to stay compatible over time.

This seems to be more of an argument against anyref than multi-module GC.

Emphasis on copying rather than sharing in API design will in general result in simpler, more general APIs.

Yes, but we should not throw out sharing of mutable data structures or large immutable data structures altogether. For example, with GC it would be easy for WASI to provide an mmap function that returns a GC array of i32s, which msync then flushes any mutations of to file.

Note that none of these comments are about the proposed design sketch; I'm just trying to clarify the motivation here.

aardappel commented 4 years ago

@kripken Yes, that is mostly correct, except that the "free-ing" is something that the Wasm GC can do itself, since it is entirely in control of what gets stored where in the GC pages. Thus the linear program doesn't need to be informed.

Your Qs:

aardappel commented 4 years ago

@RossTate I guess I wasn't clear in that I was assuming we'd have typed definitions of structs and arrays similar to the current proposals, such that once the GC knows it is looking at a particular type of struct in the GC pages, it also knows how to distinguish between refs and scalars inside of it (see also my reply to @kripken above).

What you are talking about would be a design where we don't have to rely on such definitions, but the GC would work generically on blocks of N refs (and M scalars, possibly always in that order) which could be even more flexible, especially for more dynamic languages that may decide how to layout an object at runtime. This may well be superior given that we can't use static type information entirely statically anyway. Minimum header information would be storing N and M, and it wouldn't have to distinguish between structs and arrays?

I generally like this direction, yes, but I'll need to think about the consequences some more.

aardappel commented 4 years ago

@skuzmich Yup, if from the above discussion we conclude that "local GC has many benefits over global GC", and we enable that optionally, by default, or even always in the existing proposal, that would be a great outcome.

Re: not compiling to machine code ahead of time.. Wasm requires that currently (we have no known viable JIT path yet). Every Wasm program will always have some linear memory, and I don't think "allocate and manage an object for me" is much different in the existing GC proposal or what I am sketching here.

aardappel commented 4 years ago

@lars-t-hansen:

(I'll edit the original post to use "untraced ref")

aardappel commented 4 years ago

@RossTate

To bolster @skuzmich's point above, it is unrealistic to expect this code to work in wasm. Much of the complexities of run-time systems have to do with features that wasm either doesn't support or takes over, such as JITing or dynamic loading of code. Even your lower-level proposal here doesn't support the functionality I've used for my toy research languages, so I'd have to make a separate wasm implementation for them anyways rather than our current implementation using LLVM.

Essentially you are saying languages will need a rewrite of their runtime anyway to target Wasm? And what would you write that new runtime in? (since it needs some combination of minimum overhead while still being GC instruction aware).

If you're right, then indeed the argument for direct linear program <-> GC interop goes away. It is my impression that it will not be so easy to replace existing runtimes, however.

From what I understand, y'all want wasm modules to be able to have references to JS/DOM values. Once this is possible then you now need a multi-module GC. Similarly, if there is any way to import/export function pointers that are closures, then again you need multi-module GC.

It is all about defaults. If even the smallest objects in each module, and each language, and the host can all arbitrarily refer to eachother, then yes, you need a global GC for cycles to not become a huge pain. I feel this is the wrong granularity of sharing though, and thus the wrong default. If the default is (semantic) copying, and the vast majority of cross module exchanges happen this way, then the opportunity for cycles we're not aware of is vastly reduced (if cross module refs are something you need to explicitly opt-in to). We still need such refs, especially for big expensive objects like textures etc, but these would be much more explicitly managed. Giving even the smallest struct this power is lifting the pains of sharing and leaks to the level of a many-module soup, where it is hard to know who is responsible for the problems.

Yes, but we should not throw out sharing of mutable data structures or large immutable data structures altogether. For example, with GC it would be easy for WASI to provide an mmap function that returns a GC array of i32s, which msync then flushes any mutations of to file.

Agreed (see above).

aardappel commented 4 years ago

From feedback I've gotten so far this idea really contains 2 parts, which could really be considered:

  1. The idea that each module and/or memory having its own separate GC space (with exchange of data across them going almost entirely through Interface Types) has advantages over the assumed global GC of current proposals.
  2. If you have (1), it now becomes feasible to expose GC memory as part of linear memory, allowing runtime linear memory code to interact more naturally with the GC.

It seems so far that (1) may have more merit, and that (2) has some tricky issue to solve, and potentially some bigger downsides as a consequence.

It may be worth discussing these separately, but for now maybe it is enough to be aware that (1) does not need to imply (2).

skuzmich commented 4 years ago

@aardappel there are different types of WebAssembly ports with vastly different needs. From what I see, there are two sides of the spectrum of approaches that can pursued by language implementers when targeting Wasm:

1) Port the whole native language platform. This would allow to run existing libraries and applications as is. Most language implementations would pay some overhead because they were not designed for Web requirements of small code size, soundness validation on client side, and JS interop. Implementations would not want to rewrite runtime, including their own implementation of GC. Shipping the actual GC code may not become the major problem compared to the size of the rest of runtime. They would use the same low level tool-chain, like LLVM, as they do for native code.

2) Adapt to Web platform by sacrificing some of the original platform features. Only very portable or conditionally compiled code can be shared between original platform and Wasm. Runtime library would be small or nonexistent. GC would be used everywhere instead of linear memory (maybe except some static data). Instead of LLVM they would use code generator with appropriate level of abstraction (Binaryen?).

In case of Kotlin, we have a partial implementation of (1), but we are moving towards (2). In the new prototype, compiler backend is rewritten to target Wasm specifically. Small runtime library is implemented in Kotlin itself with some Wasm-specific intrinsics.

RossTate commented 4 years ago

And what would you write that new runtime in? (since it needs some combination of minimum overhead while still being GC instruction aware).

This is something I have been wondering about. See below for one thought on this.

If you're right, then indeed the argument for direct linear program <-> GC interop goes away. It is my impression that it will not be so easy to replace existing runtimes, however.

I wouldn't say it goes away. My sense is that it is moved. There's no reason linear memory and GC cannot interop in either of the GC proposals. Wasm can have both an i32 and a ref $scheme on the stack, and one instruction can use the former to index into linear memory, followed by another instruction that can use the latter to lookup some field. A program might need some table of gcref $schemes to store some references in a way that is indexable by i32s, and even that could just be a global gcref variable pointing to an array of ref $schemes, but that's about all that has to be added to wasm itself to enable interop. The real issue is generation and tooling.

What would be great is for someone to make a C/C++ library of macros or primitive types/operations that get translated down to new LLVM types/operations and then to GC types and operations. Then I could write C/C++ code that actually operates on GC references. If I'm careful to only use variables on the stack, I could even make the compiled code not need linear memory.

So my sense is that your point (2) is really a tooling problem and not a gc-wasm problem.

tlively commented 4 years ago

What would be great is for someone to make a C/C++ library of macros or primitive types/operations that get translated down to new LLVM types/operations and then to GC types and operations. Then I could write C/C++ code that actually operates on GC references. If I'm careful to only use variables on the stack, I could even make the compiled code not need linear memory.

This is something we've been thinking about. At the very least we will want to make it possible to write .s assembly files that use GC types and instructions in functions that can be passed to clang and linked with normal C/C++ code. If we can solve the problem of modeling GC types in LLVM IR and other layers of LLVM, I would also like to expose as much of the proposal as possible as source-level intrinsics and primitive types. But I would rather not make changes to LLVM IR itself, for example by adding new types, so this is going to be an interesting problem.

Whether interop requires hand-written assembly or calling intrinsic functions, there will still be significant porting effort required.

aardappel commented 4 years ago

@RossTate

So my sense is that your point (2) is really a tooling problem and not a gc-wasm problem.

If you're just loading a the value of a field, and that code has to be hand-crafted anyway since there is no way to do it in unmodified C/C++, then whether it is done with an existing load instruction to linear memory or a new instruction to outside linear memory doesn't make a lot of difference, indeed.

But you can do a lot more when things are linear memory addressable, including passing on (parts of) that memory to existing C/C++ code. New field loading instructions would have to create/allocate copies in linear memory to allow that, and while copies at the granularity of module boundaries is probably mostly fine, a lot of runtime code is extremely performance sensitive. Besides efficiency, copying also loses identity/mutability which may matter in some cases.

To me the biggest difference is thus not in tooling, but in the runtime. It may also make tooling easier, for example in the existing proposal the tools will have to guarantee that under no circumstances an anyref may land in linear memory, which will be challenging given LLVM's design. This can be somewhat relaxed if the GC is in linear memory, though there are challenges there too.

RossTate commented 4 years ago

I put together a primer on low-level stack and control-flow primitives. One of the first examples is how to support marking of GC roots, in case y'all are interested. WebAssembly/exception-handling#105

aardappel commented 4 years ago

To be clear, what @RossTate is proposing in https://github.com/WebAssembly/exception-handling/issues/105 would allow more efficient linear memory GC with a GC written in Wasm itself, whereas the current issue allows more efficient linear memory GC with a GC provided by the engine. Both ideas share reasons why having GC objects in linear memory might be desirable (access by the language runtime, etc).

The diversity of current GC implementations and language implementations is enormous, so I personally think both these options are worth pursuing, and possibly providing in parallel. Offering more options to language implementers is what will make the Wasm eco-system thrive.

aardappel commented 4 years ago

One thought on forgeability of GC roots: The current issue proposes that roots on the stack/local would be exact, but since GC linear memory is accessible by regular stores, a GC would need to do bounds/type checks to perform an actual GC. This favors the linear memory runtime having unfettered access and thus simplicity of implementation.

A different tradeoff is however possible: make the GC linear memory pages readable by the Wasm code, but writeable only by special new instructions. This would allow the GC to operate at full speed without checks. The Wasm code can perform stores on arguments directly from stack/locals (which are exact), or there may be a special cast instruction i32 -> i32r that does bounds/type checks, so the cost of this is isolated only where it is needed (when the Wasm program temporarily feels the need to store roots in linear memory, which may be unavoidable since some of this depends on compiler code generation, for e.g. reference or struct arguments).

aardappel commented 4 years ago

In the interest of tomorrow's GC meeting, for those who do not want to read all of the above, here's a shorter summary of the salient points this issue is making.

First, as apparent by the discussion, there are 2 parts to this idea of which the first part can be evaluated independently:

Idea A: Having many modules written in many different languages communicate over interface types may well be superior to holding on to each-others references.

Idea B: IF we're going to have per module GC, there is the opportunity to make GC objects linear memory addressable.

If you want to discuss the above points however, please make sure to having read the entire issue before doing so :)

tlively commented 2 years ago

Since we're now at phase 2 with the current MVP, I'll go ahead and close this.