WebAssembly / gc

Branch of the spec repo scoped to discussion of GC integration in WebAssembly
https://webassembly.github.io/gc/
Other
996 stars 71 forks source link

Mini-MVP Design #73

Closed RossTate closed 2 years ago

RossTate commented 5 years ago

Here's the Stage 1 design I had in mind for the strategy in #72. I debated making a pull request describing it in more detail, but decided an issue would be better so that we're focusing on high-level aspects rather than syntactic details. This design expects typed function references to already be in place.

First, one specifies a gc-``table'' (in the general sense of the word) whose rows are of the form:

gcidx parent slots array-slots
<int> <gcidx>? <fieldtype>* <fieldtype>+?

This table is valid if, whenever a row has a specified parent, the parent's gcidx is smaller than the row's own index (which is its own gcidx) and the appropriate subtyping/prefixing of the slots and array-slots hold. (A row can have array-slots even if its parent does not.) The gcidx of a row is not explicitly specified but instead corresponds to the row's position in the table.

Second, the grammar of types is extended to include ngcref <gcidx>, which represents a possibly null reference to a gc-structure allocated by this module instance (where gcidx is an ancestor of the value's actual index), and which is only valid if there is a row in the table corresponding to gcidx. Subtyping is extended so that ngcref <gcidx1> is a subtype of ngcref <gcidx2> if gcidx1's parent is gcidx2. The grammar of types is also extended to include nullgcref, which is a subtype of ngcref <gcidx> for all gcidx. (To clarify, these new types are usable within the fieldtype's in the gc-table.)

Third, instructions are extended to include the obvious getters and setters (and indexed getters and setters as well as array length in the case of array-slots), explicitly indexed allocation of ngcrefs, the singleton nullref value, null-checking, and equality comparison. There are also instructions for casting/testing whether a given ngcref <gcidx> value in fact has a more precise runtime index (which must be a descendent of <gcidx>).

I think that's about it. Is it perfect? No. Is it easy to translate to something better when something better comes along? Yes. Does it impose any long-term backwards compatibility constraints besides syntactic support? Not really. Are most gc languages going to be able to compile to this and get reasonable/acceptable/tolerable performance? Running through the many candidates I know, it seems so.

RossTate commented 5 years ago

Now I'll focus on what I'm proposing to remove from the current proposal.

  1. No more "greatest fixpoint" sutyping. "greatest fixpoint" subtyping should be avoided if possible since its complexity is at least quadratic rather than linear.
  2. No more dynamic creation of run-time type information (e.g. no rtt type constructor). This isn't particularly useful (I believe even C# has moved to using size-polymorphism rather than dynamic monomorphization). And by making run-time types all static the gc can optimize how the various run-time types are represented in headers and how casts are performed. For example, if the hierarchy above a given gcindex is flat, as would happen for functional languages with algebraic data types, the gc can simply use an integer in the header to indicate which case the reference belongs to (rather than having each header point to an array indicating all the gcindices the reference belongs to and implementing casts by a length check followed by an array lookup). And more...
  3. No i31ref. This is a special case of a particular implementation trick that works well for a few languages but not for many languages, and it is a hack that is completely unnecessary on 64-bit machines and fails to achieve the techniques that would be appropriate for such machines. And having i31ref be a subtype of anyref prevents engines on 32-bit machines from using other useful boxing techniques, such as nan boxing or a mixed integer-float boxing technique used in various dynamically typed languages. Instead, my proposal enables the engine to secretly box references with suitably small contents. For example, on a 64-bit machine, an engine could box all references to 32-bit contents and use the other 32 bits to indicate the gcindex of the reference (and a flag to mark it as a boxed reference). Or on a 32-bit machine, an engine could partially box a particular gcindex for 32-bit integers, boxing the value when it is a small integer and allocating the value when it is a large integer. Alternatively, the engine could use nan boxing to partially box a particular gcindex for 32-bit floats. By not having gc references all be anyrefs, my proposal enables the engine to specialize representations as appropriate for the module at hand - one can add instructions for converting gc references to anyrefs and back that can be responsible for converting specialized representations to universal representations and back. (And yes, this means anyref does not actually denote any reference, which was a misnomer I warned about.)

Thoughts? There are many other differences, but these seem like the big conceptual ones.

rossberg commented 5 years ago

There are various problems with your proposal, but one reoccurring thing you are missing is that Wasm code is not limited to a single module, and that all the mechanisms need to work under an open world assumption where modules can freely interact and additional modules can be instantiated at any time.

No more "greatest fixpoint" sutyping. "greatest fixpoint" subtyping should be avoided if possible since its complexity is at least quadratic rather than linear.

At Wasm module boundaries, i.e., for linking, you cannot avoid a structural definition of type equivalence. Wasm implements strong modularity, where modules are described solely by their imports and exports. A module, by design, cannot reference (a type from) another module. Hence nominal types are simply insufficient there: to share a type across modules, a module has to import it, but since it wants to access it structurally, this import has to be transparent, i.e., described by its structure, which then has to be matched by the linking semantics.

And since you need structural semantics anyway, adding nominal semantics as well is just an extra complication (and not a trivial one, once you get into the details). It would also break modular de/compositionality if type equivalence was defined differently inside vs outside modules.

Then there is the whole producer-side problem of compiling structural types (including basic ones like tuples) into a language that requires making everything nominal. I honestly have no idea how you would practically make that work in a modular fashion. Nominal types very much force you into a monolithic or centralised architecture.

No more dynamic creation of run-time type information (e.g. no rtt type constructor). This isn't particularly useful (I believe even C# has moved to using size-polymorphism rather than dynamic monomorphization).

The Wasm type system can never be powerful enough to be able to derive all static knowledge that a compiler has about representations everywhere. Casts seem absolutely unavoidable as an escape hatch. E.g., all the examples in the Overview use some amount of casting. You can create an ever-more complicated type system that's avoiding one or the other case, but it is principally impossible to ever be able to support everything.

And with downcasts comes the necessity to somehow represent type information at runtime. And in a low-level VM you want explicit control over any extra space or time cost this involves.

A producer is totally free to use that in a way that resembles size-polymorphism, but then it will just push certain casts elsewhere. (Often it has to check casts that a native VM doesn't have to check, because the latter can rely on global knowledge that the Wasm engine does not have because it's producer-specific. That seems to be an important point that is often missed in these discussions.)

And by making run-time types all static the gc can optimize how the various run-time types are represented in headers and how casts are performed.

The engine has all the information required to do the same under the current proposal.

For example, if the hierarchy above a given gcindex is flat, as would happen for functional languages with algebraic data types, the gc can simply use an integer in the header to indicate which case the reference belongs to

No, it can't. Indices are purely static and internal to a single module. But casts may involve comparing types from different modules. Any RTT mechanism thus requires a global type store.

No i31ref. This is a special case of a particular implementation trick that works well for a few languages but not for many languages,

We had this discussion before, but that is still incorrect. Uniform representation via pointer tagging is used by many language impls (the vast majority of dynamic or polymorphic languages), and there is no efficient substitute. Most prominently, JS engines themselves use this technique heavily. Consequently, all Web engines can trivially support it.

and it is a hack that is completely unnecessary on 64-bit machines and fails to achieve the techniques that would be appropriate for such machines.

See previous discussions. It is not ideal on 64 bit, but also not worse. That's the price for portability. Either way, it's giving producers a choice. If they want 32 bit integers instead then they can simply define a single element immutable struct and let the engine unbox that on the architectures where it can.

And having i31ref be a subtype of anyref prevents engines on 32-bit machines from using other useful boxing techniques, such as nan boxing or a mixed integer-float boxing technique used in various dynamically typed languages.

AFAICS, it doesn't prevent any such technique.

Instead, my proposal enables the engine to secretly box references with suitably small contents.

That is still true in the presence of int31ref.

Or on a 32-bit machine, an engine could partially box a particular gcindex for 32-bit integers, boxing the value when it is a small integer and allocating the value when it is a large integer.

That would significantly increase the cost of uniform representations on 32 bit platforms, possibly prohibitively so, without giving producers any way to predict or control that cost. Predictable performance is an important goal for Wasm.

Alternatively, the engine could use nan boxing to partially box a particular gcindex for 32-bit floats. By not having gc references all be anyrefs, my proposal enables the engine to specialize representations as appropriate for the module at hand

No, an engine cannot usually optimise under a closed world assumption involving only a single module, see above.

jakobkummerow commented 4 years ago

It seems to me that this proposal could be seen as a group of "patches" to the current MVP proposal, and these patches could also be discussed individually, which might help make the discussion more tractable:

(1) Require static types to list their supertype explicitly (a.k.a. switch from structural to nominal typing).

(2) Restrict allowed supertypes to those that are defined earlier in the types section.

(3) Allow types to have both struct fields and array fields at the same time.

(4) Drop i31ref.

This proposal also includes some things that I don't think are actually different from the current MVP:

(5) IIUC, the "gc table" description can be seen as another way of looking at existing modules' "type sections".

(6) IIUC, "ngcref" is the same as the current proposal's "optref".

(7) IIUC, getting rid of RTTs as explicit objects would fall out quite naturally from giving static types nominal semantics, as the casts in the current MVP could then simply be expressed in terms of static types (and that's essentially #119 then). That said, I have trouble wrapping my head around what this might mean for various future extensions (when you type system experts use complicated terms like "<adjective> polymorphism" ;-) ), so I can't tell whether this would be a nice simplification, or a limitation we'll come to regret.

Btw, is there supposed to be a semantic difference between <fieldtype>* and <fieldtype>+??

My personal thoughts on the differences listed above:

re (1): As Andreas points out, we'll continue to need structural checks on the module boundaries for multi-module apps; so simple nominal typing would only be applicable to intra-module type usages. We would have to be careful to specify what happens at the module boundary in such a way that it's consistent with what happens inside a module. For instance, I could imagine a model where (similar to exported/imported functions) types that are used on the boundary are also given a name, and the engine would perform structural checking on such types with identical names, just like it structurally checks the signatures of functions with matching names.

re (2): I'm not sure why this is needed? Sounds like it's supposed to make validity checking more efficient/convenient? But if an engine did two passes: one to read the types, one to validate them, then I think it wouldn't care about order? Am I missing any considerations?

re (3): In the current proposal, this is postponed to be a post-MVP feature; I think there's generally wide agreement that we'll want some form of this eventually, and the biggest question is how exactly, e.g.: (a) just one repeated/array-like field at the end of a struct, a.k.a. "arrays with custom header". From an engine implementer's point of view, I like the simplicity of this; simplicity means little memory consumption for metadata and good execution speed. (b) any number of repeated fields (as the definition above seems to read)? Note that this would make instructions slightly more verbose, as all array-related instructions (get/set by index, get length) would have to specify which array field they refer to. (c) most general but also most complicated option: arbitrary nesting of object types, which in particular would give modules control to distinguish between "struct of arrays" [(a, a, a, ...), (b, b, b, ...)] and "array of structs" [(a, b), (a, b), (a, b), ...].

re (4): I find the argument convincing that many languages would benefit from having a way to store unboxed integers in a pointer-typed slot. I also find the argument convincing that module producers should have control over this behavior, which has the unfortunate consequence that we have to spec a number of bits that all engines can support for unboxing; but on the bright side this doesn't prevent engines from additionally unboxing other values when they can. I also find the argument convincing that some producers would prefer not to pay the cost of supporting i31ref in any eqref when they have no use for it. In summary, I think #130 might be an interesting compromise here.

the gc can optimize how the various run-time types are represented in headers and how casts are performed

I wouldn't overestimate engines' appetite for such things. Sure, you could do such optimizations if you hand-crafted the optimal scheme for a small toy example. But my impression from real-world engines is that they are so complex beasts anyway that they strongly prefer simplicity/regularity wherever they can get away with it, so I would assume that any engine picks one internal object model, and then maps all features onto that. It certainly makes sense to grant engines some freedom for their internal design choices, but I wouldn't expect a whole lot of custom per-module specialization going on. (Maaaybe if we get bored in ten years we'll get interested in squeezing out another 0.5% of performance by doubling implementation complexity, but I wouldn't bet on it.)

Generally speaking, I would suggest to treat (1) - (4) above as independent suggestions, and discuss them on separate threads (if there is anything else to discuss for any of them). Untangling concerns probably makes it easier to find agreement.

RossTate commented 4 years ago

Thanks @jakobkummerow for the thoughts! This was a (year-old) stab at suggesting a very minimal MVP for the sake of prompting discussion. (That discussion didn't happen, and we were advised to write up something more fleshed out to better facilitate one, which is part of the origin of the SOIL Initiative's proposal.) As you say, there are a number of orthogonal ideas being explored here, and per more recent advice I'm now trying to take the approach you suggest of considering each of these one by one (e.g. #119 per your item (1)). You brought up a bunch of valuable points and questions here though, so I'll follow up on those here so that there's suitable context.

Btw, is there supposed to be a semantic difference between <fieldtype>* and <fieldtype>+??

Hah, yes, but it's subtle and not at all apparent in the notation. <fieldtype>* uses prefix subtyping (so you can add more fields in a subtype), whereas <fieldtype>+? lets a subtype add array fields if none were present before (the ?), but cannot add more fields if there were already fields present before. I'm just clarifying what I intended to say here; I do not defend the notation 😅

re (1): As Andreas points out, we'll continue to need structural checks on the module boundaries for multi-module apps; so simple nominal typing would only be applicable to intra-module type usages. We would have to be careful to specify what happens at the module boundary in such a way that it's consistent with what happens inside a module. For instance, I could imagine a model where (similar to exported/imported functions) types that are used on the boundary are also given a name, and the engine would perform structural checking on such types with identical names, just like it structurally checks the signatures of functions with matching names.

Structural types and nominal types are each better suited for various purposes. Compatibility of module signatures is better suited to structural types. But those signatures declare heap types, which are better suited to nominal types. The fields of those heap types are then better described structurally.

One way to think about it is that nominal types name/abstract the fixed points in recursive structures and explicitly declare the intended subtyping relationships between these fixed points. Structural checking can then proceed using these declared subtyping relationships, and there is a structural check that these subtyping relationships are sound. This is the "staging" we refer to here, and it is known to let you express more complex structures.

re (2): I'm not sure why this is needed?

It's an easy way to ensure there are no cycles in the nominal subtyping hierarchy. In the current MVP, two distinct type indices can denote the same type. Restriction (2) ensures that is not possible, simplifying type-checking.

re (3): In the current proposal, this is postponed to be a post-MVP feature; I think there's generally wide agreement that we'll want some form of this eventually, and the biggest question is how exactly

Agreed on this summary. For now, changing (3) to be just <fieldtype>?, rather than <fieldtype>+? likely addresses the most immediate needs in a way that presumably doesn't commit us to any particular strategy for more complex cases.

(Remaining points are heard. My lack of response to them is just me not having anything particularly useful to add 😄)

jakobkummerow commented 4 years ago

Thanks, @RossTate , for the comments.

This was a (year-old) stab at suggesting a very minimal MVP for the sake of prompting discussion.

Yes, I'm aware that this issue is old; it was filed before I got involved with WasmGC :-) Given the current state of the various proposals and discussions, I thought it might be useful to get back to it.

<fieldtype>+? lets a subtype add array fields if none were present before (the ?), but cannot add more fields if there were already fields present before.

Out of curiosity, what's the reason for this restriction? (I'm not opposed, just surprised; I would have guessed that adding additional repeated/array fields would be just as legal as adding singular/struct fields.)

If I had to guess: you'd expect each array field's associated length field to occur somewhere in the struct-fields part of the object (as opposed to behind the previous array field, where a bounds check would have a harder=slower time finding it), so adding more of them would shift the offsets of the elements? That would make sense; and poses an interesting limitation to possible future "arbitrary object nesting" designs... But on second thought, that would mean that subtypes also can't add additional struct fields once an array field exists on the supertype. Would that be too much of a restriction?

Structural types and nominal types are each better suited for various purposes.

Absolutely; and one way or the other we'll want/need some of both.

Compatibility of module signatures is better suited to structural types. But those signatures declare heap types, which are better suited to nominal types. The fields of those heap types are then better described structurally.

I think we might mean the same thing here. Up front, I'd like to emphasize that I'm not trying to argue that we should switch to nominal intra-module typing, I'm just attempting to explore what it would look like if we did.

Let me try again to explain what I had in mind, along with the line of reasoning: i. Premise: purely nominal types on the module boundary are problematic because they would require a single definition that all other modules import them from, and we don't want to require such a "types authority". ii. This problem is solved by making cross-module type checking structural, then modules only have to define compatible types and everything Just Works™. (See Andreas' slide deck, slide 7 ff. for an example.) iii. This requires module authors to agree out-of-band on type definitions. (A counter-example would be: if one module uses i8 arrays as (one-byte) strings, and one module uses i16 arrays as (two-byte) strings (UTF-16 or UCS2 or whatever), then those modules' string handling functions would not be interoperable, regardless of structural or nominal types). It also requires authors to agree on function names (counter-example: if one module exports a "hash" function but the other wants to import an "md5hash" function, then the import will not work). iv. We should maintain the principle that each module defines all its types, including those used for imports and exports. v. To reconcile nominal typing (where, generally, two type definitions result in two non-equivalent types) with the fact that a given type is then defined in two (or more) modules, we could use names to indicate that we want these definitions to be identified as the same nominal type. vi. While this requires authors of cooperating modules to agree on names for their interoperable types, this additional requirement is not fundamentally different from requiring them to agree on type definitions and function names.

Example, similar to the slides linked above: with the current MVP, we could have two modules:

;; module A (type $ta (array i8)) (func $f (export "hash") (param (ref $ta)) ...) ;; module B (type $tb (array i8)) (func $g (import "" "hash") (param (ref $tb)))

This works because on import, $f's and $g's signatures are checked (structurally) for compatibility, because they both have the name "hash". The check recurses into structural comparisons of $ta and $tb. I could phrase this as: the name "hash" "identifies $f and $g as the same function". The idea I was trying to sketch is that we could maintain this behavior while switching to nominal intra-module typing by explicitly identifying $ta and $tb as the same nominal type by giving them a name as well. For types, there is no sense of direction as in import/export, so I'll use a different, symmetric keyword here:

;; module A (type $ta (publicname "string") (array i8)) (func $f (export "hash") (param (ref $ta)) ...) ;; module B (type $tb (publicname "string") (array i8)) (func $g (import "" "hash") (param (ref $tb)))

The engine could use this hint in order to do exactly what it did before with structural typing: it recognizes that we want $ta and $tb to be compatible (nominally equivalent, even), so it structurally compares them to verify that. From then on, in its internal global type registry, it can treat $ta and $tb as the same nominal type.

Does that make sense? Or is there a flaw in this reasoning?

re (2). It's an easy way to ensure there are no cycles in the nominal subtyping hierarchy.

Ah, yes, that makes sense.

For now, changing (3) to be just <fieldtype>?, rather than <fieldtype>+? likely addresses the most immediate needs in a way that presumably doesn't commit us to any particular strategy for more complex cases.

Yes, agreed. What I called "(3)(a)" could later be generalized to be either "(3)(b)" or "(3)(c)", so it sounds appropriate for an MVP (or early post-MVP version, if we decide to limit ourselves to struct xor array fields in the MVP).

tlively commented 4 years ago

To clarify, under this idea, the "hash" exported from A would not be able to be imported as the "hash" in B if $ta and $tb had different publicnames because $ta and $tb would be considered different types? So the only difference between this and a pure-nominal approach is that under this idea there is no "single definition" requirement?

i. Premise: purely nominal types on the module boundary are problematic because they would require a single definition that all other modules import them from, and we don't want to require such a "types authority".

I'm having trouble reconciling this with the design of the EH proposal, which does require single definitions for exception tags. In principle it seems that tags and types should use the same cross-module usage mechanism. That the single-definition requirement hasn't been raised as a big issue on the EH proposal seems to weakly suggest that it is not so problematic after all. Where does this design goal come from?

RossTate commented 4 years ago

i. Premise: purely nominal types on the module boundary are problematic because they would require a single definition that all other modules import them from, and we don't want to require such a "types authority".

This premise is the key contention. It comes from high-level languages, but it forgets that modules of high-level languages all link with one common low-level module: the runtime (and often the standard library). As such, it's easy enough for this common module to be the "type authority" for at least the primitive types shared across the high-level modules comprising this program. They need to link with some common module anyways for other host-generated "magic number" identifiers like exception events, as @tlively points out. And the only way to avoid coordinating on RTTs is to use rtt.canon, which has its own problems including being slower than simply importing the RTT from the common module as discussed here.

Where does this design goal come from?

Back when the GC proposal was first created, the expectation was that everyone would use structural types (statically and dynamically, i.e. casts were structural as well) and as such the GC proposal would be used for interlanguage data sharing. So modules could exchange pairs just by each using the structural pair type. But I pointed out (offline) both that structural casts would be very slow and that different languages represent the same high-level structural concepts differently and as such you could not expect Java pairs to be shared as OCaml pairs or vice versa. Eventually the casting mechanism was changed to a nominal approach to address the former problem and the Interface Types proposal was introduced to address the latter problem, but the core design philosophy was never revisited despite the significant change in goals.

ii. This problem is solved by making cross-module type checking structural, then modules only have to define compatible types and everything Just Works™. (See Andreas' slide deck, slide 7 ff. for an example.)

As an example, when slide 7 was presented I pointed out that it was completely unrealistic. Most language runtimes do not represent surface-level arrays as just an array i8; they often have at least a header of some sort (e.g. a v-table or a switchable tag or something). As such, no real system would be able to use the module on slide 7. If you want to make that module usable by multiple language runtimes, it would need to import an abstract type (which would be instantiated with the given language runtime's representation of arrays) and import an accessor (or field reference or member pointer) to the array data within the abstract imported type. This is module parameterization, and it both makes the module much more usable and makes it trivial to use nominal types for the abstract import rather than structural types.

It also requires authors to agree on function names (counter-example: if one module exports a "hash" function but the other wants to import an "md5hash" function, then the import will not work).

Nominal types, contrary to their name, have nothing to do with string names, and as such they have no effect on which exports are matched with which imports.


Hope I managed to make some sense. There's an abstract mathematical argument for all this as well, but somehow I don't think that would help clarify anything 😅

tlively commented 2 years ago

Since we've settled the MVP type section and moved to phase 2 and since a lot of these individual proposed changes have already been merged or are under discussion elsewhere, I'll close this issue. If there are additional components that could benefit from further discussion, please create new issues to discuss them individually.