WebAssembly / gc

Branch of the spec repo scoped to discussion of GC integration in WebAssembly
https://webassembly.github.io/gc/

Alternatives to i31ref wrt compiling parametric polymorphism on uniformly-represented values (OCaml) #100

Closed: sabine closed this issue 2 years ago

sabine commented 4 years ago

As i31ref doesn't seem to be a unanimously-agreed-on part of the GC MVP spec (see https://github.com/WebAssembly/gc/issues/53), I am very interested in discussing what the concrete alternatives to it are in the context of parametric polymorphism on uniformly-represented values. (I would appreciate it if the answer doesn't immediately read as "use your own GC on linear memory".)

To give some (historical) context: Why does OCaml use 31-bit integers in the first place? Generally, it is possible to have a model of uniform values where every value is "boxed" (i.e. lives in its own, individually allocated heap block). Then, every value is represented by a pointer to the heap and can be passed in a single register when calling a function. A heap block always consists of a header (for the GC) and a sequence of machine words (values). From an expressiveness standpoint, this is fine. However, when even simple values such as integers are always boxed (i.e. require a memory access to "unbox" them), performance suffers. The design constraints for the representation of unboxed integers were: a) being able to pass unboxed integer values in a single register, b) giving the GC a means to distinguish (when crawling the heap) whether a value in a heap block represents an unboxed integer or a pointer to another heap block, and c) being as simple as possible for the sake of maintainability. In OCaml, the compromise between performance and simplicity that was chosen is to unbox integer values by shifting them left by one bit and adding one. Since pointers are always word-aligned, this made it trivial to distinguish unboxed integers from values that live behind heap pointers. While this is not the best-performing solution (because all integer arithmetic has to operate on tagged values), it is a simple one.
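
To illustrate, here is a minimal sketch of that tagging arithmetic (modelling values as plain native ints; not actual compiler code):

(* An immediate n is stored as 2*n + 1, so its lowest bit is 1, while
   word-aligned block pointers always have a lowest bit of 0. *)
let tag (n : int) : int = (n lsl 1) lor 1
let untag (v : int) : int = v asr 1
let is_immediate (v : int) : bool = v land 1 = 1

(* Addition on tagged values: (2a + 1) + (2b + 1) - 1 = 2(a + b) + 1. *)
let tagged_add (a : int) (b : int) : int = a + b - 1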

Note that there exist compilation targets of OCaml that use 32-bit integer arithmetic, and the OCaml ecosystem largely accounts for that. Having libraries also consider the case where integers have 64 bits seems feasible. Some code will get faster if we can use native 32-bit integer arithmetic.

Ideally, for the sake of simplicity, we would like to emit one type Value to WebAssembly, which represents an OCaml value, which is either:

  • an unboxed integer, or
  • a reference to a heap block.

A heap block of OCaml traditionally consists of a header (for the GC) and a sequence of values.

The most trivial representation (i.e. the one matching most closely the existing one) that I see when I look at the MVP spec is an anyref array that holds both references to other heap blocks and i31ref values. So, from the viewpoint of having to do as little work as possible in order to compile to WebAssembly and keeping the implementation simple, i31ref is certainly looking very attractive for an OCaml-to-WASM compiler MVP.

In https://github.com/WebAssembly/gc/issues/53#issuecomment-546252669, @rossberg summarized:

For polymorphic languages, there are these [heap representations]:

  1. Pointer tagging, unboxing small scalars
  2. Type passing, unboxing native scalars, runtime type dispatch
  3. Type passing, unboxing native scalars, runtime code specialisation
  4. Boxing everything
  5. Static code specialisation

From OCaml's perspective, I think that (2) and (4) don't seem acceptable as a long-term solution in terms of performance. Here, compiling to the WASM linear memory and shipping our own GC seems a more attractive choice.

So, that leaves (3) and (5).

(3) seems fairly complex. If the WebAssembly engine did the runtime code specialization, or if we could reuse some infrastructure from another, similar language, it could be worthwhile for us to work with that. It currently seems unlikely that OCaml in general will switch to (3) in the foreseeable future, unless we can come up with a simple model of runtime code specialization. I expect that implementing runtime code specialization in a WebAssembly engine goes way beyond an MVP, so it seems unlikely this will happen.

(5) is simpler than (3) in the sense that we do not have to ship a nontrivial runtime. If we analyze the whole program in order to emit precise types (struct instead of anyref array) for our heap blocks on WebAssembly, we wouldn't need to use i31ref and we could reap the other benefits of whole-program optimization (e.g. dead-code elimination, operating with native unboxed values, no awkward 31-bit arithmetic). Still, this will be a sizeable amount of work (possibly too much to do right away). I also can't say how bad the size of the emitted code will be in terms of all the types we need to emit. Instead of emitting a single Value type, we need to emit one struct type for every "shape" of heap block that can occur in the program. To keep this manageable, we need to unify all the types whose heap block representations have the same shape. Then, static code specialization kills one nice feature of OCaml: separate compilation of modules. However, instead of doing static code specialization before emitting WebAssembly, maybe it is possible to implement a linker for our emitted WebAssembly modules that does code specialization at link time, if we emit some additional information to the compiled WebAssembly modules? This kind of linker could possibly be interesting to other languages that are in a similar position to us, as well. Obviously, link time will be slower than we are used to. I haven't thought this through in detail at all. It seems likely that these issues are manageable, if enough effort is put into them.

Edit: while the previous paragraph sounds fairly optimistic, looking into whole-program monomorphization (turning one polymorphic function into several non-polymorphic ones) more closely, it is definitely not trivial to implement. The types that we would need for this are no longer present at the lower compilation stages. When I look at the MLton compiler (a whole-program-optimizing compiler for Standard ML), it seems that it is a good idea to monomorphize early, in order to be able to optimize based on the types of parameters. Features like GADTs, and the ability to store heterogeneous values or polymorphic functions in hash maps (or other data types), do not make it simpler. It looks to me like this would mean almost a full rewrite of the existing compiler, and it is not obvious whether every function can be monomorphized (without resorting to runtime type dispatch).
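
For illustration, a hypothetical source-level sketch of what monomorphization does (the real transformation would operate on a typed IR, not on source code):

(* A single polymorphic function ... *)
let swap (x, y) = (y, x)                          (* 'a * 'b -> 'b * 'a *)

(* ... is replaced by one specialized copy per instantiation that actually
   occurs in the program, so each copy can use a precise representation. *)
let swap_int_float ((x : int), (y : float)) = (y, x)
let swap_string_int ((x : string), (y : int)) = (y, x)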

Are we missing something here, are there other techniques that we have been overlooking so far? Feel free to drop pointers to good papers on the topics in general, if you know some.

Also, I am very interested in what perspective other languages with similar value representations have on this, and whether there is interest in collaborating on a code-specializing and dead-code-eliminating linker.

sabine commented 4 years ago

you could optimize integers by making other operations slower: use anyref for $value and have your values be either boxed integers or arrays. Unfortunately this means you'll have to do a slow cast nearly every time you use a value, again because the MVP does not support case casts.

Correct. Considering that, in typical programs, references are more common than integers, this seems worse.

Still,

Oh shoot, I did forget about functors. That does complicate things. That's where more whole-program considerations would have to come in.

Functors are a very widely-used construct in real-world OCaml code.

timjs commented 4 years ago

@sabine @RossTate

(type $boxed_unboxed_integer (struct (field $value anyref)))
(type $value (struct (field $tag i32) (field $contents (ref $block_contents))))
(type $block_contents (array (mut anyref)))

Does $block_contents need to be a mutable array? I.e. do you know anything about the size of the memory block in OCaml's IR so that you can compile to structs with a given number of fields instead of arrays?

(type $boxed_unboxed_integer (struct (field $value_1 anyref)))
(type $value_1 (struct (field $tag i32) (field $field_1 (mut anyref))))
(type $value_2 (struct (field $tag i32) (field $field_1 (mut anyref)) (field $field_2 (mut anyref))))
...

Or does the MVP not allow us to cast something that is a $value_2 to a $value_1 and only use its first field?

sabine commented 4 years ago

@timjs

Does $block_contents need to be a mutable array?

For the largest part, heap blocks are treated as immutable after they are created: their contents do not change. Exceptions are mutable data structures (e.g. arrays, strings) and OCaml's ref feature, which lets you create a mutable reference cell.
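
A small OCaml example of the mutable cases, for concreteness:

(* A ref cell is a one-field mutable record; its heap block is updated in
   place after creation, unlike ordinary immutable blocks. *)
let counter = ref 0
let () = counter := !counter + 1

(* Array elements can likewise be overwritten in place. *)
let xs = [| 1; 2; 3 |]
let () = xs.(0) <- 42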

I.e. do you know anything about the size of the memory block in OCaml's IR so that you can compile to structs with a given number of fields instead of arrays?

Definitely not at every point in the computation. This was a very interesting question, as I am told there is an (unfinished) patch for the IR that adds information about the size of the block to every field access. It is still an open question whether this can be added for every field access, but for everything so far, it was possible.

type foo = { foo : int; }
type bar = { bar1: int; bar2: int; }
type _ t = Foo : foo t | Bar : bar t
let foobar (type a) (w : a t) (x: a) =
  match w with
  | Foo -> x.foo
  | Bar -> x.bar1

For example, here, we can only know the size of the heap block of x after the match on w. So, that means we need to pass x to the function as an anyref and perform the type cast after we know its size.

Or does the MVP not allow us to cast something that is a $value_2 to a $value_1 and only use its first field?

My understanding is that we cannot do this. If we could, we wouldn't even need to cast to the $value_i type with the correct length; we could just cast to $value_i where i equals the (largest) field index being accessed (though it is not unlikely that casting to the correct type may be implemented in such a way that it has better performance than casting to a structural subtype).

This isn't related to the troubles introduced by i31ref, and I expect that a lot of other languages will ask for optimizations in this space at some point later in the future. So I'm not worried here that the workaround will stick with us for all time. :smile:

RossTate commented 4 years ago

@timjs The problem is that there are OCaml operations, like structural hashing/comparison, that need to be able to walk over a structure without knowing what structure it is. The only reasonable way to do that in the MVP is to make everything an array.
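
For concreteness, these are ordinary standard-library operations that work on values of any type by walking their runtime representation:

(* Polymorphic structural comparison and hashing inspect arbitrary values
   at runtime, with no static knowledge of their shape. *)
let _ = compare (1, [2; 3]) (1, [2; 4])
let _ = Hashtbl.hash ("foo", Some 42)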

This isn't related to the troubles introduced by i31ref, and I expect that a lot of other languages will ask for optimizations in this space at some point later in the future. So I'm not worried here that the workaround will stick with us for all time. 😄

@sabine Don't take this for granted. The MVP is not aligned with any successful system in either industry or research. It is, however, aligned with research systems that proved to be dead ends: insufficiently expressive, slow run-time performance, too burdensome to generate, and extremely high type-annotation size. This has been a standing concern for multiple years, and so far all attempts to address it have hit the same walls that the research did. So while small improvements like arrays-with-headers should be doable, there is still no evidence that the MVP can be extended in a meaningful manner to address deeper issues, whereas there is evidence that it can't be.

gasche commented 4 years ago

@RossTate

The problem is that there are OCaml operations, like structural hashing/comparison, that need to be able to walk over a structure without knowing what structure it is.

Another approach may be to use one MVP "runtime type" per OCaml block shape, which corresponds to the "tag" in your representation? If I understand correctly, one would define a structure or array type for each block shape (boxed primitive types, strings, (double) floats, tuples, arrays, etc.), then a runtime type representation for those types, and the runtime primitives would branch¹ on the rtt from the block pointer. One advantage is that block fields are then not forced to fit in the anyref type; specific shapes may use larger fields. One could even define structure types for program-defined types (records, arguments of variant/sum constructors), as subtypes of a more generic representation, and thus get precise types for those.

¹: currently there is no instruction for switching on rtts, one would have to do a linear sequence of equality checks if I understand correctly. It may be possible to use the subtyping hierarchy to build a dichotomic tree of rtts, but this seems a bit awkward.

RossTate commented 4 years ago

Yeah, so you're arriving at the conclusion we arrived at too. Because the MVP's type system is too inexpressive, casts will have to be frequent, and so the best thing for OCaml (and many languages) would be to have some small number of variants that can be efficiently cast to (ideally with a switch). That's why we put this proposal together (which also supports 31-bit unboxed integers, but without requiring everyone to use them).

Alternatively, the problems here could be solved by having a single reference type that can mix scalar and referential data together, along with an efficient cast to check whether a given reference is one such structure (as opposed to a funcref). That's where this conversation also led.

timjs commented 4 years ago

@sabine

Definitely not at every point in the computation. This was a very interesting question, as I get told there is an (unfinished) patch for the IR that adds information the size of the block to every field access. It is still an open question whether this can be added for every field access, but for everything so far, it was possible.

Adding this ability would help eliminate an extra indirection!

That we cannot do this is how I understand it. If we could do this, we wouldn't even need to cast to the $value_i type with the correct length, we could just cast to $value_i where i equals the (largest) field index being accessed (though, it is not unlikely that casting to the correct type may be implemented in such a way that it has better performance than casting to a structural subtype).

I ask this because we were discussing an alternative approach in #94, where Wasm structs consist of a number of references tracked by the GC and a block of linear memory. The idea there is that structs can be cast to "smaller" structs (i.e. having fewer refs and a smaller block of linear memory). I'm curious to know if such a design would help you.

@RossTate

The problem is that there are OCaml operations, like structural hashing/comparison, that need to be able to walk over a structure without knowing what structure it is. The only reasonable way to do that in the MVP is to make everything an array.

Ah, yes! I take it OCaml itself has support for this in its runtime system? Letting language designers have such an option in Wasm would need some kind of runtime type inspection, which I assume is not a part of the MVP. Is such a thing planned for in the future? Or something we'd like to avoid?

@gasche

Another approach may be to use one MVP "runtime type" per OCaml block shape, which corresponds to the "tag" in your representation? If I understand correctly, one would define a structure or array type for each block shape (boxed primitive types, strings, (double) floats, tuples, arrays, etc.), then a runtime type representation for those types, and the runtime primitives would branch¹ on the rtt from the block pointer. One advantage is that block fields are then not forced to fit in the anyref type; specific shapes may use larger fields. One could even define structure types for program-defined types (records, arguments of variant/sum constructors), as subtypes of a more generic representation, and thus get precise types for those.

That would be exactly what I'd like to do in this case!

RossTate commented 4 years ago

Is such a thing planned for in the future? Or something we'd like to avoid?

No idea. I just know that we need to support the pattern in some way, whether or not it's through the same direct mechanism that OCaml's "standard" runtime uses.

@sabine On that point, I realized the best design for the current MVP might actually be to use v-tables (treating all reference types as nullable):

(type $vtablet (struct
  (field $hash (ref (func (param (ref $value)) (result i32))))
  (field $eq (ref (func (param (ref $value) (ref $value)) (result i32))))
  (field ...)
  (field $case i32)
  (field $apply funcref)
))
(type $value (struct (field $vtablef (ref $vtablet))))

Every value would have a pointer to its v-table. This v-table indicates how to implement various structural operations for the value's type. If the value is a case of some algebraic data type, then $case indicates which case it is. That way you can determine the case of a match and then cast to the appropriate type for that case. This sets it up so that you are only ever casting to a type when you know it has that type; no need to do successive inefficient casts to try to figure out which case it belongs to. The $apply field is for closures (see later).

(type $int (struct (field $vtablef (ref $vtablet)) (field $int_value i32)))
(global $int_vtable (ref $vtablet) (struct.new $vtablet ...)) ;; $case is 0, $apply is null

(type $float (struct (field $vtablef (ref $vtablet)) (field $float_value f64)))
(global $float_vtable (ref $vtablet) (struct.new $vtablet ...)) ;; $case is 0, $apply is null

int and float are implemented with their obvious boxed counterparts. Note that i31ref wouldn't help here. Actually, it would make things worse because then your uniform type would have to be anyref so that it could contain i31ref, which in turn means that you'd have to perform an inefficient cast to ref $value for any other purpose.

(global $nil_vtable (ref $vtablet) (struct.new $vtablet ...)) ;; $case is 0, $apply is null
(global $nil (ref $value) (struct.new $value (global.get $nil_vtable)))
;; it's okay for different nils to be distinct values and have distinct tables. the global $nil is just an optimization.
(type $cons (struct (field $vtablef (ref $vtablet)) (field $head (ref $value)) (field $tail (ref $value))))
(global $cons_vtable (ref $vtablet) (struct.new $vtablet ...)) ;; $case is 1, $apply is null

Here you see where $case comes into play. A match on a list value would first check if $case is 0 or 1. There's no need to cast the value in the 0 case since you know there's no more content to get. If it's a 1 and you need either the head or the tail or both, then you'd cast the value to ref $cons.
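
For reference, a sketch of the OCaml-level pattern this compilation models: the match first discriminates on the constructor (the $case field above) and only reads fields in the cons arm.

let rec length (l : 'a list) : int =
  match l with
  | [] -> 0                      (* case 0: nothing further to read *)
  | _ :: tl -> 1 + length tl     (* case 1: cast to $cons, then read $tail *)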

(func $polyvar_vtable_nullary (param $id_hash i32) (result (ref $vtablet))
  (struct.new $vtablet ...) ;; $case is (local.get $id_hash), $apply is null
)
(type $polyvar_single (struct (field $vtablef (ref $vtablet)) (field $elem0 (ref $value))))
(func $polyvar_vtable_single (param $id_hash i32) (result (ref $vtablet))
  (struct.new $vtablet ...) ;; $case is (local.get $id_hash), $apply is null
)
;; have a different type and vtable generator for each small arity
(type $polyvar_large (struct (field $vtablef (ref $vtablet)) (field $elems (ref (array (ref $value))))))
(func $polyvar_vtable_large (param $id_hash i32) (result (ref $vtablet))
  (struct.new $vtablet ...) ;; $case is (local.get $id_hash), $apply is null
)
;; it's okay to accidentally generate multiple v-tables for the same case

Here we see that $case is also useful for polymorphic variants. I'm assuming there's some process for turning the string representation of the case name into an integer representation. So long as that process is consistent, there's no need to use the same v-table for the same case (though obviously that'd be more efficient).
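
A small OCaml example of the polymorphic variants being discussed; constructor names such as `Nil or `Cons below have no central declaration, so the $case integer would come from hashing the name:

let describe v =
  match v with
  | `Nil -> "empty"
  | `Cons (x, _) -> string_of_int x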

;; type coordinate = { xcoord : int ; ycoord : int; }
(type $coordinate (struct (field $vtablef (ref $vtablet)) (field $xcoord i32) (field $ycoord i32)))
(global $coordinate_vtable (ref $vtablet) (struct.new $vtablet ...)) ;; $case is 0, $apply is null

Another advantage of this strategy is that records and cases can use packed representations of their components, with the v-table taking care of specializing structural equality to the record/case at hand, rather than having every component in the uniform representation that you then have to inefficiently cast from every time you access a component.


Now to consider functions. Given a value of type a -> b -> c, we cannot know if it is a closure waiting for two arguments, or a closure waiting for an argument that then will return another closure, or a closure waiting for more than two arguments (since the type variable c could represent a function type). This is a lot of cases to consider at every function application, which is problematic for efficiency (especially since casting is inefficient) and for code size. Ironically, the callee is the one who knows best what to do, if only you had some way to call it safely without knowing its run-time arity.
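
A small OCaml example of why the arity is statically unknown at a call site:

(* Both of these have type int -> int -> int, but the first is a
   two-argument closure while the second takes one argument and then
   allocates and returns another closure. *)
let add  x y = x + y
let add' x = fun y -> x + y

(* A polymorphic caller that only sees the type cannot tell them apart. *)
let apply_two (f : int -> int -> int) = f 1 2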

We can solve this problem by using the dispatch_func extension described in WebAssembly/design#1346. This lets you define a number of "dispatch tags", and lets a funcref case-match on which dispatch tag was supplied, and these dispatch tags can have different arities. So you could have dispatch.tag $apply1 : [(ref $value) (ref $value)] -> [(ref $value)] and dispatch.tag $apply2 : [(ref $value) (ref $value) (ref $value)] -> [(ref $value)] and so on up to some arity, after which you use arrays or something (the first (ref $value) is the closure itself).

Then given a value of OCaml-type a -> b -> c, you'd grab its $apply funcref and call_funcref it using the dispatch tag for however many arguments you are supplying. Suppose you supply it with two arguments, i.e. $apply2. If it's a closure waiting for two arguments, then its case for $apply2 will redirect to a function that calls the enclosed function with the contents of the closure and the last two arguments just supplied. If it's a closure waiting for one argument, then it will redirect to a function that calls the enclosed function with the contents of the closure and the first argument just supplied, and then does call_funcref $apply1 on the resulting value with the second supplied argument. If it's a closure waiting for more than two arguments, then it'll make a new closure waiting for two fewer arguments. There will be some boilerplate to make these dispatch_funcs, but it'll be done once and reused by the entire program rather than at each call site, and it should be much more efficient. It's essentially a finitary approximation of the dynamic stack tricks that the standard OCaml runtime performs.


I realize this feels like how you'd compile OCaml to Java. This is the second time where it seems that this MVP is only well aligned with Java (and Kotlin).

sabine commented 4 years ago

@timjs

we were discussing an alternative approach where Wasm structs consist of a number of references tracked by the GC and a block of linear memory in #94. The idea there is that structs can be casted to "smaller" structs (i.e. having less refs and a smaller block of linear memory). I'm curious to know if such a design would help you.

So, if I understand this correctly, in order to have a uniform representation on #94, OCaml must box all its unboxed integers. Optimization on that can only be done by facing the nontrivial task of implementing type-directed or shape-directed unboxing, supported by code specialization, which breaks up the uniform representation into more specialized representations. Thus, I think that #94 is worse for OCaml than the current proposal with i31ref. In the existing proposal with i31ref, we would make the $value type of OCaml an anyref array. The anyref array with i31ref satisfies both assumptions of the memory model our existing compiler seems married to: the ability to store a scalar or a reference in the same "cell", and the ability to check at runtime whether a cell contains a scalar or a pointer. While we think we can box all the unboxed integers, recovering some of the performance loss seems possible only by means of sophisticated techniques, where the likely outcome is that we need to advise OCaml users not to use language features that are too complex or fundamentally impossible to optimize. It seems that precisely those features that make people choose OCaml (extensive support for polymorphism) would be affected.

Compared to the current proposal without i31ref, an advantage of #94 seems to be that heap blocks naturally are arrays (which is closer to our memory model than the struct types of WASM GC MVP). There is no notion that we can optimize performance by emitting lots of types. I like the simplicity in that.

For OCaml, it looks to me like the current proposal with i31ref is a reasonable target to attempt to compile to, with a realistic chance to make a compiler that people will choose over the existing OCaml-> JavaScript solutions. Even then, some people ask "but what about a 64-bit target on WASM?". The most realistic way to get that seems to be to compile to the linear memory and bring our own GC. But that's fine, having a reasonable 32-bit target on WASM goes a long way.

To compare: in the current proposal without i31ref, reading an integer (boxed from OCaml's viewpoint) requires us to:

  1. read the tag of the heap block pointed to by the reference (memory read)
  2. read the actual integer value from the heap block pointed to by the reference (memory read)

vs.

  1. check whether the anyref is an integer (bit check)
  2. extract the integer from the anyref (shift operation depending on signed or unsigned)

This is all under the assumption to reuse the existing compiler to the extent that is reasonably possible.

gasche commented 4 years ago

@sabine anyref array works for many OCaml blocks (those that only contain OCaml values), but there are other blocks, whose tag is above No_scan_tag, whose payload should not necessarily be scanned by the GC: strings, double floats and double float arrays, custom blocks, etc. If you use anyref array as the block type you need to box the opaque payloads, either word-by-word (in the array) or, preferably, with a second anyref pointing to a precisely-typed representation in those cases. (I suspect that the overhead of the extra indirection for those blocks would be noticeably lower than imposing boxing of immediate values, so that is probably okay.)

@RossTate I haven't had time to look at your alternative MVP proposal, sorry.

Regarding vtables, I'm not sure what you gain compared to just having the block tag (your $case field), and runtime operations switching over this value to cast to the appropriate type. I suspect that dispatching on the vtable is typically slower than having all the runtime-operation code in a single function, due to loss of code locality (but it may allow to benefit from finer-grained static block-shape information, what you call the "packed representation", so maybe this could be a win).

(talking about i31ref again, sorry.) Regarding typed block shapes, in the current MVP one may use width subtyping by having the tag be the first field of each struct. One design could thus be:

value: anyref (unboxed immediate or pointer to a block)
immediate: i31
block pointer: (ref (struct i32 ...))

tuple<n> block:     (struct i32 anyref^n)
record<n> block:    (struct i32 (mut anyref)^n)
double block:       (struct i32 f64)
int64 block:        (struct i32 i64)
closure block:      (struct i32 funcref i32 funcref (ref (array anyref)))    (see at the end for more information)
array block:        (struct i32 (ref (array anyref)))
double-array block: (struct i32 (ref (array f64)))
string block:       (struct i32 (ref (array i8)))
custom:             (struct i32 (ref custom-ops) (ref (array i64)))

All the array fields correspond to extra indirections that could be avoided with a trailing-inline-array or array-with-header design, but I think that the cost may actually not be that large (assuming the array is allocated close to the value); again, smaller than the overhead of boxing all immediate values without substantial (and slightly unrealistic-sounding) changes to the compilation model. One thing I don't know is how cheap or costly the cast from anyref to ref (struct i32), and from ref (struct i32) to ref (struct i32 ...) for each tag, would be.

The current OCaml GC uses an i8 for the tag, packed with GC information etc. in a single header word. If typical WebAssembly runtimes do not allow packing struct data with their own metadata, using an i32 instead opens the door to finer-grained tags corresponding to program-defined types, which could eliminate anyref loss-of-precision for statically typed tuples, records and variant parameters.

Re. curried applications (apologies @RossTate for not answering your questions on currying/application earlier): OCaml closures contain two code pointers, one that exposes a "naive" API where arguments are passed one by one (fun x -> (fun y -> (fun z -> foo))), and one that expects the "static" arity of the function definition (3 in this example if foo is not a fun ..), with an "arity" field that stores this expected-arity information. If you want to call an unknown function and pass N arguments, you first check if its expected arity is N (this is the fast path), or you fall back to passing the arguments one by one. (Implementation note: to avoid code-size blowup, the "pass arguments one by one" and "expect arguments one by one" parts are factored into helper functions caml_curry<n> and caml_apply<n> that are generated by the compiler at link time. See generic_functions in the compiler code.)
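
A very rough OCaml model of that fast-path/slow-path check (hypothetical types, not the real runtime API; the real helpers are generated code operating on untyped closures):

type closure = { arity : int; code : Obj.t }

(* Calling an unknown closure with two arguments: compare against its
   stored arity, otherwise fall back to one-by-one application. *)
let apply2 (clos : closure) (a : Obj.t) (b : Obj.t) : Obj.t =
  if clos.arity = 2 then
    (Obj.magic clos.code : Obj.t -> Obj.t -> Obj.t) a b          (* fast path *)
  else
    (Obj.magic ((Obj.magic clos.code : Obj.t -> Obj.t) a)
       : Obj.t -> Obj.t) b                                       (* slow path *)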

sabine commented 4 years ago

@gasche

Yes, I think the use of runtime types to represent a block's tag, so that we know what type to cast it to, is a valid choice in the current MVP. And you are right, the presence of heap block shapes that can store i32 values is what breaks the uniformity of the representation with plain anyref arrays.

Edit: to add to that, I hadn't considered this at the point where I commented on #94, so that puts #94 in a better shape, since we have to do type casts all over the place anyway in order to handle the heap block shapes that contain i32 values. Will reevaluate later.

2 hours later: Hmm... what about #94, but with i31ref support in the reference array? The generic heap block of OCaml is represented by using the reference array with unboxed scalars of the WASM heap block. Heap blocks of OCaml that hold 32-bit values (e.g. string, Int32, array, etc.) are represented using the linear memory array of the WASM heap block.

RossTate commented 4 years ago

@gasche Thanks for the info on how the standard OCaml runtime deals with currying!

One thing I don't know is how cheap or costly the cast from anyref to ref (struct i32), and from ref (struct i32) to ref (struct i32 ...) for each tag, would be.

In the current MVP, an rtt cast entails the following steps:

  1. if the value being cast is an anyref, check that the value is not an unboxed scalar
  2. load the array of rtts from the reference at the appropriate offset in the heap
  3. load the size of the array
  4. check that the size is greater than the index of the rtt being cast to
  5. load the rtt in the array at the index of the rtt being cast to
  6. check that the loaded rtt is referentially equal to the rtt being cast to

This is not cheap, especially since it involves double indirection. I don't know of any runtime that has such casts as regular parts of the hot path (and remember that WebAssembly is not supposed to need speculative techniques like inline caching to achieve decent performance).

So one way to evaluate these various designs is to consider how many rtt casts they require. Let's consider in particular two programs: fold_left (+) 0 nums and fold_right (+.) 0.0 nums. The design I gave involves 3 * |nums| + 1 casts: one for each time the fold casts the list to a Cons, two for each call to the implementation of + and +. in order to cast their arguments to the expected types, and lastly one to cast the final result to the expected type. None of these casts are from anyref and so can skip the check for unboxed scalars. However, although my design skips the need to cast closures here, it does involve a load of the v-table in place of the cast. Similarly, although my design skips the need to cast to get the case, it does involve a load of the v-table in place of the cast.
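
For reference, the two expressions under discussion, written against the OCaml standard library (note that List.fold_right takes the list before the accumulator):

let sum_ints (nums : int list) : int =
  List.fold_left (+) 0 nums

let sum_floats (nums : float list) : float =
  List.fold_right (+.) nums 0.0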

On the other hand, the design @gasche just gave involves, for each iteration element in nums:

  1. Cast the list to a block pointer (even in the nil case).
  2. Cast the list to a tuple<2>.
  3. Cast the function to a closure. (Can be amortized in the fold_left case.)
  4. Cast the two arguments to (+) to i31ref and the two arguments to (+.) to boxed doubles.

This boils down to 2 * |nums| + 3 in the fold_left (+) 0 nums case (not counting the 2 * |nums| casts to i31ref), and 5 * |nums| + 2 in the fold_right (+.) 0.0 nums case. Plus all but |nums| casts are from anyref and so involve a check for unboxed scalars.

So it's hard to say which design would perform better for fold_left (+) 0 nums, but it seems very likely that mine would perform better for fold_right (+.) 0.0 nums (and probably for fold_right (+) 0 nums and fold_left (+.) 0.0 nums as well).

I suspect that dispatching on the vtable is typically slower than having all the runtime-operation code in a single function, due to loss of code locality (but it may allow to benefit from finer-grained static block-shape information, what you call the "packed representation", so maybe this could be a win).

Good point, though the locality cost (if any) would only happen in structural operations, whereas the benefits of packed representations would happen regularly. Plus having the funcref for application in the v-table I think will help optimize OCaml's frequent use of closures.

P.S. @sabine @gasche @timjs This is a great brainstorming discussion! I hope this sort of thing will happen for every language targeting WebAssembly. 😄

gasche commented 4 years ago

I'm not that worried about the cost of rtt checks for typical "block shapes", which would be declared all together as the rtts of a handful of structural types. If we do those checks all the time, the rtt values will be in cache, so the dereferences should be relatively cheap. (Of course supporting variants/enumerations-with-parameters in the language would be even faster, as switching on the tag would then be enough to learn the static type information, instead of having to combine with a runtime check.) Following an indirection which is part of our working data is worse, as it may typically have poor locality and cache behavior.

  • Cast the list to a block pointer (even in the nil case).
  • Cast the list to a tuple<2>.

In the current OCaml implementation, the nil case [] is represented as the (tagged) integer 0 (this is a natural consequence of the fact that parameter-less constructors are represented as immediate integers), so the cons-cell is the only possible block for list-typed values, and pattern-matching on a list never checks the block tag, directly accessing its fields. This would correspond to casting the anyref to tuple<2> directly if it is not a tagged integer. So there would be the ref.is_i31 check in any case, and a single reference cast in the cons-cell case. (Same with 'a option, OCaml's Maybe.) Of course, most datatypes have several non-constant constructors, so there one would use two casts.

If you use int31ref with 31-bit integers, (+) also doesn't need any rtt cast, just ref.as_i31, which is much cheaper. For (+.), again the static types tell you that the values you get can only be double, so you never check the tag, you convert to a double block right away.

Without counting the function usage, and ignoring int31ref checks, I count |nums| rtt casts in the unboxed-integer case, and 3*|nums| rtt casts in the boxed-double-float case.

When you have type information (on lists, on floats, etc.), the code generator knows what block shape it wants to use, so in general you should only need a single cast from "uniform value representation" to "more precise type" -- but you still do need it. (Being able to dispatch on all possible shapes only comes up in ad-hoc runtime operations.)

RossTate commented 4 years ago

Ah, nice observation that, because nil can be checked for via ref.is_i31, and because Cons is the only block-pointer case, we can skip the cast to get the case info. That only eliminates |nums| casts though, so for fold_right (+.) 0.0 nums you still have 4 * |nums| + 2 casts (cast to Cons, cast to closure, cast args of (+.) to boxed doubles). That said, that improvement probably puts the two strategies in the same ballpark, at least for algebraic data types with just one non-nullary case.

sabine commented 4 years ago

@timjs @RossTate @taralx @gasche

I suspect that, if the structref proposal #94 supported i31ref in the reference array, this is a better heap model for OCaml than the current MVP proposal:

value: anyref (unboxed immediate or pointer to a block)
immediate: i31ref
block pointer: (ref structref)

tuple<n> block:     (structref n 1)   (one byte for the tag and n references)
record<n> block:    (structref n 1)
double block:       (structref 0 9)    (one byte for the tag, and 8 bytes for a f64)
int64 block:        (structref 0 9)
closure block:      (structref (n+2) 5)    (one byte for the tag, 4 bytes for the arity,
                                            two function references and n values
                                            as the environment of the closure)
array block:        (structref n 1)
double-array block: (structref 0 (1+8*n))   (one byte for the tag, 8 bytes for each f64)
string block:       (structref 0 (1+n))     (one byte for the tag, one byte per char)
custom:             (structref 1 (1+n))     (one byte for the tag, one function reference,
                                              as many bytes as needed)

Edit: the numbers are off. In practice, we need to align the values so that they can be read in a single memory access. This is something that should be worked out in proposal #94 or by using explicit padding.

We can work around not having i31ref in #94 by using roughly twice the amount of memory and only one more memory access:

Note that, if there is no i31ref, we do get 32-bit integer arithmetic, at the cost of one additional memory access for loading the "boxed" integer (which is only boxed from OCaml's viewpoint, but not in the WASM GC heap model, where it lives in the same struct, but in a different place). Edit: and at the cost of using twice as much memory, and having to copy twice as much memory when duplicating an object, etc.

We don't need to make up a hierarchy of types and casts between them, the bounds checks are easy for us to emit from the existing compiler.

Considering that I'm a really junior person working on the OCaml compiler, I might be missing something here that makes this all fall apart. Please go ahead and pick this apart.

gasche commented 4 years ago

@RossTate: thinking about this more, I think that vtable-based and rtt-based tag checks would have very close performance profiles (with the rtt representation sketched in the proposal): we are following a pointer which we expect to be in the cache (there is a small number of block rtts or vtables), then doing some cheap operations on the dereferenced block (in the vtable case, only a field read, in the rtt case we do three reads on the block and an equality check). rtts are also more likely to benefit from some sort of optimization by the underlying engine.

Regarding closures, I think that one principled way to amortize the rtt cast (to avoid doing one on each application) would be to use a lightweight typed compilation, where each type variable is typed in the wasm translation by its "shape": any function type a -> b is typed as a reference to a closure block, an integer is just an i31ref, etc. (This sounds similar to "transient" gradual typing.) With this compilation strategy (which would require some extra type propagation or local type reconstruction in the type-erasing OCaml compiler), fold_left (+) would not require any closure-specific cast: (+) is already known at the precise type ref closure-block, and fold_left expects a ref closure_block as its first argument. On the other hand, Fun.id (+) requires a cast (Fun.id is the polymorphic identity), as function application in wasm returns an anyref value, which needs to be refined again into a closure reference. This is similar to some type-based unboxing approaches, which are known to have the downside of sometimes applying more coercions than necessary, changing the space complexity of a computation (note: we are not wrapping functions in coercions here). For example, if iterate : int -> ('a -> 'a) -> 'a -> 'a is repeated function application (function exponentiation), then iterate n Fun.id (+) would perform n unnecessary casts. This can be a very costly corner case for unboxing, as the unnecessary boxing/unboxing would cause n allocations, but this is actually fine for casts that preserve the value unchanged, so neither space nor time complexity has changed here. (Also, one half of the wrapping/unwrapping operations are just subtyping, so completely free.)
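
For concreteness, the higher-order example mentioned above (Fun.id is the stdlib polymorphic identity; iterate is defined here just for illustration):

(* Repeated application: with shape-directed typing, each pass of the value
   through the polymorphic 'a inserts a cast, but no allocation. *)
let rec iterate (n : int) (f : 'a -> 'a) (x : 'a) : 'a =
  if n = 0 then x else iterate (n - 1) f (f x)

let _ = iterate 5 Fun.id (+)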

gasche commented 4 years ago

Here would be a proposal for a slightly different i31ref design: instead of being a subtype of anyref, this could be a modality on reference types, like null is used to indicate nullability: ref i31 <heaptype> would be either a reference or an int31 value (ref null i31 <heaptype> would also exist).

With this design, one could use a uniform block representation that is finer-grained than anyref, saving a cast. In the (struct i32 ...)-based representation I proposed above for OCaml, one would use ref i31 (struct i32) as the uniform value type, so no cast would be necessary to access the block tag in the block case.

When type information is available, we could even use a more specific type, for example a lists could be represented as

(type $list (ref i31 $cons-block))
(type $cons-block (struct i32 $value $list))

which would require no cast at all when following the list spine.

gasche commented 4 years ago

@sabine your proposal would be fine (some numbers need tweaking). The main improvement compared to the one in https://github.com/WebAssembly/gc/issues/100#issuecomment-654214132 is that there is no need to have an extra indirection for arrays/strings. (For closures, one could avoid the indirection by having an environment-length field and casting to the refined struct type. For arrays and strings, this indirection could be avoided by post-MVP arrays-with-headers extensions.)

On the other hand, because the representation of blocks is less precisely typed, you need extra casts when you access the data in some cases. You don't need a cast if you access a reference value as a uniform value (for example a field out of a closure), and I expect that the reinterpretation cast for non-reference values (from bytes to, typically, int64 or float64) would be very cheap. But in the case of closures, you would need a cast from an arbitrary reference to a function pointer, and those could be expensive (I don't know how the engine would implement this). With the more-typed design, funcrefs are directly available, so the type system guarantees that closures contain callable functions. Depending on the cost of anyref-to-funcref casts, this could drown out the performance benefits for arrays and strings.

sabine commented 4 years ago

in the case of closures, you would need a cast from an arbitrary reference to a function pointer, and those could be expensive

That's a good point I hadn't considered: function pointers are the weak point in this largely untyped heap. Can that be resolved by refining #94 into a better proposal?

The engine must perform some checks on the signature of the function pointer and the state of the heap at runtime anyways, does that affect the cost of a type cast from the generic reference type to a function reference?

One existing workaround is to use indices into a table to represent the function pointers on the heap. But that's certainly not very cheap either. So, in the long run, there should be a reasonably efficient way to deal with function pointers on the heap.

gasche commented 4 years ago

Thinking about this more: if the engine maintains runtime type information on the side (which is suggested by the MVP proposal at least) and this information can be trusted for security/safety purposes, then casting an anyref to a funcref need not be more expensive than any other runtime-checked cast. For the purpose of imaginary considerations on performance, you could just count this as an extra checked cast. Then this is just an extra cast in the (fast-path) case of function application compared to the more-typed proposal, which does not sound so bad; and the "structref bounds/width assertion" that you need to access the block fields might be cheaper in the structref design than an rtt cast in the typed design (not sure).

On the other hand, the typed design opens the door to the idea of generating more precise types during compilation, with precise subtypes of the uniform value type, which could save many casts. This wouldn't be possible in the lower-level structref design, if structref bounds are not part of the static type information.

chambart commented 4 years ago

Sorry for jumping into the middle of a discussion (I'll probably comment on the rest later for other things), but right now there is something that I (and others) find a bit strange: I don't really see why only i31ref should exist. I understand that, as this is currently expressed, i-n-ref is a subtype of anyref, which means that anyref must be large enough to represent it, and of course forcing every engine to use larger-than-32-bit addresses is unacceptable. But it looks like the type ordering relation here is not the right one. There could be one more type (I'll call it 'reforinteger'). anyref would be a subtype of it, and i-n would be a subtype of it too. That way anyref would not have to be able to store it, so 32-bit pointers would still be doable. You could also have two versions, 'reforinteger-32' and 'reforinteger-64', of which i31 and i63, respectively, would be the integer subtypes. Note that this pattern could even allow representing NaN-boxing, by having a reforfloat type representing either a pointer or an f64.

gasche commented 4 years ago

Above I suggested using a reference modifier like null, i.e. ref i31 <reftype>; it looks like you are proposing a more general feature, tagged <immtype> <reftype>.

sabine commented 4 years ago

On the other hand, the typed design opens the door to the idea of generating more precise types during compilation, with precise subtypes of the uniform value type, which could save many casts. This wouldn't be possible in the lower-level structref design, if structref bounds are not part of the static type information.

So, in proposal #94, it should be possible for the producer to emit structref bounds as part of the static type information, in order to allow engines to optimize when this information is provided by the WASM producer.

@chambart Making tagged integers a supertype of anyref instead of a subtype is a very interesting idea, because it would enable much more flexible use, and the potential for languages such as Ruby and OCaml to establish a reasonable 64-bit target on the WASM GC.

I found the discussion about tagged integers at https://github.com/WebAssembly/design/issues/919. I read this as: @lars-t-hansen proposed a supertype to the pointer type that can store tagged values, but this was not pursued further because the pointer in that supertype was untyped (and would thus require a type cast before accessing it).

Has enabling unboxed integers as tagged <immtype> <reftype> been seriously considered elsewhere? There seems to be a qualitative difference to the proposal in https://github.com/WebAssembly/design/issues/919, since the tagged type does include both types, the immediate type and the reference type.

taralx commented 4 years ago

There's the soil-initiative alternate design, which has something like a genericized "ref i31 typ".

RossTate commented 4 years ago

Lots of cool thoughts here.

But in the case of closures, you would need a cast from an arbitrary reference to a function pointer, and those could be expensive (I don't know how the engine would implement this).

A cast to a funcref should be just as efficient as to the other "primitive" reference types. The information might even be encoded by some engines as a bit directly in the pointer. Casting to a typed function, on the other hand, would be more expensive. But there's not really a need to do that.

There's the soil-initiative alternate design, which has something like a genericized "ref i31 typ".

To give some context, one of the premises of this alternative design is that efficient case-testing/casting/switching would be useful for an MVP that needs much more casting than other systems due to its coarse type system (and which would ideally support runtimes for untyped languages that also rely heavily on efficient casting). The design recognizes that each language has its own casting/representation needs, and that a universal reference type like anyref causes languages to compete with each other (it seems like the ideas in WebAssembly/design#919 were ruled out basically because of the conflict with a universal reference type), so the design lets each module define its own reference types and keeps them from getting mixed up. The design also recognizes that the engine has its own needs, and so it has languages specify the high-level casting structure they need and lets the engine determine how best to implement that casting structure (including unboxing scalars and such) within its own infrastructure. So an OCaml module could say one of its cases is immutable signed 31-bit integers, and the engine would decide whether/how to pack or box those integers. The design ensures that the choice made is unobservable (besides the performance difference).

RossTate commented 4 years ago

We can work around not having i31ref in #94 by using roughly twice the amount of memory and only one more memory access:

Although not discussed there yet, something that would make sense for #94 is to have a way to coerce i32 to and from primitive references, with again the expectation that these coercions would be pretty efficient. Whether they'd be boxed or not may or may not depend on the engine. Regardless, this would further reduce the cost of not having i31ref.

timjs commented 4 years ago

Ignored this discussion for four days and it exploded again 😄 Below I tried to structure some replies on comments of the last few days.

@sabine

I suspect that, if the structref proposal #94 supported i31ref in the reference array, this is a better heap model for OCaml than the current MVP proposal.

This is exactly what I had in mind, but obviously didn't take the time to explain as thoroughly as you did now. I think the memory model using structref is simpler and hopefully more flexible than the current MVP. It naturally supports mixing refs and values in one heap object and bounds checks are cheaper than the proposed rtt casts. Adding i31ref to the proposal is just an optimization.

We can work around not having i31ref in #94 by using roughly twice the amount of memory and only one more memory access.

I haven't thought of this one before, and I like it! The solution I had in mind for an i32 array was using n refs pointing to new structrefs containing just the integer. But this solution uses 2 * (n + c) memory, where c is the size of each structref's header, instead of 2 * n in your solution.

That's a good point I hadn't considered: functions pointers are the weak point in this largely untyped heap. Can that be resolved by refining #94 into a better proposal?

I did not think about the casting of function pointers at all. Well not a type safe cast as we'd like to have in Wasm... All FP backends I know just "use the pointer as a function pointer" because they already statically know this is true 😅

@gasche

With this design, one could use a uniform block representation that is finer-grained than anyref, saving a cast. In the (struct i32 ...)-based representation I proposed above for OCaml, one would use ref i31 (struct i32) as the uniform value type, so no cast would be necessary to access the block tag in the block case.

But this design can still be used with the current MVP, can't it? Only Wasm itself wouldn't statically know about the possibility that we'd like to store an i31 in the anyref:

(type $list anyref)
(type $cons-block (struct i32 $value $list))

Case splitting on something of type $list would branch on a cast to i31ref: if succeeding, we have a Nil; if failing, we have Cons.

Thinking about this more: if the engine maintains runtime type information on the side (which is suggested by the MVP proposal at least) and this information can be trusted for security/safety purposes, then casting an anyref to a funcref need not be more expensive than any other runtime-checked cast. For the purpose of imaginary considerations on performance, you could just count this as an extra checked cast. Then this is just an extra cast in the (fast-path) case of function application compared to the more-typed proposal, which does not sound so bad; and the "structref bounds/width assertion" that you need to access the block fields might be cheaper in the structref design than an rtt cast in the typed design (not sure).

I think it is true that bounds assertions are cheaper than rtt casts. But I don't know how to extend #94 with rtts, or if it is necessary to do so. The only thing you can do with structrefs is expose the number of references and the size in bytes of the linear memory in the type, so bounds checks kind of refine the type of a structref. However, all references in a structref will always be anyrefs. Or we have to extend the type of structrefs once more, to include the types of all its references. I really don't know if that is something which is helpful to do, and if it will save us from a lot of casts.

@RossTate

A cast to a funcref should be just as efficient as to the other "primitive" reference types. The information might even be encoded by some engines as a bit directly in the pointer. Casting to a typed function, on the other hand, would be more expensive. But there's not really a need to do that.

It is good to know that casting to a funcref shouldn't be expensive. Can you elaborate more about your statement that there is no big need to cast a typed function pointer?

P.S. @sabine @gasche @timjs This is a great brainstorming discussion! I hope this sort of thing will happen for every language targeting WebAssembly. 😄

I love these kind of discussions too!

gasche commented 4 years ago

One way to understand @sabine's interesting twice-the-amount-of-memory proposal is that in a design that does not allow for unboxed immediates, one approach is to make all values (struct anyref i64) (meaning the immediate if and only if the pointer is null), and use the struct-of-array optimization for any sequence of such values. But besides memory usage, this design also pays a somewhat heavy price in terms of extra allocations, as reading a uniform-value representation from an array always boxes. (Floating-point numbers in OCaml behave more or less in this way already, they are unboxed in arrays and otherwise boxed, with local unboxing optimizations. It works, but there is a noticeable overhead when mixing floating-point code and parametric functions.)

@timjs:

(type $list anyref)
(type $cons-block (struct i32 $value $list))

Case splitting on something of type $list would branch on a cast to i31ref: if succeeding, we have a Nil; if failing, we have Cons.

The problem is that when you learn that the value is not an immediate, you still don't know that it is a ref $cons-block, you have to do a cast for this. If instead of anyref you use my proposed (ref i31 $cons-block), then the knowledge is statically available. (You don't even need a bound check as in the less-typed proposal.)

gasche commented 4 years ago

Can you elaborate more about your statement that there is no big need to cast a typed function pointer?

The standard approach here, if we work with uniform / dynamic / weakly-typed value representations, would be to cast to a function pointer whose type is basically anyref anyref ... -> anyref, casting the arguments instead of trying to cast (or wrap) the function pointer itself. You will need a cast for the arity at application time, but you do not need to cast the arguments (the function body itself can do it) or the return value (the consuming code itself can do it). Of course, when you do have more static information on the function type, then it is definitely interesting to cast to a more-typed representation right away, to avoid losing precision on the argument and return types.
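
As an illustration only (OCaml modelling the Wasm-level convention with a made-up value type, not actual compiler output): application checks only the callee, while argument and result casts are left to the callee and the consumer.

(* OCaml model (not Wasm) of the uniform calling convention: every closure is
   stored as value -> value; the call site checks only that the callee is a
   function, and argument/result casts happen in the callee/consumer. *)
type value =
  | Imm of int                (* stands for an unboxed scalar *)
  | Block of value array      (* stands for a heap block *)
  | Fun of (value -> value)   (* stands for an anyref -> anyref funcref *)

exception Cast_failure

let as_fun = function Fun f -> f | _ -> raise Cast_failure
let as_imm = function Imm i -> i | _ -> raise Cast_failure

(* Application site: one check on the callee, no cast on the argument. *)
let apply (callee : value) (arg : value) : value = (as_fun callee) arg

(* The callee body does its own argument cast; the consumer casts the result. *)
let succ_value : value = Fun (fun v -> Imm (as_imm v + 1))
let () = assert (as_imm (apply succ_value (Imm 41)) = 42)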

RossTate commented 4 years ago

Regarding function references, a call to an "untyped" funcref is just a little slower than a call to a typed function reference would be (see WebAssembly/design#1346 on dispatch tags for more info), faster than what it would take to first cast that funcref to a typed function reference (especially using the current MVP's design), and an engine can even amortize the small overhead over successive calls. So a cast to funcref can be done quickly (since it's a coarse/primitive type), and there's not really a good reason to cast to a more precise type. (I also talked here about how to make dispatch tags for each arity, and even how to implement dynamic-arity tricks with funcref.)

aardappel commented 4 years ago

@gasche

instead of being a subtype of anyref, this could be a modality on reference types, like null is used to indicate nullability: ref i31 would be either a reference or an int31 value (ref null i31 would also exist).

That's a really nice idea, actually: that way, languages that don't need i31ref, combined with engines where pointer tagging is optional, would not pay the price of this extra bit check.

So null makes a ref nullable.. and i31 makes a ref.. scalarable? :P

sabine commented 4 years ago

I don't know how to extend #94 with rtts, or whether it is necessary to do so

My impression is that #94 does not need an rtt mechanism, because the only "typing" #94 provides is the size of the reference array and the size of the linear memory array (and this linear memory is a sequence of bytes, just like the other WASM linear memory). Internally, every heap block must carry these bounds around (for security reasons, to prevent out-of-bounds access). The bounds here are effectively the type.

A WASM producer whose compilation strategy involves checking at runtime which shape a block has could place a header in the linear memory and, based on that, assert different bounds. Or #94 could expose the bounds checks as explicit operations, and a producer whose encoding allows distinguishing blocks at runtime based only on the bounds could use those.

I might be missing the greater point of runtime types, though, that cannot be achieved by this simple method.

The appeal of #94 is really that it feels closer to a mental model that producers are already accustomed to. They have full control over arranging their linear memory array as they see fit. The open questions are: what are the qualitative differences from the existing proposals? And what would #94 look like once we add all the things that are needed for efficiency's sake?

sabine commented 4 years ago

@rossberg @aardappel @RossTate what's your assessment of a type tagged <immtype> <reftype> as a supertype of anyref - as proposed by @chambart, and given syntax by @gasche?

This would additionally enable 63-bit scalars, at the cost that producers who want to use 63-bit scalars spend more memory on references when storing them on the heap. Producers who do not use the 63-bit or 31-bit scalars simply do not use them.

I have the impression that handling this as a supertype of the general reference type could be the correct way to deal with scalars and references that live in the same heap cell. This way, engines can internally use 32-bit or 64-bit pointers, without exposing this to the outside world. At the same time, it becomes possible to efficiently represent 63-bit scalars on the heap - which means producers that rely on efficient scalar representation on the heap can compile to 64-bit WASM.

RossTate commented 4 years ago

Something needs to be clarified before I can give thoughts. Is tagged a type constructor? In particular, can different modules specify different sizes for <immtype>?

sabine commented 4 years ago

If I understand this correctly, tagged <immtype> <reftype> is a new type where <immtype> is either i63 or i31, and <reftype> is a regular reference type. Please correct me if I'm wrong, @chambart. Edit: I think there was also a point about float NaN-tagging in there, so tagged f64 anyref should be possible.

E.g., a value of type tagged i63 anyref can be either i63 or anyref. A value of type tagged i31 (ref $x) can be either i31 or ref $x.

Edit: different modules could most likely use different tagged-types.

gasche commented 4 years ago

Yes, this is the idea. The storage size of tagged <immtype> <reftype> would be max(size(<immtype>) +1, size(<reftype>)). If we wanted to be even more precise, we could have a sort of aligned 8 <reftype> form to reduce the storage size of a reftype through an alignment guarantee, and use tagged i31 (aligned 2 <reftype>), with size(tagged i r) = 1 + max(size(i), size(r)).
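
To make the sizes concrete (my own arithmetic, using the first formula above and assuming anyref is 32 bits wide on an engine with 32-bit pointers and 64 bits wide on an engine with 64-bit pointers):

size(tagged i31 anyref) = max(31 + 1, 32) = 32 bits on a 32-bit engine, max(31 + 1, 64) = 64 bits on a 64-bit engine
size(tagged i63 anyref) = max(63 + 1, 32) = 64 bits on a 32-bit engine, max(63 + 1, 64) = 64 bits on a 64-bit engine

So only producers that ask for i63 (or that already run on 64-bit pointers) pay for 64-bit slots.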

RossTate commented 4 years ago

I'm not sure my question was answered. Let me lay out the problem. Suppose one module uses tagged i30 ... and another module uses tagged i31 .... In the context of #94, these both need to be coercible from primref. That means primref needs to be able to tell whether a given reference was created as tagged i30 or as tagged i31. Thus bits would have to be used to record which immediate type was used, making the information no longer fit into the given space. So this only really works (in a design with a universal representation) if there is a universally agreed-upon immediate type (or maybe two: one for large immediates that would be boxed on engines with 32-bit pointers, and one for small immediates).

So now let's suppose we choose to support just i31 and i63 to address the above problem. I think it's safe to say that i63 doesn't work. It's incompatible with NaN boxing, which is an option that should be left open (at least, with the understanding that WebAssembly is supposed to be compatible with a wide variety of implementation techniques rather than prescribe specific ones).

As for i31, I think it makes the wrong tradeoffs for #94. In #94, one of the most frequent operations you will be doing is casting primref to the various primitive reference types. So if I were using 32-bit pointers, I would think a viable strategy would be to use the low 2 bits of the pointer to flag various common reference types, e.g. 01 for structref, 10 for funcref, 11 for some other common reference type (I have ideas for what that should be, but I won't go into that here), and 00 for "other" references. Even if I reserve 11 for scalars, that's just 30 bits. And before we go into considering i30, another factor to consider is the complication to garbage collection itself. With the approach I give above, precisely garbage collecting a structref is pretty easy; you just walk through the reference elements, mask off the low 2 bits, and then proceed to the referenced value (using standard tricks for dealing with nulls). With i30, you'd need to branch before dereferencing each element, slowing down the whole process.
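
Purely to illustrate the arithmetic (an OCaml model with machine words represented as plain ints; obviously not how an engine is written), the 2-bit scheme and the GC walk described above look like this:

(* Model of the 2-bit pointer-tagging scheme sketched above, with words as
   OCaml ints: 01 = structref, 10 = funcref, 00 = other reference. *)
let tag_bits w = w land 0b11
let is_structref w = tag_bits w = 0b01
let is_funcref w = tag_bits w = 0b10
let untag w = w land (lnot 0b11)

(* A precise GC walk over a structref's reference fields only needs the mask
   before dereferencing; no per-element branch for unboxed scalars. *)
let walk_fields (fields : int array) (visit : int -> unit) : unit =
  Array.iter (fun w -> visit (untag w)) fields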

Fundamentally, in any design with a universal representation, such as the current MVP and #94 (but not the SOIL Initiative's preliminary design), unboxed scalars come at a cost to all languages that do not use them. Meanwhile, something like i32ref (and/or i64ref) would not have such costs but would at least enable more compact/efficient means for boxing scalars (that some engines might even be able to keep unboxed). It's also easier to add unboxed scalars as a feature later, if we discover they are more useful and less costly than we expected, than it is to remove the feature. So in a system with huge backwards-compatibility considerations, I think it's better to go with the more conservative option.

jakobkummerow commented 4 years ago

Letting Wasm modules specify custom tagging schemes like tagged i31 anyref, tagged i63 anyref, tagged f64 anyref seems to be at odds with:

engines can internally use 32-bit or 64-bit pointers, without exposing this to the outside world

because it effectively forces engines to use the prescribed pointer width and tagging scheme wherever these types propagate.

Also, consider subtyping: a subtype that refines a field type from tagged i63 anyref to just anyref would have to store that anyref with 64 bits as well (so that its layout is compatible with its supertype's). If you stick to the structural subtyping of the current proposal, then I think the sheer possibility of such subtyping relationships would force all reference fields everywhere to be 64 bits wide; if you switch to nominal subtyping, then (a) that's a much bigger change and (b) that would have the consequence that you can't compute a struct's on-heap layout without inspecting all of its supertypes as well.

More generally: I believe there's a hard requirement that references to any type in a subtyping hierarchy have the same bit width. So "anyref" can't possibly be a subtype of both a 64 bit wide tagged i63 anyref and a 32 bit wide tagged i31 anyref.

For unboxed integers, 31 bits is the obvious width: it's the largest value (hence maximizing benefit/applicability) that's small enough to let engines choose their own pointer width and tagging scheme (maximizing implementation freedom -- as was pointed out before, object pointers can still use extra tag bits, e.g. <...31bits...>1 could be i31ref and <...30bits...>10 and <...30bits...>00 could be differently tagged object pointers; and of course NaN-boxing is compatible with i31ref as well).
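
Again modelled in OCaml with words as ints (my own sketch of the bit layouts in the previous paragraph): the low bit distinguishes i31 values from object pointers, and object pointers still keep a spare bit for engine-internal tags.

(* Bit layouts from the comment above:
   <...31bits...>1  i31ref value
   <...30bits...>10 and <...30bits...>00  differently tagged object pointers *)
let encode_i31 v = (v lsl 1) lor 1
let decode_i31 w = w asr 1
let is_i31 w = w land 1 = 1

(* For words that are not i31 values, bit 1 is free for the engine's own tag. *)
let pointer_engine_tag w = (w lsr 1) land 1

let () =
  let w = encode_i31 (-5) in
  assert (is_i31 w && decode_i31 w = -5)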

This isn't to say that we must have i31ref, just that: if we want a guaranteed-unboxed iXref, then X=31 is the way to go.

I'd be perfectly fine with having a fairly-generic reference type that's guaranteed to not be an i31ref, to shave off a machine instruction or two from code that doesn't need it. That could be some sort of annotation (like the (ref null? i31? $t) suggested above), or a new type in the predefined hierarchy, e.g.:

i31ref <: eqref <: anyref
$t <: heapref <: eqref   for all struct/array types $t

sabine commented 4 years ago

@jakobkummerow @RossTate Ok, I got this wrong. I said make tagged a supertype of anyref, and that cannot work because type casting is an operation that cannot change the bit representation of the value. That makes sense. The same goes for primref of proposal #94 which also cannot be a subtype of tagged.

Does it make more sense as a separate type that is not in a type hierarchy with anything else, and where we have operations to extract the immediate or the reference?

tagged i63 anyref
tagged.from_imm $immediate_value
tagged.from_ref $ref_value
tagged.is_imm $tagged_value
tagged.as_imm $tagged_value
tagged.as_ref $tagged_value

Conversion from tagged i63 anyref to anyref, if anyref is represented by 32 bits, would mean taking the lower 32 bits of the tagged i63 anyref value. The point here is that extracting values from, and wrapping them into, a tagged value does not require a memory access - it only uses arithmetic, logic, or shift instructions.
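
A minimal OCaml model of these operations (an illustration of the cost model only, assuming the i63 case on a 64-bit word, with an even integer standing in for a 2-aligned pointer): everything is shifts and masks, no loads or stores.

(* Model of the proposed tagged i63 <reftype> operations on a 64-bit word:
   lowest bit 1 = immediate (the i63 shifted left by one), lowest bit 0 =
   reference (assumed 2-aligned, so its low bit is already 0). *)
type tagged = int64

let from_imm (i : int64) : tagged = Int64.logor (Int64.shift_left i 1) 1L
let from_ref (p : int64) : tagged = p
let is_imm (t : tagged) : bool = Int64.logand t 1L = 1L
let as_imm (t : tagged) : int64 = Int64.shift_right t 1  (* arithmetic shift *)
let as_ref (t : tagged) : int64 = t

let () =
  let x = from_imm 21L in
  assert (is_imm x && as_imm x = 21L);
  assert (not (is_imm (from_ref 0x1000L)))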

In proposal #94, I think, introducing tagged means that you need to choose on a per-structref basis whether its reference array contains taggeds or primrefs. So a structref's type becomes structref <tagged-or-untagged> N M, where <tagged-or-untagged> is either some tagged <immtype> <reftype> or some <reftype>. The reason this needs to be chosen on a per-struct basis is that, for the GC to walk a structref, it needs to be able to distinguish references from immediates, and we want to avoid making non-users of tagged incur a tag check on every pointer access.

So, you could have a structref (tagged i63 primref) N M whose low-level representation of the pointer array is a sequence of 64-bit values which either represent a 63-bit scalar or a primref.

Okay, this looks to me like #94 with tagged (internally?) needs a header that describes to the GC the type of the values in the reference array. This is starting to look somewhat like a runtime type (in addition to the existing bounds). @timjs @gasche do you see a simpler way to introduce something like tagged in #94 (where "being like tagged" means that it allows representing 63-bit scalars in the reference array without generally forcing a pointer width on the WASM engine)?

@RossTate I haven't done the SOIL Initiative proposal justice so far by not working out the shape of the OCaml heap under it. I just spent some time looking through the spec, trying to piece things together. I suppose I am having difficulty wrapping my head around it because the SOIL Initiative model is so far away from my own mental model of the OCaml heap (and there are a lot of attributes on the types). It seems to me like the SOIL Initiative model is an attempt to make a very abstract, customizable heap model.

I guess, the block shapes of the existing compiler could look similar to this:

$uniform         := scheme.new (extensible (cases $immediate $value))

value            := scheme.new ((field $tag (unsigned 8) immutable) castable)
immediate        := scheme.new ((field $value (unsigned 31)))
block pointer    := (gcref $value)

tuple<n> block:     scheme.new (parent implicit $value)
                        ((field $tag (unsigned 8) immutable)
                         (field length $tuple_length (unsigned 32) immutable))
                        ?? - array of gcref $value
record<n> block:    same as tuple<n>
double block:       scheme.new (parent implicit $value)
                        ((field $tag (unsigned 8) immutable)
                         (field (unsigned 64) immutable))
int64 block:        same as double block
closure block:      scheme.new (parent implicit $value)
                        ((field $tag (unsigned 8) immutable)
                         (field funcref immutable)
                         (field $arity (unsigned 32) immutable)
                         (field funcref immutable)
                         (field length $environment_length (unsigned 32) immutable))
                        ?? - array of gcref $value
array block:        scheme.new (parent implicit $value)
                        ((field $tag (unsigned 8) immutable)
                         (field length $array_length (unsigned 32)))
                        ?? - array of gcref $value
double-array block: scheme.new (parent implicit $value)
                        ((field $tag (unsigned 8) immutable)
                         (field length $array_length (unsigned 32)))
                        ?? - array of (unsigned 64)
string block:       scheme.new (parent implicit $value)
                        ((field $tag (unsigned 8) immutable)
                         (field length $string_length (unsigned 32)))
                        ?? - array of (unsigned 8)
custom:             scheme.new (parent implicit $value)
                        ((field $tag (unsigned 8) immutable)
                         (field funcref immutable)
                         (field length $length (unsigned 32) immutable))
                        ?? - array of gcref $value
gasche commented 4 years ago

@RossTate @jakobkummerow you both seem to be assuming that tagged <immtype> <reftype> is a reference type -- that it needs to be a subtype of #94's primref/pointer or the MVP's anyref. But I don't see why that would need to be the case; it is perfectly fine to have explicit operations tagged.check, tagged.as_imm and tagged.as_ref to coerce into the immediate / reference representation, which do something at runtime (they are not witnessing a subtyping relation). In particular size(tagged i r) and size(r) do not need to be the same.

Edit: ah, @sabine said just the same thing slightly earlier. Apologies for the redundancy.

gasche commented 4 years ago

I think #94 should be fleshed out as a summarized proposal; right now people are discussing it as a single proposal while actually referring to points made in the middle of the discussion, which makes it confusing/difficult to keep track of what they mean.

@sabine: I think the simplest way to extend what-I-understand-as-#94 with tagged <immtype> <reftype> would be to have three sized regions in each structref: one for immediates (size I), one for references (size R), and one for tagged values (size T). If having three regions is not easy from an implementation perspective, one could require that either R=0 or T=0: either a block uses only pure references, or only tagged references, but not both.

aardappel commented 4 years ago

@sabine @gasche I would say that if tagged potentially being a different size from anyref would cause 3 rather than 2 size fields, that would be enough to prefer the simplicity of having just i31, in the context of #94.

In the context of the MVP, however (with coercion, not subtyping), allowing several scalar types could be worth it, as the cost of an i63 in a tagged is not greater than that of generally storing an i64. Making this a coercion seems generally saner than subtyping, and it accomplishes the goal of making only producers that use these types pay for their cost (in engines that do not use tagging already).

In fact, this modularizes the i31ref feature, to the point where you could go ahead with a GC MVP proposal that does not contain it, and make it an add-on (which may be useful if we want to debate further tagging schemes).

It will have the downside that, since access to the JS world is defined entirely in terms of anyref and a JS API allows us to store an "arbitrary" value, an OCaml producer, say, cannot simply pass its i31ref as-is and will need to box. That is probably acceptable.

RossTate commented 4 years ago

@RossTate @jakobkummerow you both seem to be assuming that tagged <immtype> <reftype> is a reference type -- that it needs to be a subtype of #94's primref/pointer or the MVP's anyref.

@sabine @gasche The arguments I gave are not about types, they're about values. The purpose of using smaller scalars like i31 and i63 is entirely to let scalar values occupy spaces primarily intended for reference values, whether those spaces are the non-linear fields of a structref or wherever anyref is used in the current MVP. The costs I gave have to do with the optimized representations for references that get pushed out by scalars, despite those representations likely being more useful for more languages (possibly including even OCaml).

sabine commented 4 years ago

@aardappel For OCaml, I think it is enough in #94 to have a runtime type for the heap block that specifies the type of values in the reference array (tagged with 64 bits, tagged with 32 bits, or primref). I agree that maintaining three size fields would be too much.

Explicit coercion between anyref and a tagged reference value seems the right thing to do. The cost model here is predictable, since tagging and untagging are ALU operations.

There's nothing wrong with having i31ref as a subtype under anyref, if that helps JavaScript interop, or even just to provide something that can be used right now. We can use that as a workaround to implement a 32-bit target for the OCaml compiler on WASM while we wait for tagged to implement a 64-bit target.

Though, tagged looks to me like the more correct thing to implement in the long run, precisely because languages that do not use tagged do not need to pay for it and because languages that do use it get the 63-bit tagged that they will use in their 64-bit targets.

@RossTate

The purpose of using smaller scalars like i31 and i63 is entirely to let scalar values occupy spaces primarily intended for reference values

Ah, the purpose you assume is more specific than what we actually need:

In the OCaml heap model, we let 31-bit and 63-bit scalar values occupy spaces that can be occupied by either scalars or references.

We do not need our scalars to fit in the same space as a generic reference value.

What we do need is a type that can store both scalars and references with reasonable efficiency. tagged as a type coercible to and from anyref looks like it fits the bill.

gasche commented 4 years ago

@RossTate I still don't understand your argument. In a design where tagged <immtype> <reftype> is available, anyref does not need to include i31ref, and can remain represented using 32 bits. The only assumption we need for tagged to make sense is that pointers are 2-aligned. (This could be made statically explicit with a type aligned 2 <reftype>, but we haven't been using this in the discussion.) Then if you use tagged i31 anyref, you get what is currently anyref (in the current MVP that includes i31ref), and this can fit in a 32-bit word. If you use tagged i63 anyref, you get a new type that requires 64 bits of storage space. The fact that this type can be expressed does not change any property of anyref itself. Yes, if you now start to use tagged i63 anyref heavily in your programs, those programs will consume more memory than if they used tagged i31 anyref. The choice is entirely left to the producer of the code.

The existence of i64 or f64 in wasm does not mean that anyref has to be 64 bits wide, and tagged i63 anyref is just the same.

In the current MVP proposal, anyref is the only type that can efficiently represent either references or scalars, so it is the only sensible choice for a "universal type" in a language that needs a uniform representation for both scalars and references. If tagged is available, then you get several sensible choices of "universal type", with different sizes: tagged i31 anyref, tagged i63 anyref, but also more precise types like tagged i31 (ref (struct i32)) or tagged i63 (ref (struct i32)) (in the case of OCaml, where blocks are known to start with a block tag). Users (or in our case, compiler authors) can choose the type they prefer for their usage.

(In my intermediate proposal (ref i31 <reftype>), you can only tag with i31, but retain the ability to use either anyref or more precise types as the reference type.)

RossTate commented 4 years ago

Y'all are assuming that anyref or primref do not have anything better to do with those lower 2 bits, and so there's implicitly free space making it possible for tagged i31 ... to have room for a reference and for a scalar. What I am saying is that there might be better uses for those lower 2 bits rather than to leave room for scalars. For example, primref might want to use the bits to distinguish at least funcref and structref values in order to permit fast casting. And anyref might want to use bits completely differently to support NaN boxing.

In other words, while tagged i31 primref might make boxing integers faster for OCaml (by not boxing them), it has the cost of making either casts to structref or casts to funcref slower for OCaml and for everyone else.

aardappel commented 4 years ago

@sabine

There's nothing wrong with having i31ref as a subtype under anyref

It means we commit to checking that bit in any anyref we want to access as ref, forever, even in Wasm engines that don't also have a JS implementation, and could otherwise store/access anyref as a naked pointer.

gasche commented 4 years ago

@RossTate we could think about having, for example, (tagged structref funcref). In other words, maybe the strategy of exposing tagging structures in types can also fit your favorite needs.

My impression is that (ref i31 <reftype>) and (tagged <type> <type>) are interesting "alternatives to i31ref" (they feel much more realistic to me than monomorphizing everything), and they would each deserve a dedicated PR to expose a design and foster further discussion.

(Again, the primref PR should really write its design down to summarize the #94 discussion. This would also help with discussing (ref i31 <heaptype>) and (tagged <type> <type>) in the context of that lower-level proposal.)

RossTate commented 4 years ago

we could think about having, for example, (tagged structref funcref). In other words, maybe the strategy of exposing tagging structures in types can also fit your favorite needs.

@gasche, you seem to be describing a heterogeneous system, meaning there is no global tagging structure. But then you need structref to say what kind of reference fields it has, e.g. (tagged structref funcref), but then the structrefs in those fields presumably follow the same tagging convention, leading to recursive types. See this proposal for how to carry that out.

(Again, the primref PR should really write its design down to summarize the #94 discussion. This would also help with discussing (ref i31 <heaptype>) and (tagged <type> <type>) in the context of that lower-level proposal.)

We're working on it, but it will not have a structref type that is parameterized by the type of its reference fields. One of the benefits of the proposal is its simplicity, and furthermore that complexity doesn't solve problems without introducing recursive types (see above).

gasche commented 4 years ago

Out of curiosity, I manually boxed an OCaml program performing integer arithmetic, in order to evaluate the performance overhead of systematic integer boxing on the current runtime. (The function I used is what I call "sumtorial": like factorial but with a sum instead of a product, basically a complicated way to compute n*(n+1)/2.)

let sumtorial n =
  let rec loop acc = function
    | 0 -> acc
    | n -> loop (acc + n) (n - 1)
  in loop 0 n

module BoxedInt = struct
  type t = Int of int
  let (+/) (Int a) (Int b) = Int (Stdlib.(+) a b)
  let (-/) (Int a) (Int b) = Int (Stdlib.(-) a b)
end

let boxed_sumtorial n =
  let open BoxedInt in
  let rec loop acc = function
    | Int 0 -> acc
    | n -> loop (acc +/ n) (n -/ Int 1)
  in loop (Int 0) n

let locally_unboxed_sumtorial n =
  let open BoxedInt in
  let rec loop acc = function
  | Int 0 -> acc
  | Int n ->
    let (Int a) = acc in
    loop (Int (a + n)) (Int (n - 1))
  in loop (Int 0) n

boxed_sumtorial is an exact translation of the sumtorial definition, with boxed integers instead of unboxed integers. locally_unboxed_sumtorial is a version where I manually inlined the arithmetic constants and operations and applied local boxing/unboxing simplification.

On my machine, the boxed version is precisely three times slower than the unboxed version. The locally_unboxed_sumtorial version is not much faster than the boxed one (3.6s rather than 3.9s on a test), because for this program it only eliminates redundant unboxing; there is no redundant boxing that can be eliminated locally. (Of course we could change the calling convention of the function to take unboxed integers, but that is not a local transformation anymore.)
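
For reference, a harness along these lines can reproduce that kind of measurement (my own sketch, not necessarily what was used above; Sys.time and the iteration count are arbitrary choices):

(* Hypothetical timing harness for the three variants above. *)
let time label f =
  let t0 = Sys.time () in
  let r = f () in
  Printf.printf "%s: %.2fs\n" label (Sys.time () -. t0);
  r

let () =
  let n = 1_000_000_000 in
  ignore (time "unboxed" (fun () -> sumtorial n));
  ignore (time "boxed" (fun () -> boxed_sumtorial (BoxedInt.Int n)));
  ignore (time "locally unboxed" (fun () -> locally_unboxed_sumtorial (BoxedInt.Int n)))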