WebAssembly / gc

Branch of the spec repo scoped to discussion of GC integration in WebAssembly
https://webassembly.github.io/gc/

Alternatives to i31ref wrt compiling parametric polymorphism on uniformly-represented values (OCaml) #100

Closed sabine closed 1 year ago

sabine commented 4 years ago

As i31ref doesn't seem to be a unanimously-agreed-on part of the GC MVP spec (see https://github.com/WebAssembly/gc/issues/53), I am very interested in discussing what the concrete alternatives to it are in the context of parametric polymorphism on uniformly-represented values. (I would appreciate it if the answer doesn't immediately read as "use your own GC on linear memory".)

To give some (historical) context: why does OCaml use 31-bit integers in the first place? Generally, it is possible to have a model of uniform values where every value is "boxed" (i.e. lives in its own, individually allocated heap block). Then every value is represented by a pointer to the heap and can be passed in a single register when calling a function. A heap block always consists of a header (for the GC) and a sequence of machine words (values). From an expressiveness standpoint, this is fine. However, when even simple values such as integers are always boxed (i.e. require a memory access to "unbox" them), performance suffers. The design constraints for the representation of unboxed integers were: a) unboxed integer values must be passable in a single register, b) the GC must be able to distinguish (when crawling the heap) whether a value in a heap block represents an unboxed integer or a pointer to another heap block, and c) the scheme should be as simple as possible for the sake of maintainability. In OCaml, the compromise between performance and simplicity that was chosen is to unbox integer values by shifting them left by one bit and adding one. Since pointers are always word-aligned, this makes it trivial to distinguish unboxed integers from values that live behind heap pointers. While this is not the best-performing solution (because all integer arithmetic has to operate on tagged values), it is a simple one.
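As an illustration, the tagging scheme can be sketched in OCaml itself. The `tag`/`untag`/`tagged_add` helpers are hypothetical names purely for exposition; the real compiler performs these operations at the machine-word level:

```ocaml
(* Pointer-tagging sketch: an integer n is represented as 2*n + 1, so
   the lowest bit is 1 for integers and 0 for (word-aligned) pointers. *)
let tag n = (n lsl 1) lor 1      (* native int -> tagged machine word *)
let untag w = w asr 1            (* tagged machine word -> native int *)

(* Arithmetic must compensate for the tag bit. For addition of two
   tagged words: (2a + 1) + (2b + 1) - 1 = 2(a + b) + 1. *)
let tagged_add a b = a + b - 1

let () =
  assert (tagged_add (tag 2) (tag 3) = tag 5)
```

This is why the issue text says all integer arithmetic "operates on tagged values": every primitive has to be expressed in terms of the shifted representation.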

Note that there exist compilation targets of OCaml that use 32-bit integer arithmetic, and the OCaml ecosystem largely accounts for that. Having libraries also consider the case where integers have 64 bits seems feasible. Some code will get faster if we can use native 32-bit integer arithmetic.

Ideally, for the sake of simplicity, we would like to emit one type Value to WebAssembly, which represents an OCaml value, which is either:

- an unboxed integer, or
- a pointer to a heap block.

A heap block of OCaml traditionally consists of

- a header (with information for the GC), and
- a sequence of values (machine words).
The most trivial representation (i.e. the one matching most closely the existing one) that I see when I look at the MVP spec is an anyref array that holds both references to other heap blocks and i31ref values. So, from the viewpoint of having to do as little work as possible in order to compile to WebAssembly and keeping the implementation simple, i31ref is certainly looking very attractive for an OCaml-to-WASM compiler MVP.
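Modelled in OCaml terms (the type and constructor names here are hypothetical, purely for illustration), each field of such a heap block holds either a small scalar or a reference to another block:

```ocaml
(* Hypothetical model of the uniform representation on Wasm GC. *)
type value =
  | Imm of int        (* would map to an i31ref *)
  | Block of block    (* would map to a reference into the GC heap *)
and block = { tag : int; fields : value array }  (* the anyref array *)

(* e.g. the OCaml list [1; 2]: a cons cell is a tag-0 block of two
   fields, and the empty list [] is the immediate 0. *)
let nil = Imm 0
let l12 =
  Block { tag = 0;
          fields = [| Imm 1; Block { tag = 0; fields = [| Imm 2; nil |] } |] }
```

Under this scheme each field is a single anyref-like slot, and i31ref is what lets the `Imm` case avoid a heap allocation.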

In https://github.com/WebAssembly/gc/issues/53#issuecomment-546252669, @rossberg summarized:

For polymorphic languages, there are these [heap representations]:

  1. Pointer tagging, unboxing small scalars
  2. Type passing, unboxing native scalars, runtime type dispatch
  3. Type passing, unboxing native scalars, runtime code specialisation
  4. Boxing everything
  5. Static code specialisation

From OCaml's perspective, (2) and (4) don't seem acceptable as long-term solutions in terms of performance. Compared to those, compiling to WASM linear memory and shipping our own GC seems the more attractive choice.

So, that leaves (3) and (5).

(3) seems fairly complex. If the WebAssembly engine did the runtime code specialization, or if we could reuse some infrastructure from another, similar language, it could be worthwhile for us to work with that. It currently seems unlikely that OCaml in general will switch to (3) in the foreseeable future, unless we can come up with a simple model of runtime code specialization. I expect that implementing runtime code specialization in a WebAssembly engine goes way beyond an MVP, so it seems unlikely this will happen.

(5) is simpler than (3) in the sense that we do not have to ship a nontrivial runtime. If we analyze the whole program in order to emit precise types (struct instead of anyref array) for our heap blocks on WebAssembly, we wouldn't need to use i31ref, and we could reap the other benefits of whole-program optimization (e.g. dead-code elimination, operating on native unboxed values, no awkward 31-bit arithmetic). Still, this will be a sizeable amount of work (possibly too much to do right away). I also can't say how bad the size of the emitted code will be in terms of all the types we need to emit: instead of emitting a single Value type, we need to emit one struct type for every "shape" of heap block that can occur in the program. To keep this manageable, we need to unify all the types whose heap block representations have the same shape.

Then, static code specialization kills one nice feature of OCaml: separate compilation of modules. However, instead of doing static code specialization before emitting WebAssembly, maybe it is possible to implement a linker for our emitted WebAssembly modules that does code specialization at link time, if we emit some additional information into the compiled WebAssembly modules? Such a linker could be interesting to other languages in a similar position as well. Obviously, link time will be slower than we are used to. I haven't thought this through in detail at all, but it seems likely that these issues are manageable if enough effort is put into them.

Edit: while the previous paragraph sounds fairly optimistic, looking into whole-program monomorphization (turning one polymorphic function into several non-polymorphic ones) more closely, it is definitely not trivial to implement. The types that we would need for this are no longer present at the lower compilation stages. Looking at the MLton compiler (a whole-program-optimizing compiler for Standard ML), it seems to be a good idea to monomorphize early, in order to be able to optimize based on the types of parameters. Features like GADTs, and the ability to store heterogeneous values or polymorphic functions in hash maps (or other data structures), do not make it simpler. It looks to me like this would mean an almost full rewrite of the existing compiler, and it is not obvious whether every function can be monomorphized (without resorting to runtime type dispatch).

Are we missing something here, are there other techniques that we have been overlooking so far? Feel free to drop pointers to good papers on the topics in general, if you know some.

Also, I am very interested what perspective other languages with similar value representations have on this and whether there is interest in collaborating on a code-specializing and dead-code eliminating linker.

RossTate commented 4 years ago

Although you've ruled it out, I would actually suggest (4) for OCaml, albeit with some optimization.

First, the optimization I have in mind is to have OCaml function definitions operating on int compile to wasm function definitions operating on i32. For polymorphic uses of these functions, have another definition that works on whatever "uniform" representation of OCaml values you use, simply calling the optimized function with appropriate (un)boxing. Or more generally, when a value can be statically determined to be an int, represent it as i32.
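A sketch of this scheme, written as OCaml source for illustration (the `uvalue` type and the `box`/`unbox` coercions are hypothetical stand-ins for the uniform Wasm representation and its coercion instructions):

```ocaml
(* Monomorphic version: operates directly on native ints
   (this would compile to a Wasm function over i32). *)
let add_mono (x : int) (y : int) : int = x + y

(* Hypothetical uniform representation and its coercions. *)
type uvalue = Obj.t
let box (n : int) : uvalue = Obj.repr n
let unbox (v : uvalue) : int = Obj.obj v

(* Wrapper using the polymorphic calling convention: it (un)boxes the
   arguments and delegates to the optimized monomorphic definition.
   This is the version handed to polymorphic call sites. *)
let add_poly (x : uvalue) (y : uvalue) : uvalue =
  box (add_mono (unbox x) (unbox y))
```

Monomorphic call sites invoke `add_mono` directly on unboxed values; only polymorphic uses pay the coercion cost via `add_poly`.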

There are a few reasons I suggest this:

  1. A number of other languages have found this approach effective enough. See various JVM languages like Java, Kotlin, and Scala that all take this approach.
  2. It's still possible the MVP will have i31ref. It's also possible it'll instead have something like i32ref, which would generally be boxed on 32-bit machines and unboxed on 64-bit machines.
  3. Even if the MVP doesn't have these things, the MVP is just the first release. WebAssembly will advance, possibly in ways that would make specifically 31-bit integers often (if not always) be unboxed for modules that specifically request it.

So in short, I suspect the short-term costs of this approach are more acceptable than you might expect (with some local/composable optimization techniques), and I suspect it is more likely to adapt well to integrate the improvements that will be made to WebAssembly over time.

There's also a meta-point to consider: having WebAssembly modules compiled from OCaml in this manner will give the CG realistic programs to experiment with and analyze, which can be used to determine if it's worth the effort (at all levels of the process) to give modules more control over their low-level representations. Right now we can only hypothesize.

That all said, you know your specific needs much better than I, and even if this approach might work well for the OCaml community broadly, it still might not be well-suited to more specialized concerns you might have.

sabine commented 4 years ago

Fair enough, (4) is close enough to (1) to look like the next best solution in terms of effort needed and we could use i32 arithmetic. I'll look into this in more detail to see if it is indeed so straightforward.

sabine commented 4 years ago

I'm wondering if I can do this by first compiling to "WASM GC MVP with i31ref" and then translating that code to "WASM GC MVP without i31ref". At least this way, I will not have to go all the way back and box all these integers again when a feature that is equivalent to i31ref arrives. Has anyone done that?

RossTate commented 4 years ago

Even with i31ref, you have to coerce integers to and from i31ref using the instructions i31.new and i31.get_u/i31.get_s. So, if the design of i31ref changes, it should be pretty trivial to change your compiler to substitute in the appropriate coercion instructions.

The more difficult thing to change would be using 31-bit vs. 32-bit arithmetic. It sounds like putting in the effort for 31-bit arithmetic is only worth it if i31ref exists. Since that's not guaranteed, I'd use easier 32-bit arithmetic for now, and then put in the effort to switch to 31-bit arithmetic only if/when i31ref has been finalized.

rossberg commented 4 years ago

@sabine:

Then, static code specialization kills one nice feature of OCaml: separate compilation of modules.

That would be the smallest problem. More importantly, as you note in your edit, static specialisation kills any form of first-class polymorphism, of which OCaml has plenty (polymorphic methods and record fields, first-class modules, GADTs, ...). So I think this approach is simply impossible to use. The same is true for any other language with a rich enough type system.

@RossTate:

For polymorphic uses of these functions, have another definition that works on whatever "uniform" representation of OCaml values you use, simply calling the optimized function with appropriate (un)boxing.

That is not a scalable approach. In languages like OCaml, you have polymorphic type inference, and by design that results in way more polymorphism than in traditional languages. You also use lots of abstract types. In particular, many functions are polymorphic in multiple independent types, and refer to multiple abstract types. To be efficient, the approach you suggest would require an exponential number of specialisations to be generated upfront for each such function.

There also is the practical problem of porting. If the GC proposal does not support established implementation techniques, but instead requires a major rewrite of a pre-existing compiler and runtime, then it will not be a plausible target for many existing languages.

RossTate commented 4 years ago

The advice I gave here is specific to OCaml. There is only one type we were talking about doing this for: int. You can do these coercions per use site, and you can consolidate coercions if you want. These are conceptually coercions between monomorphic and polymorphic calling conventions. There will need to be coercions between curried and uncurried calling conventions anyway. That is, an int -> int -> int can and should be called with two arguments (i.e. uncurried), but it can also be passed to a function of type (int -> 'a) -> int -> 'a, in which case it must be called with one argument (i.e. curried). That polymorphic function will then return a function using a curried calling convention, and so its result often needs to be coerced to an uncurried calling convention. That is, unless you are suggesting that all functions always use the curried calling convention, since WebAssembly does not currently support the sort of variadic arguments that would be necessary to eliminate these coercions. So my suggestion just hooks into infrastructure that would need to be in an OCaml-to-WebAssembly compiler anyway.
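For concreteness, the currying situation described above in plain OCaml (the names here are made up for illustration):

```ocaml
(* Can and should be compiled to take both arguments at once
   (uncurried calling convention). *)
let add (x : int) (y : int) : int = x + y

(* A polymorphic higher-order consumer that only knows the curried
   view: it applies its argument to a single value. *)
let apply_once (f : int -> 'a) (x : int) : 'a = f x

(* Passing add here forces the curried convention: the first
   application must yield a closure over x, so the compiler has to
   insert a coercion/wrapper around the uncurried code. *)
let add2 : int -> int = apply_once add 2
let five : int = add2 3
```

The claim is that the same wrapper machinery can also absorb the int/uniform-representation coercions.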

gasche commented 4 years ago

There has been a good amount of research on type-directed or shape-directed unboxing, supported by code specialization. For a recent example, see the work on call-graph-based specialization of (boxed) polymorphic functions in the 2017 thesis of Dmitry Petrashko on Scala. But such work typically requires a fair amount of sophistication. In contrast, having an efficient way to "box" immediate scalars without going through the heap is easy for backends to allow, and fairly easy to compile into for types that fit the smaller width. The two approaches are complementary (clever specialization optimizations have other benefits), and yes, boxing everything is of course possible if the low-level substrate does not provide anything better, but getting reasonable performance then requires much more sophistication on the user side.

Many of the successful language implementations work by keeping a tight check on their complexity budget: they go for the 20% of sophistication that brings 80% of the benefits in as many areas as possible. OCaml was faster than most MLs, and Chicken Scheme than most Schemes, not through extreme optimization, but rather thanks to a few very important optimizations (e.g. static function calls) and good data-representation choices that ensured that typical programs were fast by default.

It is also possible to build fast systems that are slow by default but fast through excellent, sophisticated engineering (good JVMs or CLR implementations, GHC, Graal+Truffle, etc.), but I agree with @rossberg that it is important to ensure that the 20%-80% approaches are available first.

RossTate commented 4 years ago

The issue is "alternatives to i31ref", and the concerns are fairly specific to OCaml due to its use of specifically 31-bit arithmetic. (Haskell, for example, only guarantees 30-bit precision because they have found reserving 2 bits for GC to be useful.) There is already another very long thread, #53, about i31ref. Let's not pull this thread into an argument about i31ref, an argument that cannot be conducted well if we have not even considered the alternatives, which is what this thread is about.

rossberg commented 4 years ago

Consider a function like this one from the Map interface:

val mapi : (key -> 'a -> 'b) -> 'a t -> 'b t

This is polymorphic in two type variables and two abstract types. Each of them could either be an integer or something else. So even with just int, that makes 2^4 = 16 different ways in which you'd have to specialise the code according to your suggestion (8 if the compilation scheme allows fixing the export type t).

Or you reduce the number of specialisations, lump several dimensions together, and pay the extra cost of frequent boxing/unboxing conversions for the rest. Many variations of this have been investigated in highly polymorphic languages, and there are reasons few are used in practice. Leroy (among others) actually had a line of papers researching unboxing for polymorphic languages, yet OCaml, like many similar languages, uses a uniform representation instead. He observed that the conversion overhead often is higher than the benefit of unboxing, and that a minimal unboxing scheme produces better results for typical profiles.

I'm not intimately familiar with how OCaml's native code compiler handles currying, but the byte code compiler handles it through clever dynamic stack techniques that are unrelated to compile-time unboxing.

rossberg commented 4 years ago

the concerns are fairly specific to OCaml due its use of specifically 31-bit arithmetic.

No, they're not, and it's getting a bit boring at this point that I have to keep refuting that. So again: plain ints are just one particular use case. The general purpose is unboxing small scalars of any sort, for which there are many, many use cases, in many, many languages. (And of course, 30 bit ints would fit just fine as well.)

jayphelps commented 4 years ago

As a bystander who follows these threads I wanted to share some feedback that this aggressiveness is quite off-putting.

RossTate commented 4 years ago

but the byte code compiler handles it through clever dynamic stack techniques that are unrelated to compile-time unboxing

Unfortunately, WebAssembly does not support these techniques. So I believe coercions will be necessary instead. @sabine, have you considered this problem with currying?

To clarify, I was not suggesting specializing polymorphic functions based on their type arguments. I was suggesting coercing monomorphic functions that work specifically on integers to polymorphic functions operating on the uniform representation (when used polymorphically). This could be done with an augmentation of the coercion system I was suggesting above for dealing with currying.

rossberg commented 4 years ago

To clarify, I was not suggesting specializing polymorphic functions based on their type arguments. I was suggesting coercing monomorphic functions that work specifically on integers to polymorphic functions

Oh, okay, but that's probably even worse. You would have to box/unbox in all first-order contexts where you pass a value from polymorphic to monomorphic functions and vice versa, which is like all the time. And you'd still have the same problem: for a function of type int->int->int, there are 8 different possibilities of a polymorphic context to which it could be passed in a higher-order fashion. All would have a different calling convention under the approach you suggest. So either your function has 8 different specialisations, or you're creating wrapper functions at each polymorphic use site.

aardappel commented 4 years ago

@sabine thanks for summarizing things so clearly.

First, until we invent JIT and other features, the current Wasm, even when including the current GC proposal, is an entirely static, dare I say "closed world" system, with AOT compilation to a single .wasm, with concrete types at the boundary. By definition, all forms of polymorphism, modularity, separate compilation etc will be resolved by the time that .wasm gets baked, no matter what the language likes to present beforehand. This alone should make (5) by far the most realistic option for most languages that care about speed, again, in the current way Wasm works.

There seems to be a lot of default fear of (5): that it is going to result in unreasonable code bloat. If you generate specializations under a closed-world assumption, you only generate the ones that are actually used. This enables more inlining of single-use functions (not further increasing code size), followed by much more aggressive optimisation of inlined code working directly on specific naked (scalar) types, often shrinking the code back down significantly. It turns indirect calls into static calls, helping greatly with dead-code elimination. LLVM and Binaryen can do endlessly more on code like this than on code that stays generic. In my personal experience working with (5) a lot, the amount of code bloat can be remarkably small, and the resulting speedups due to code "collapsing" impressive.

But @rossberg is correct that anyone that prefers (5) can make this happen, regardless of the presence of i31ref. So why would we want i31ref additionally?

I'd like to see an example of polymorphism that is impossible to compile down to specialized code using (5), and requires a struct of all anyref to work. I think, by definition, that such an example does not exist (given that we don't have any dynamic code features in Wasm that would make this statically impossible).

There will be people who have code size as their #1 concern, and don't believe my story above that it can actually help reduce code size, when done right. That's fine, because it is hard to make claims about this in the absence of specific languages with specific compilers and specific applications.

Finally, are there downsides to having i31ref when you don't use it? Under what circumstances does a Wasm implementation have to check "is this thing an i31ref?" before accessing a pointer? If anyref is always a pointer, always pointing to a heap object where the initial bytes always have the same layout (e.g. a type id), I can imagine that would lead to speedups. If indeed there are frequent checks required for i31ref, that to me would be the strongest reason for not wanting to have it.

rossberg commented 4 years ago

@aardappel, repeating what I pointed out above, and I believe many times before: static type specialisation does not work in a language with first-class polymorphism or polymorphic recursion.

Simplest possible example in OCaml:

type poly = {f : 'a. 'a -> unit}
let g {f} = f 1; f "boo"

There is no way you can statically specialise anything here, since you don't know what polymorphic definition you are calling. It's completely dynamic. A client of g could pass anything for f, it could be taken from a list, or any dynamically computed data structure. Inversely, the site defining an f generally has no way of knowing where it flows and what types it will be used at.

The same is true for generic methods in a language like C#, which is why the CIL performs jit-time type specialisation. And it is the reason why C++ does not allow "template virtual methods", because its simplistic compilation model cannot support them.

Static specialisation also does not work in a language with polymorphic recursion, because the set of instantiations can be statically unbounded.
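For concreteness, the standard OCaml shape of polymorphic recursion is a nested datatype (the names here are illustrative):

```ocaml
(* Each Nested level wraps the element type in one more list, so the
   recursive call is at a different (larger) type. *)
type 'a nested =
  | Flat of 'a list
  | Nested of 'a list nested

(* The explicit 'a. annotation is what permits polymorphic recursion:
   depth at type 'a calls depth at type 'a list, so the set of type
   instantiations reachable at runtime is statically unbounded. *)
let rec depth : 'a. 'a nested -> int = function
  | Flat _ -> 0
  | Nested n -> 1 + depth n
```

No finite set of static specialisations of `depth` can cover all the instantiations such a definition may be used at.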

If I could make a wish, then that folks in these discussions could accept the fact that there are very good reasons why certain classes of languages are implemented in certain ways, that there has been decades of engineering and research into it, that we are not smarter than all the many folks who did this, and that we need to support a mapping for those compilation techniques instead of hoping for some yet undiscovered trick to make these requirements go away. Please?

gasche commented 4 years ago

First, until we invent JIT and other features, the current Wasm, even when including the current GC proposal, is an entirely static, dare I say "closed world" system, with AOT compilation to a single .wasm, with concrete types at the boundary. By definition, all forms of polymorphism, modularity, separate compilation etc will be resolved by the time that .wasm gets baked, no matter what the language likes to present beforehand

I'm confused: doesn't wasm allow producing modules and linking them together? I would expect compilers to use those to produce modules at the natural separate-compilation granularity of their source language. In particular, I would expect to compile each module/crate/package separately, without making assumptions on how other modules are going to use it. Of course you can still generate WASM code only after a pass of link-time optimization, but I would naively expect that this can lead to code-distribution issues (if the generated WASM for my library depends on the clients it's compiled with, code caching etc. is going to be more delicate to handle).

I'd like to see an example of polymorphism that is impossible to compile down to specialized code using (5), and requires a struct of all anyref to work.

It's easy to have examples where specialization leads to impressive blowups in code size. I mentioned the Scala specialization work by Dmitry Petrashko earlier; it starts by evaluating previous 2014/2015 work on call-graph-based type specialization in Scala on the codebase of the Dotty compiler, and finds that on average each method would be specialized 22 times.

Then there is the example of polymorphic recursion, where the set of different type instantiations depends on runtime data (in particular, it may be arbitrarily large). (In most cases I'm familiar with, the number of different "shapes" may be bounded, especially if you only distinguish immediates vs. pointers.)

But then what about union types? Many language runtimes use representations that are either an immediate word or a pointer, with tagging to distinguish the two. This is ubiquitous for example in implementations of ML, Haskell, Scheme and what not. You can very easily have tuples or arrays of such immediate-or-pointers values, where the immediate-or-pointerness of any element is only known at runtime and can change on mutation. How would one operate on this data if functions are expected to know statically whether they take an immediate or a pointer value?

Under what circumstances does a Wasm implementation have to check "is this thing an i31ref?" before accessing a pointer? If anyref is always a pointer, always pointing to a heap object where the initial bytes always have the same layout (e.g. a type id), I can imagine that would lead the speedups.

I'm not very familiar with Wasm (I was drawn into this discussion by chance), but I believe that:

If I understand correctly, Wasm implementations try to be safe from undefined behavior, so in particular they are careful to check that pointers go into allowed memory before dereferencing them. This monitoring discipline, and in general wasm sandboxing/CFI features, are going to completely drown out the cost of checking bitpatterns on a word (even if you decided to do checks or masking on each dereference).

aardappel commented 4 years ago

@rossberg your example is essentially a function pointer to a generic function if I'm not mistaken, which is indeed not possible in languages that rely entirely on monomorphic specialization. I'd expect a compiler to choose to force boxing on args for such functions (causing an extra heap alloc for f 1).

I don't see why generic functions that are not used in this very dynamic fashion (which I would certainly hope are the vast majority) should suffer from the mere presence of this possibly very dynamic use, and certainly not why Wasm as an ecosystem should pay for this cost (if the mere presence of i31ref introduces conditionals). I'd hope the cost would only be borne by functions that use this functionality.

I am well aware that there are endless people who spend longer time looking at this than me, but just an "appeal to authority" is not going to be sufficient to stop me from at least questioning this direction.

RossTate commented 4 years ago

If I could make a wish, then that folks in these discussions could accept the fact that there are very good reasons why certain classes of languages are implemented in certain ways, that there has been decades of engineering and research into it, that we are not smarter than all the many folks who did this, and that we need to support a mapping for those compilation techniques instead of hoping for some yet undiscovered trick to make these requirements go away. Please?

The Glasgow Haskell Compiler, which presumably is considered a production compiler within this class of languages, does not unbox Int even though the Haskell semantics specifically permits Int to have only 30 bits (a number they arrived at because other runtimes that do unbox Int found it useful to reserve two bits for GC). From what I understand, they did so because they found that algebraic operations were so common that it was more useful to embed the algebraic-constructor tag in the low bits of the pointer. So the experience of Haskell developers is that either custom pointer tagging should be preferred over i31ref for better performance or i30ref should be preferred over i31ref for better GC strategies. Having looked through a number of language implementations, this sort of gap seems pretty common. Unfortunately for OCaml, it is an outlier because I believe it is the only language anchored specifically around 31-bit integers. This is why I thought it would be useful to focus specifically on advice for OCaml.

aardappel commented 4 years ago

@gasche

I'm confused, doesn't wasm allow producing modules and linking them together? I would expect compilers to use those to produce modules at the natural separate-compilation granularity of their source language.

Yes, but the types at the edges are currently very basic types that may well not make it possible to express the full set of type features of a language like ML, meaning separate compilation would only be "safe" if compiled from one version of the program, essentially not making it separate compilation anymore. This may change in the future but is TBD. Also, current linking of multiple Wasm modules is runtime-only, though static linking is planned.

and found out that on average each method would be specialized 22 times.

Average? Each method in the entire compiler? Meaning many methods would be specialized hundreds of times (to account for the methods only used once)? I find it hard to understand how that is possible. Does this account for optimization and DCE of the now much more static, blown-up code?

Would be more useful to know how much a codebase would blow up in total, for example the ratio of AST nodes of a large program before and after specialization. In my tests for example, in cases with heavy nesting of higher order functions, that ratio was at most 2x (after optimization). All very anecdotal of course.

If I understand correctly, Wasm implementations try to be safe from undefined behavior, so in particular they are careful to check that pointers go into allowed memory before dereferencing them. This monitoring discipline, and in general wasm sandboxing/CFI features, are going to completely drown out the cost of checking bitpatterns on a word (even if you decided to do checks or masking on each dereference).

An anyref is entirely under the control of the Wasm implementation, and thus typically would need no checking of any kind before being dereferenced as a pointer. Checking its bits, and potentially branching (and maybe mispredicting), would be an entirely additional cost that i31ref would bring. So no, not quite free.

(sandboxing is only done for linear memory pointers, but even there it has no cost typically due to use of memory mapping features)

rossberg commented 4 years ago

@aardappel:

I don't see why generic functions that are not used in this very dynamic fashion (which I would certainly hope are the vast majority) should suffer from the mere presence of this possibly very dynamic use

You don't know at its definition site which polymorphic function ends up being used that way, and with what degree of polymorphism. The composition can happen somewhere else entirely, e.g. in a different module. And it can be partially instantiated, i.e., you plug in a polymorphic function somewhere where a less polymorphic (but not monomorphic) type is expected. There are many degrees of freedom, and you generally need a compilation scheme that is both efficient and compositional for all of them.

In the limit, the possibilities for polymorphic composition are almost similar to those in a dynamic language, which is why they make similar representation choices. (Though an important difference is that there usually is no observable type dispatch, because polymorphism is fully parametric.)

@RossTate:

As I have stated many times before, whether it's i31 or i30 is largely immaterial. This is an internal representation type. For the vast majority of use cases its max size doesn't matter. However, we can't easily make extra pointer bits available to user space across engines anyway, so AFAICS, we can just as well provide 31 bit range for tagged ints. But if there are reasons to pick another width, then that's totally fine as well. I just haven't heard them yet.

There are many languages that use (wordsize-1) bit integers in their implementations one way or the other. Whether a language implementation exposes that to users is a completely separate question. Some do (not just OCaml), some also expose values like 30, 29, or 28. But where they do, the language definition typically does not guarantee an actual size. For example, OCaml explicitly states that int_size can have other values. So again, size doesn't matter.

@aardappel:

Yes, but the types at the edges are currently very basic types that may well not make it possible to express the full set of type features of a language like ML, meaning separate compilation would only be "safe" if compiled from one version of the program, essentially not making it separate compilation anymore.

Huh? I'd fully expect that a language with a proper module system can (and should!) compile its modules to Wasm modules separately, and without any loss of generality. Wasm's module system allows you to import/export anything, so such a compilation scheme should be perfectly possible.

An anyref is entirely under the control of the Wasm implementation, thus typically would need no checking of any kind before dereferencing as a pointer. Checking its bits, and potentially branching (and maybe missing) would be an entirely additional cost int31ref would bring. So no, not quite for free.

You can't deref an anyref. You first need to cast it to something concrete. That necessarily involves a check, with or without i31ref.

jakobkummerow commented 4 years ago

Checking its bits, and potentially branching (and maybe missing) would be an entirely additional cost int31ref would bring. So no, not quite for free.

You can't deref an anyref. You first need to cast it to something concrete. That necessarily involves a check, with or without i31ref.

That is indeed the one place where adding i31ref has a non-zero cost: when ref.test or ref.cast or br_on_cast operate on a value that's statically typed as anyref or eqref, then before loading and comparing that value's type descriptor, they first have to check the pointer's tag bit. Given that the type descriptor check (which needs to happen with or without i31ref's existence) involves a load from memory, and the tag bit check doesn't, I'm confident that the additional overhead is negligible. Any code that has more specific static knowledge than eqref is entirely unaffected by the existence of i31ref. Code that passes around anyref values (without downcasting them) is just as unaffected.
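
The check ordering described above can be sketched as follows. This is an OCaml model of hypothetical engine behavior, with a variant type standing in for the word-level tag bit; all names here (cast_to, type_descriptor) are invented for illustration, none of this is actual engine code:

```ocaml
(* Model of ref.cast on a value statically typed as anyref/eqref:
   first the cheap tag-bit check (no memory access), then the
   type-descriptor load (which is needed with or without i31ref). *)

type heap_object = { type_descriptor : int; fields : int array }

(* An engine word is either a tagged scalar or a pointer; the variant
   stands in for "is the low bit set?". *)
type word =
  | I31 of int           (* low bit set: unboxed scalar, no heap object *)
  | Ref of heap_object   (* low bit clear: a real heap reference *)

let cast_to expected_descriptor w =
  match w with
  | I31 _ ->
      (* Rejecting an i31 needs no memory load at all. *)
      Error "cast failed: value is an i31"
  | Ref obj ->
      (* Only for real references do we load the descriptor from memory. *)
      if obj.type_descriptor = expected_descriptor then Ok obj
      else Error "cast failed: wrong type"
```

The point of the sketch is the order of operations: rejecting an i31 happens before, and instead of, the type-descriptor load.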

sabine commented 4 years ago

As a bystander who follows these threads I wanted to share some feedback that this aggressiveness is quite off-putting.

I don't think anyone contributing here wants to come off as aggressive (I certainly don't want to :smile:). This i31ref part of the spec is a rather controversial issue in both a technical and a political sense:

(a) WASM is this great thing that everyone wants to be part of and that, by aiming for strong safety guarantees, sets the bar quite a bit higher for everyone, compared to the commonly-used traditional architectures that we are used to.

(b) So, all kinds of different people are coming together. Many people (language implementers, engine implementers) are coming here with existing codebases. When looking at compiling to WASM or implementing WASM, we discover choices in our codebases that make it difficult to work with WASM. Some things that were considered acceptable in the existing codebase before now present as technical debt. Some choices are very fundamental to the codebase, and it is not immediately obvious whether, or even how, we can undo and rework them at a reasonable cost. So we all try to understand how to work with the spec in a way that is maintainable in the long run.

(c) All involved here care deeply about performance. With WASM being a melting pot for all kinds of languages, none of the language implementers can expect to get a WASM that caters optimally to all concerns of their language (or, for that matter, a WASM that is trivial to compile to). There are incredibly many constraints and requirements coming from all the languages overall. Still, we consider reasonable tradeoffs in performance worth it in the overall scheme: the premise of being able to run code anywhere by compiling to WASM. One important task of WASM engine implementers is to watch over the specification and look for features that introduce unnecessary performance costs, while all the compiler implementers lobby for and ask for features that they can make use of (ideally while keeping in mind the overall goal of making WASM an efficient compilation target for many languages). The goal here for WASM GC is to strike a balance between WASM GC being small and fast and providing the expressivity that makes it a reasonable compilation target.

(d) Resource constraints matter for all of us. Decisions need to be made, and we all need to put only so much on our plates that we can still handle it and get something done. I appreciate very much that WASM is not some huge gigantic beast, but a rather compact language whose specification can be read and mostly understood in a very short time.

I'm trying to summarize and sometimes comment on some of the points brought up. If you feel that I missed your point or something needs clarification, do speak up.

@rossberg says that first-class polymorphism, of which OCaml has plenty (polymorphic methods and record fields, first-class modules, GADTs, ...), makes it impossible to do static code specialization to the degree that WASM GC without i31ref requires. OCaml compiler implementers that I asked seem to share this view and provided some hard-to-monomorphize examples. I know too little about all the features of OCaml at this point (I have been working with the compiler only since September, part time, and I still have a lot to learn) to say whether full-program monomorphization is possible or not - I do not have a proof for either of these hypotheses. @rossberg, do you know good papers or theses on this topic?

@gasche says that type-directed or shape-directed unboxing, supported by code specialization, is a fairly sophisticated technique for compilers to implement, while, in contrast, having an efficient way to "box" immediate scalars without going through the heap is easy for backends to allow, and fairly easy to compile into for types that fit the smaller width. I agree with your assessment that OCaml, in particular, chose to forgo many optimizations by picking a memory model that brings 80% of the expected performance to the table. The implicit assumption that the memory works this way, i.e. that you can put an unboxed integer into a memory cell in a heap block, is relied on in large parts of the compiler (in particular those parts where enough optimization has been done to make them a worthwhile starting point for compiling to WASM).

@aardappel says "There seems to be a lot of default fear of (5), that it is going to result in unreasonable code bloat." Looking into MLton, I believe now that code size with monomorphization is likely not the major problem, as the bloat is also offset by the ability to do dead-code analysis in a whole-program compiler.

@gasche says "Many language runtimes use representations that are either an immediate word or a pointer, with tagging to distinguish the two." Indeed, we have the same representation in OCaml. type t = X | Y of int is a type that has two variants, X and Y. The variant X, which has no additional data attached, is represented as an integer, while Y is represented by a heap block with a field (tag) that says it's Y and another field that holds the attached integer value. This representation is introduced so early in the compiler that we do not have the information to undo it at a later stage in the compilation pipeline where we could branch off to WASM at reasonable implementation cost. I could call this either "technical debt" or a natural consequence of the choice of memory model permeating the whole compiler. There are people on the OCaml compiler team who believe that it should be possible to rewrite the compiler to introduce a different memory representation for variant types, but it will require a lot of effort.

@rossberg says that WASM should support a mapping for the fundamental compilation techniques of polymorphic languages. As someone who wants to bring OCaml to WASM, I'm obviously very much in favor of that. In contrast, @aardappel sees the WASM GC MVP as a target aimed at languages that can compile via static code specialization. Which, looking at all the parts of the spec apart from i31ref, seems mostly true.

Concerning the topic of separate compilation of modules: So far, we are not aware of any problems with separate compilation of modules for OCaml, in the presence of i31ref. Obviously, in a MVP compiler backend, to a MVP language running on MVP engines, only modules compiled with the same compiler version will be compatible. That's perfectly fine. No complaints whatsoever here. :smile: Users will understand.

To get back to the topic of alternatives to i31ref: when I think of an alternative, I think of something that enables, by some means, a memory model that allows us, with reasonable efficiency, to (1) store heap pointers and integers in the same array and read them back out, and (2) check at runtime whether a value is an integer or a heap pointer, so that we can write a compiler backend for the OCaml compiler that targets the WASM GC MVP with the resources we have at hand.

Traditional hardware architectures have a memory model that lets us fulfill these requirements by pointer tagging integers at the cost of losing one bit on the integer representation - but there may be other reasonable alternatives that I don't see right now, that work for us, and that other languages could make better use of than i31ref. Possibly the alternative can live outside of WASM GC MVP, but I am not sure about that at all. If there was some external tool with a memory model that fulfills (1) and (2) and that compiles to WASM GC MVP, with a long-term expectation of reasonable performance, we could work with that. I don't know how that could work, though.

I believe that these two requirements are all that we need in order to compile all the crazy polymorphism in OCaml. I think @rossberg is right that OCaml, in the long term, can work with 30-bit integers or 31-bit integers, or even, crazy as it may seem, 29-bit integers.

Requirement (2) can possibly be lifted after a major refactor of the OCaml compiler (which is not 100% guaranteed to succeed, and that we don't have resources for right now, but at least there are people optimistic that it could work).

Having both requirements (1) and (2), it looks to me like I need to double-box integers in a MVP without i31ref. I'll elaborate on that later.
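
To illustrate what "double-box" means here, a small OCaml model of the two representation options (these types model possible Wasm-GC-side value layouts for illustration, not the actual OCaml runtime):

```ocaml
(* With an i31ref-like type: integers are immediate, so requirement (1)
   (ints and pointers share arrays) and requirement (2) (a runtime check
   tells them apart) are both met without extra allocation. *)
type value =
  | Imm of int            (* unboxed: lives in the reference itself *)
  | Block of value array  (* heap block holding uniform values *)

let is_scalar = function Imm _ -> true | Block _ -> false

(* Without it, every integer stored inside a block needs its own heap
   cell: the "double-boxing" mentioned above. *)
type value_no_i31 =
  | BoxedInt of int ref            (* one extra allocation per integer *)
  | Block' of value_no_i31 array
```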

I'm very interested in other languages who would use i31ref to compile polymorphism (no matter whether the language is dynamically or statically typed). Even more so, I am interested in the perspective of languages who need to compile extensive polymorphism but do not see i31ref as useful, needing something else. Please comment and describe your constraints for the memory model.

However, I expect that as soon as i31ref makes it to the implementation stage, people will start asking for i63ref.

RossTate commented 4 years ago

Thanks for the great post, @sabine!

I think @rossberg is right that OCaml, in the long term, can work with 30-bit integers or 31-bit integers, or even, crazy as it may seem, 29-bit integers.

I was just about to ask you this question, so thanks for beating me to it! This is extremely useful, as it's the kind of perspective we can only get from language implementers.

Can I ask you for some more of your perspective along this line? Something I've been wondering is that, if we do change the number of bits for guaranteed-unboxed integers, what's a non-arbitrary number to change it to? The number that comes to mind is 24 bits because that's 3 bytes (so just enough for RGB and a little more than enough for some Unicode encodings). So I'm interested to hear specifically your thoughts on whether that would work for OCaml? That is, do you know of any OCaml applications or implementation strategies that would suggest a useful lower-bound on the number of bits needed for unboxed integers?

fgmccabe commented 4 years ago

JS implementations faced a similar dilemma with the need to optimize simple integers. The dominant technique there seems to be nan-boxing, which in the language of i31ref would be equivalent to i52ref. TLDR: the discussion becomes moot when we migrate (which we no doubt will) to 64bit wasm.

aardappel commented 4 years ago

@rossberg

You don't know at its definition site which polymorphic function ends up being used that way, and with what degree of polymorphism. The composition can happen somewhere else entirely, e.g. in a different module.

It is my assumption that if you compile down to a single .wasm, there comes a moment where with static specialization you can know the vast majority of these cases. If your default way of working involves completely independent compilation to separate .wasm files, then yes, you more often cannot, but then again you're kind of bringing this upon yourself.

You can't deref an anyref. You first need to cast it to something concrete. That necessarily involves a check, with or without i31ref.

I was thinking of engine code, like the GC traversing objects. And checks can be more or less expensive.

@jakobkummerow

That is indeed the one place where adding i31ref has a non-zero cost: when ref.test or ref.cast or br_on_cast operate on a value that's statically typed as anyref or eqref, then before loading and comparing that value's type descriptor, they first have to check the pointer's tag bit. Given that the type descriptor check (which needs to happen with or without i31ref's existence) involves a load from memory, and the tag bit check doesn't, I'm confident that the additional overhead is negligible.

That bit check may involve a missed branch, but yes, for code that doesn't use i31ref (which is what I'm concerned about) you'd never get a branch miss. Conversely, one could regard loading the type field as having no cost if the subsequent code is going to touch all the following fields anyway.

I'll take your word for it for now that the cost will be negligible, though I hope at some point we'll have some numbers, particularly a real world GC benchmark (not containing any i31ref values) with bit checks turned on and off. I'm a big fan of the "you pay for what you use" principle, in Wasm, and all of computing.

aardappel commented 4 years ago

@fgmccabe

TLDR: the discussion becomes moot when we migrate (which we no doubt will) to 64bit wasm.

It's not just about how many bits we can fit in a pointer, it is also about whether we want to force the need to check these bits upon languages that don't need it (or languages that may need it, for code where it can be statically known that it's not needed).

64-bit wasm is mostly about 64-bit operands to linear memory load/stores, and thus larger linear memories, not much else. The size of an anyref would still be entirely an implementation choice, and in theory, an implementation could be using 32-bits to represent anyref even while linear memory is running in 64-bit addressing mode.

Conversely, does the GC proposal put some expected upper bound on the number of GC objects that can be addressed? If this can be >2GB, then likely it must use 64-bit pointers internally, and we might as well use i63ref ? ;) That would not be great for any remaining 32-bit CPU architectures we want to run on, but it certainly is strange (and wasteful?) that in practice, almost all i31refs will sit inside a 64-bit pointer.

Actually, even an i63ref on a 32-bit architecture is not that expensive, since if the bit says it's a pointer, the value can simply be truncated without a further check. It doesn't even need to load the 2nd half from memory. So maybe we should go for i63ref, if we believe 32-bit platforms are becoming less relevant for Wasm as time goes on.

gasche commented 4 years ago

@aardappel

Yes, but the types at the edges are currently very basic types that may well not make it possible to express the full set of type features of a language like ML, meaning separate compilation would only be "safe" if compiled from one version of the program, essentially not making it separate compilation anymore.

One could make the same argument about separate compilation when linking is done at the assembly level, or LLVM level, or JVM bytecode level, yet those are things that implementations do in practice, because whole-program compilation is extremely inconvenient in various ways. When you generate code for a library into a module, you don't know what client modules you will be linked against, so it is hard to tell what specialized instances you need to generate. It is of course possible to design whole-program tree-shakers that reduce code size by propagating whole-program information about usage (a common practice in some communities), but that does not remove the importance in practice of also having an open-world compilation model.

@RossTate

From what I understand, [GHC does pointer tagging] because they found that algebraic operations were so common that it was more useful to embed the algebraic-constructor tag in the low bits of the pointer.

Haskell is in an exceptional situation due to the pervasiveness of laziness: many values are thunks, and GHC's evaluation model performs an indirect jump on inspection, even to find out that a value is already evaluated. The benefit of pointer tagging is not to avoid a dereference (which you need to do in most cases to get the constructor arguments anyway), but to avoid this indirect jump. The way they avoid the cost of excessive boxing is through very aggressive inlining and optimizations in general; again the "sufficiently smart compiler" approach, which I don't think should be held as a goal for language implementors.

Note that there is a lot more Haskell code written in the strict subset these days, so it is not completely clear to me that this pointer-tagging choice is still the right design. At the time it was introduced, pointer tagging provided a 5% performance improvement over a strategy without pointer tagging. At the risk of going completely into guesswork: this suggests that it is reasonable to consider implementing a Haskell runtime without pointer tagging, and get realistic performance. (You may even see interesting gains by using tagged integers, which may offset the absence of tagging.) On the other hand, when implementing a strict language, pointer tagging does not help, and not having tagged immediate values makes you fall off a performance cliff.

So the experience of Haskell developers is that either custom pointer tagging should be preferred over i31ref for better performance

I am not aware of a performance comparison between pointer tagging and immediate-integers tagging for Haskell programs.

or i30ref should be preferred over i31ref for better GC strategies.

Chicken Scheme uses more than one bit of tag for value shapes, but only immediate integers use the least bit set to 1. In this model you can have i31ref, and still there are extra tag bits available in the 0 case. (Not sure whether this would work for the Haskell runtimes you are mentioning.)

I believe that in practice, if you used much less than 31 bits for immediate values, people would not use this for an integer type (24 bits is too small for your main type of machine-length integers), only for other immediate values. This might be a workable compromise (in particular, maybe people want 60-plus-bit integers nowadays anyway), but it is not completely clear what the benefits are.
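
The kind of split being discussed here can be made concrete with a small sketch. The bit patterns below are illustrative assumptions (not Chicken's actual ones), with plain OCaml ints standing in for machine words:

```ocaml
(* Illustrative word-tagging scheme:
     ...xxx1 -> immediate integer (i31-style)
     ...xx10 -> other immediate (booleans, characters, ...)
     ...xx00 -> aligned heap pointer *)

let tag_int n = (n lsl 1) lor 1   (* 2n + 1, as in OCaml's own runtime *)
let untag_int w = w asr 1         (* arithmetic shift preserves the sign *)

let is_int w             = w land 0b1  = 0b1
let is_other_immediate w = w land 0b11 = 0b10
let is_pointer w         = w land 0b11 = 0b00
```

Note that immediate integers still get a full 31 bits, while the 0 case leaves extra tag space for other immediates, which is the point being made above.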

@sabine

I know too little about all the features of OCaml at this point (I have been working with the compiler only since September part time and I still have a lot to learn.) to say whether full-program monomorphization is possible or not - I do not have a proof for either of these hypotheses.

Ahead-of-time full-program monomorphization is known to be impossible for languages that support polymorphic recursion, including OCaml and Haskell (but not SML, see MLton), Agda, Idris, etc. If you have a JIT you can monomorphize "on the fly", this is what the dotnet runtime does. This point was made in this earlier comment of @rossberg: https://github.com/WebAssembly/gc/issues/53#issuecomment-545889014

gasche commented 4 years ago

@RossTate here is another argument that you may find interesting, in terms of finding a "principled" argument for choosing one size or the other.

We are talking about splits in the data space of values that can fit one word. You may want to have tag bits/patterns for either pointers (GHC pointer tagging, or other tricks used by other implementations) or scalars (non-pointers). For example, the Chez Scheme runtime has a bitpattern for fixed-size numbers but also for booleans, pairs (the "cons tag" mentioned in the discussion), symbols, vectors, etc. The GC only needs to know the fine-grained structure of the pointers, so from the GC perspective we may need many categories of pointers, but only one category of values (which language implementations can then subdivide with more tagging). Given that both sides may need tag space depending on the language, it makes sense to divide the space evenly between pointers and non-pointers: this is exactly the int31ref design.

sabine commented 4 years ago

@RossTate

if we do change the number of bits for guaranteed-unboxed integers, what's a non-arbitrary number to change it to?

I don't think that a non-arbitrary number of bits per se exists. In an ideal world, we would have garbage collection in hardware, and we wouldn't need to chop off bits from a hardware word in order to implement efficient garbage collection in software. However, a specification for garbage collection in hardware mostly faces the same challenges as the WASM GC spec: there being a lot of different ideas of how efficient GC should work, with nearly every language bringing both some acquired tastes and some fundamental invariants to the table.

I agree with @gasche that, if the number of bits gets too low, we will not use this to implement integer types: the better trade-off in that situation is to implement the integer types via boxed values instead of these unboxed integers, work on optimizing that representation, and enjoying native 32-bit arithmetic.

Note, though, that the implementation of integer types is only one of the situations where guaranteed-unboxed integers are used in the data representation of OCaml. My impression is that the OCaml compiler can make good use of guaranteed-unboxed integers with fewer bits in most, if not all, of these other situations. (I need to check back with people tomorrow to make a list of all the situations where boxing the unboxed is considered to be particularly bad and confirm whether this assessment is correct.)

@gasche

Ahead-of-time full-program monomorphization is known to be impossible for languages that support polymorphic recursion, including OCaml and Haskell (but not SML, see MLton), Agda, Idris, etc. If you have a JIT you can monomorphize "on the fly", this is what the dotnet runtime does. This point was made in this earlier comment

Are there any resources that show what a simple JIT compiler that can monomorphize code containing polymorphic recursion looks like? I do believe Andreas Rossberg when he says it's not possible to do ahead-of-time full-program monomorphization because of polymorphic recursion. But I would like to understand the argument in detail: why this is the case, and where exactly things break down when trying to monomorphize ahead of time.

gasche commented 4 years ago

Here would be an artificial example:

(* a fairly inefficient way to compute 2^n,
   by creating a full tree of depth n and counting its leaves *)
let pow2 n =
  let rec loop : 'a . int -> 'a -> ('a -> int) -> int =
    fun n v count ->
      if n = 0 then count v
      else
        (* call ourselves on values of type ('a * 'a) *)
        loop (n - 1) (v, v) (fun (v1, v2) -> count v1 + count v2)
  in
  loop n () (fun _ -> 1)

let () =
  (* The type of [v] in the last call to [loop] in the code below
     is determined dynamically  by the integer read at runtime.

     For example if we read 3, the last iteration takes a value of type
      (((unit * unit) * (unit * unit)) * ((unit * unit) * (unit * unit)))
     and returns 8. *)
  print_int (pow2 (read_int ()))
sabine commented 4 years ago

@aardappel

It's not just about how many bits we can fit in a pointer, it is also about whether we want to force the need to check these bits upon languages that don't need it (or languages that may need it, for code where it can be statically known that it's not needed).

Yes, I think this is a crucial point. In order to find out whether this feature is reasonable, we need a more thorough survey among the compiler teams of the candidate languages that could run on WASM and would use the built-in GC instead of bringing their own.

The hypothesis, as I understand the previous discussions, is that (a) languages that belong in the wider sense to the family of Lisp will be able to make good use of the feature, (b) some languages would use the feature for the sake of convenience/simplicity during compilation (but could ultimately compile to WASM with no major loss in performance even if it didn't exist), and (c) other languages will not use it at all, no matter what.

Now this becomes something that can be checked with the maintainers, to sort the languages into these three categories.

It would also be very helpful to have performance comparisons for languages that want/need this feature, with the feature and without it. This requires significant effort to be put into optimizing the compiler version that cannot use i31ref, though, in order to be a fair comparison at all.

At least, this could make it easier to decide whether to include it, and, if so, in what form.

gasche commented 4 years ago

A reference on the use of a JIT to perform type-directed specialization at runtime would be the article "Design and Implementation of Generics for the .NET Common Language Runtime", by Andrew Kennedy and Don Syme, 2001.

sabine commented 4 years ago

Here would be an artificial example:

(* a fairly inefficient way to compute 2^n,
   by creating a full tree of depth n and counting its leaves *)
let pow2 n =
  let rec loop : 'a . int -> 'a -> ('a -> int) -> int =
    fun n v count ->
      if n = 0 then count v
      else
        (* call ourselves on values of type ('a * 'a) *)
        loop (n - 1) (v, v) (fun (v1, v2) -> count v1 + count v2)
  in
  loop n () (fun _ -> 1)

let () =
  (* The type of [v] in the last call to [loop] in the code below
     is determined dynamically  by the integer read at runtime.

     For example if we read 3, the last iteration takes a value of type
      (((unit * unit) * (unit * unit)) * ((unit * unit) * (unit * unit)))
     and returns 8. *)
  print_int (pow2 (read_int ()))

So, in every recursive call to loop, the type of v is different: first it is () (the unit value), then a pair of unit values, then a pair of pairs of units, and so on. Yes, it looks like the loop function cannot be monomorphized, and, since we are getting the number n from an external source, we cannot even perform static analysis to know all the inputs that loop is going to be called on.

Admittedly, this particular example is not very representative of real world code, but the general principle that there are functions that can be called recursively with different types applies.

jakobkummerow commented 4 years ago

@fgmccabe

The dominant technique [among JS implementations] seems to be nan-boxing

Not sure how you define "dominant"; V8 at least uses pointer tagging, almost exactly like i31ref. Needless to say, this is purely an internal implementation detail, and not at all spec'ed in or visible to JavaScript. (Which is making me think that other dynamic languages, when compiled to Wasm, might want to use i31ref similarly, with dynamic checking for any given integer value whether it fits or needs to be boxed -- but this is just speculation on my part.)

@aardappel

I hope at some point we'll have a real world GC benchmark (not containing any i31ref values) with bit checks turned on and off

I agree that that would be an interesting data point; I'd also like to point out that probably none of the browser Wasm engines are going to provide this data point: whether through pointer-tagging or nan-boxing, they all have unboxing techniques already in widespread use in their GC implementations, so when they add Wasm-GC support (with or without i31ref specified, and with or without i31ref actually occurring in the running Wasm module), they won't be able to turn that off. In other words: in the browser engines, the GC must check for pointer tags anyway as it walks the heap, so in browsers we're getting this theoretical cost of i31ref support for free. (Disclaimers: I can only speak about V8 with certainty, but I'm pretty sure that this holds for Spidermonkey and JSC as well. Of course I'm aware that browser engines aren't the only game in town, so I'm just providing one perspective here.)

in theory, an implementation could be using 32-bits to represent anyref even while linear memory is running in 64-bit addressing mode

Not just in theory; that's almost certainly what we'll be doing in V8.

does the GC proposal put some expected upper bound on the amount of GC objects that can be addressed? If this can be >2GB, then likely it must use 64-bit pointers internally,

I don't think we've talked about limits yet; it's fair to assume that there will be a limit, just like the Wasm JS API spec already defines a bunch of other limits. Note that the theoretical limit of managed objects that are addressable with a 32bit pointer (even tagged) is not 2GB; with 4-byte object alignment it's trivially easy to support 4GB heaps with tagged 32bit pointers (V8 has that support today, though our default heap size limit is lower); with 8-byte alignment and an appropriate "pointer compression" design (left-shifting to decompress before each access) one could go as far as supporting 16 GB heaps (at a certain performance cost, which may or may not be worth it for saving half the memory of all pointer-sized fields).
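
The arithmetic behind that last figure can be sketched as follows; the layout here (one tag bit plus a shift-decompressed 31-bit offset) is an assumption for illustration, not V8's actual scheme:

```ocaml
(* A compressed 32-bit word: the low bit distinguishes scalar from
   pointer; for pointers, the remaining 31 bits name an 8-byte-aligned
   object, so decompression is "drop the tag, multiply by the
   alignment". Requires a 64-bit OCaml for the final constant. *)
let decompress compressed = (compressed lsr 1) lsl 3

(* Addressable heap under this layout: 2^31 slots * 8 bytes = 16 GiB. *)
let max_heap_bytes = (1 lsl 31) * 8
```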

and we might as well use i63ref ?

V8 uses 32-bit pointers everywhere (yes, even on 64-bit desktop browsers, because that saves massive amounts of memory), so I assume that i63ref would experience significant pushback, to put it gently :-) (Because it would force all pointers to be 64 bits wide, by virtue of being a subtype of anyref and demanding an unboxed implementation.)

sabine commented 4 years ago

@jakobkummerow Thanks for the insights on the existing V8 codebase.

So,

  1. existing GC implementations of browser engines generally use either pointer-tagging or nan-boxing already. At least in browsers, we are going to bear the cost of i31ref whether or not it is part of the WASM GC spec, since the WASM GC will certainly be implemented on top of the existing GC.
  2. In contrast to i31ref, i63ref is not always easy to support in the existing codebase because it forces all engines to use 64-bit pointers. Pointer width should be an implementation detail of the engine.

Anything to add to this, from your side @lukewagner?

rossberg commented 4 years ago

@RossTate:

Can I ask you for some more of your perspective along this line? Something I've been wondering is that, if we do change the number of bits for guaranteed-unboxed integers, what's a non-arbitrary number to change it to? The number that comes to mind is 24 bits because that's 3 bytes (so just enough for RGB and a little more than enough for some Unicode encodings). So I'm interested to hear specifically your thoughts on whether that would work for OCaml? That is, do you know of any OCaml applications or implementation strategies that would suggest a useful lower bound on the number of bits needed for unboxed integers?

I have several comments, some of which have already been made by others:

In summary, to decide on a smaller value, we should at least have some plausible scenario by which a Wasm engine could actually benefit from using more than 1 bit to distinguish scalars from pointers.

As an aside, one additional implementation detail I just remembered about OCaml is row polymorphism for objects and variants. Both are implemented with hashes over method/constructor names, which I believe are 31 bit. The hash function has to be the same across architectures because it affects what types are allowed (to rule out ones with hash collisions). So going below 31 might make objects and variants more difficult to implement and much more expensive to use. Probably not a show stopper, but at least a problem.
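For concreteness, the 31-bit hash in question looks roughly like this (a sketch reconstructed from the OCaml compiler's `hash_variant`; treat the exact constants as an assumption rather than normative):

```ocaml
(* Sketch of OCaml's polymorphic-variant/method-name hash: a simple
   multiplicative hash truncated to 31 bits, so results agree across
   32- and 64-bit architectures. *)
let hash_variant s =
  let accu = ref 0 in
  for i = 0 to String.length s - 1 do
    accu := 223 * !accu + Char.code s.[i]
  done;
  (* keep 31 bits *)
  accu := !accu land (1 lsl 31 - 1);
  (* make the result signed, matching 32-bit tagged-integer range *)
  if !accu > 0x3FFFFFFF then !accu - (1 lsl 31) else !accu
```

Because the hash must fit in a tagged integer on 32-bit platforms, shrinking the scalar width below 31 bits would change which constructor names collide, and hence which programs type-check.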

@aardappel:

If your default way of working involves completely independent compilation to separate .wasm files, then yes, you more often cannot, but then again you're kind of bringing this upon yourself.

Are you assuming that everybody should be doing whole program compilation? That clearly is not what Wasm intends to impose, and it generally is not a scalable approach to development or deployment. In any case, it doesn't solve the problem either, as others have shown above.

I hope at some point we'll have some numbers, particularly a real world GC benchmark (not containing any i31ref values) with bit checks turned on and off. I'm a big fan of the "you pay for what you use" principle, in Wasm, and all of computing.

AFAICT, the vast majority of GC runtimes makes use of tagged integers. And even some prominent ones that don't have second thoughts about that (https://blogs.oracle.com/jrose/fixnums-in-the-vm).

In any case, you are probably worrying about the wrong cost. It has actually been observed that unboxing can make GC slower than tagging, not faster (for example, see https://xavierleroy.org/publi/unboxing-tic97.pdf). Mainly because you need to carry around and inspect more type descriptors, which has repeatedly been observed to be costly. So if there is a cost to worry about wrt Wasm GC, then it is the cost of supporting heterogeneous unboxing in the form of structs, and the need for complex type descriptors that induces.

Of course, as @jakobkummerow points out, none of that will be measurable in Web Wasm engines, because they've already sunk the cost of both mechanisms for JS's sake.

sabine commented 4 years ago

The hash function has to be the same across architectures because it affects what types are allowed (to rule out ones with hash collisions). So going below 31 might make objects and variants more difficult to implement and much more expensive to use. Probably not a show stopper, but at least a problem.

It is true that having fewer than 31 bits would require work on the OCaml compiler, and that different (and, due to having fewer bits, more frequent) hash collisions would occur. People assume that this is ultimately solvable in some way or another (people compiling OCaml to WASM getting used to the quirks and/or accepting some degree of performance degradation).

We would clearly prefer the ability to embed 31 bits into anyref (and also 63, but asking for that would be too much in the light of this being not implementable in the short, and possibly even the longer term :smile:). If there were important technical reasons that make 31 bits impossible to commit to, we would understand.

In any case, you are probably worrying about the wrong cost. It has actually been observed that unboxing can make GC slower than tagging, not faster (for example, see https://xavierleroy.org/publi/unboxing-tic97.pdf)

It seems to me that we currently do not have a simple way to specify efficient type-directed unboxing in WASM, with as good results and as little implementation effort as the i31ref solution.

I suspect that a solution for type-directed unboxing on WASM that is reasonably simple runs the very real risk of being not expressive enough for polymorphic languages, while at the same time being just a convenience feature for languages that can already monomorphize their code.

RossTate commented 4 years ago

Meta-note: I am not engaging in comments on whether to have unboxed integers. While there have been a number of informative remarks on this topic, it is separate from this discussion on considering alternative implementation strategies for OCaml if there are no unboxed integers. Exploring this discussion will help inform whether to have unboxed integers (as we may discover there are or are not good alternatives for OCaml), but the other way around is not true. Please create a separate issue on whether to have unboxed integers if y'all would like to continue that topic.

Monomorphization

Let's examine how this idea might work out in more detail.

Polymorphic Recursion

@gasche gives an example illustrating how polymorphic recursion can cause a type parameter to represent an unbounded number of surface-level types. However, for the issue at hand, we need only consider the low-level abstractions of those types. In particular, it seems there are two low-level abstractions for OCaml's types necessary for the issue at hand: scalar and reference.

For reference, I've repeated @gasche's example here: (By the way, Section 5.1 of this paper provides a real-world example.)

(* a fairly inefficient way to compute 2^n,
   by creating a full tree of depth n and counting its leaves *)
let pow2 n =
  let rec loop : 'a . int -> 'a -> ('a -> int) -> int =
    fun n v count ->
      if n = 0 then count v
      else
        (* call ourselves on values of type ('a * 'a) *)
        loop (n - 1) (v, v) (fun (v1, v2) -> count v1 + count v2)
  in
  loop n () (fun _ -> 1)

let () =
  (* The type of [v] in the last call to [loop] in the code below
     is determined dynamically  by the integer read at runtime.

     For example if we read 3, the last iteration takes a value of type
      (((unit * unit) * (unit * unit)) * ((unit * unit) * (unit * unit)))
     and returns 8. *)
  print_int (pow2 (read_int ()))

We can monomorphize this at the low-level by having two forms of loop: loop_scalar and loop_ref. The implementation of loop_scalar would recursively call loop_ref. The implementation of loop_ref would also call loop_ref. The implementation of pow2 would call loop_scalar. It's possible I messed up, but I believe the implementation code resulting from fleshing out that strategy would all type-check in WebAssembly (using the current GC proposal).
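To make the strategy concrete, here is an OCaml-level sketch of that split (the names `loop_scalar` and `loop_ref` are mine; at the Wasm level the `'a` in `loop_ref` would simply be a reference type such as anyref):

```ocaml
(* The reference variant only ever sees heap-allocated pairs, so one copy
   suffices for all reference instantiations; it still needs a polymorphic
   annotation at the OCaml level because of the polymorphic recursion. *)
let rec loop_ref : 'a. int -> 'a -> ('a -> int) -> int =
  fun n v count ->
    if n = 0 then count v
    else loop_ref (n - 1) (v, v) (fun (v1, v2) -> count v1 + count v2)

(* The scalar entry point handles the initial (unit) instantiation and
   immediately defers to the reference variant. *)
let loop_scalar (n : int) (v : unit) (count : unit -> int) : int =
  if n = 0 then count v
  else loop_ref (n - 1) (v, v) (fun (v1, v2) -> count v1 + count v2)

let pow2 n = loop_scalar n () (fun _ -> 1)
```

The key observation is that only two low-level shapes ever flow through `loop`, so two variants cover the unbounded family of surface types.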

Code Bloat

Monomorphization would require every data structure and function to have 2^n variants, where n is the number of type parameters (but not existentially-quantified module types—more on that later). For very large functions, this can be mitigated by having scalar versions call reference versions with boxed scalars if necessary. If this is impractical for OCaml, which has only two low-level abstractions for each type, then it seems inconsistent to expect C# to monomorphize and JIT (and the current proposed alternative for C# includes having integers be boxed).

Higher-Rank Polymorphism

OCaml supports higher-rank polymorphism through its records' field types. This might be addressed by having a field for each monomorphization. It can also be addressed by having just one field and using something like dispatch tags with the dispatch_func extension so that the appropriate funcref can determine which specialization to defer to based on the dispatch tag that was chosen by the callee according to the monomorphization they need.

Modularity

The current proposal does not support import/export of non-reference types. So any module importing a type can assume it is a reference type, and any module exporting scalar types will unfortunately have to box all scalars. This boxing would have to happen anyway if you wanted to guarantee abstraction.

Summary

All in all, there are some weaknesses, but so far it seems to me like this strategy is at least plausible. There are also some potential plusses. Instead of 32-bit integers, this strategy could alternatively let you do 64-bit integers and 64-bit floats without boxing. I'm curious to hear your thoughts on this analysis, @sabine.

titzer commented 4 years ago

I'm a bit late to this discussion but I'll offer the following datapoints:

(The above said, I love whole program compilation! Virgil uses it, combined with polymorphic reachability and specialization as its default implementation strategy. On a big program, the compiler itself, about 30% of the final binary corresponds to specialized code originating from < 5% of the source code being polymorphic; mostly collections and utilities. Its prototype implementation strategy is to specialize up to machine representation, and this reduces that 30% to about 10%. Virgil targets wasm too, but when targeting this GC proposal, it cannot share as many specializations because doing so loses static type information and would force the introduction of casts.)

Let me also point out the elephant in the room:

Which brings me to an important point,

I generally think that i31ref is pretty-alright, not necessarily because it is great for compiling polymorphic statically-typed languages, but because it will be extremely useful for dynamically typed languages. What we are seeing here is just another instance of the pattern where static typing cannot be expressed properly so we fall back on dynamic typing.

Last point.

rossberg commented 4 years ago

@RossTate: I believe you are still overlooking essential parts of the larger picture.

For example, what about abstract types? Consider this example:

module type S = sig type t  val xs : t list end
module A : S = struct type t = int let xs = [0; 1; 2] end
module B : S = struct type t = string let xs = ["foo"; "bar"] end
module M = (val if Random.bool () then (module A : S) else (module B : S))

When somebody passes M.xs to a polymorphic list function, say hd, what do you do? M.t here is an existential type. To solve this, you cannot avoid passing runtime type information in and out for both universal and existential types. And because all these features compose, you’ll end up having to do that everywhere that flow/escape analysis cannot prove that type witnesses are completely contained.

Such type passing implementations for polymorphic languages have been researched long ago and shown to have significantly more overhead than tagging.

As for whether that is even “plausible” for OCaml: in practice it is not. For all the reasons mentioned in this thread, OCaml’s compiler is designed for a uniform representation and hence happily uses untyped IRs. That is, none of the relevant phases would even know where polymorphism occurs and where to insert type or representation witnesses. You’d essentially need to redo its entire compilation pipeline. Similarly for most other languages in this class. (That is also the reason why most of them never saw a usable port to the CLR, and we only have F#, which is much less expressive.)

But it doesn’t stop there. You also have types with heterogeneous representations, something which all of the suggestions so far have completely ignored. For example:

type t = A | B | C | D of int * int

You naturally want to represent constructors A to C unboxed, but D clearly has to be boxed. When you have a list of t, or instantiate a polymorphic function with t, then it has to be able to handle both kinds of values at the same time. Short of tagged scalars you will be forced to box all nullary constructors, which would penalise a lot of typical code patterns — datatypes are ubiquitous in functional programming.
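The unboxed-constructor point is easy to check against today's OCaml runtime (assuming the stock representation, where nullary constructors are immediates and non-constant constructors are heap blocks):

```ocaml
type t = A | B | C | D of int * int

let () =
  (* A, B, C are tagged scalars (0, 1, 2); D is a heap-allocated block *)
  assert (Obj.is_int (Obj.repr A));
  assert (Obj.is_int (Obj.repr C));
  assert (Obj.is_block (Obj.repr (D (1, 2))))
```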

At this point I expect you to argue that we should introduce variant types into Wasm to address that. But that’s not good enough, because you also have row polymorphism, where the set of constructors is open and can be extended arbitrarily! Unless you also want to make that primitive in Wasm, representing variants with tagged scalars is the only practical way to unbox their constructors.

And I could probably go on.

The bottom line is that a monomorphising implementation of polymorphism is analogous to a coercive implementation of subtyping. It has all the same problems that you’re familiar with regarding compositionality, scalability, etc. Probably worse, because it’s more general.

And none of this is limited to OCaml. You'll have similar problems with most other polymorphic languages.

[Hopping on the train now and will be off the grid for the next week or two.]

sjrd commented 4 years ago

@titzer

Type erasure has been an absolute fiasco in the JVM world. We cannot repeat that mistake. This is primarily because Java programs (in fact, programs in all object-oriented languages that mix generics and inheritance) do not actually have the parametricity property that is assumed by a lot of theoretical work and by functional languages. Simply put, generic Java programs in practice rely heavily on type information that only comes from their type arguments. Scala has gone through many incarnations of reification to work around this. I don't know where it landed, but it's clear that this problem is not going away. The end result is that these programs work around erasure with a bunch of ad hoc mechanisms that amount to passing dynamic information about types at runtime.

In Scala we ended up rolling back virtually all aspects of reified types at runtime. We have learned to embrace type erasure at runtime, while taking advantage of typeclasses and their type-based resolution at compile time. The only remaining uses of runtime type information are what we call ClassTags, and they are only necessary for generically working with Java arrays, the one thing that Java does not use erasure for.

So the ultimate experience of Scala is that type erasure is a success, not the terrible fiasco that you are presenting.

titzer commented 4 years ago

@sjrd

I am glad that Scala found an expressive solution that works around the JVM's limitations. Nevertheless Java itself still suffers from expressivity problems because of, e.g. unchecked casts involving generics. We'd like to avoid hampering the design of future languages targeting Wasm as well as support existing languages like C# in a way that is competitive with their existing native implementations.

RossTate commented 4 years ago

Oh shoot, I did forget about functors. That does complicate things. That's where more whole-program considerations would have to come in. 😕

type t = A | B | C | D of int * int

You just have A, B, and C each be some singleton reference. For pattern matching, you could either compare with each singleton reference one by one, or those references could each have an i32 field indicating the case that you could switch on.
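One way to picture that singleton encoding (a hypothetical sketch; the record type and names are mine, standing in for Wasm structs):

```ocaml
(* Each nullary constructor becomes one preallocated block carrying its
   case number; only D allocates per value. Matching switches on [case]. *)
type repr = { case : int; fields : int array }

let a = { case = 0; fields = [||] }
let b = { case = 1; fields = [||] }
let c = { case = 2; fields = [||] }
let d x y = { case = 3; fields = [| x; y |] }

let describe v =
  match v.case with
  | 0 -> "A"
  | 1 -> "B"
  | 2 -> "C"
  | _ -> Printf.sprintf "D (%d, %d)" v.fields.(0) v.fields.(1)
```

The cost relative to tagged scalars is that A, B, and C are now references, so matching requires a load, and structural semantics requires canonicalizing the singletons.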

And none of this is limited to OCaml. You'll have similar problems with most other polymorphic languages.

I went through the top 50 in the TIOBE index. I might have missed some, but here's a rough survey:

Most of these languages use boxed integers when interacting with polymorphic/generic functions. So this survey suggests that, at least for an MVP implementation, boxed integers is a reasonable alternative for OCaml.

sjrd commented 4 years ago

We'd like to avoid hampering the design of future languages targeting Wasm as well as support existing languages like C# in a way that is competitive with their existing native implementations.

If you'll allow me to give another page of the Scala experience book: about 10 years ago we also tried to compile Scala to .NET. That effort, despite significant (multi-person-year) engineering, ended up a total fiasco, completely abandoned, never to be picked up again. Why? Because of the reified types of .NET. Reified types in the VM might be great if the source language and the VM agree about their type system. In the case of Scala, there were minor mismatches between its type system and that of .NET. This eventually meant that Scala and its type system simply could not be compiled accurately to .NET, because .NET would insist that its type system was reified, and it came into conflict with Scala's.

At the opposite end of the spectrum, the compiler from Scala to JavaScript is a huge success, which maintains semantic equivalence with the original Scala to such a large extent that all major libraries cross-compile with virtually nothing else than build tool configuration. That's on a platform that only has dynamic types, the complete opposite of reified types.

So the experience of Scala is that targeting a VM with specialization is much harder, sometimes impossible. If you'd like to avoid hampering the design of future languages, which might have a type system that does not completely agree with your choice of type system today, my advice would be not to reify types.

sabine commented 4 years ago

This is the boxing solution I have come up with so far:

(global $boxed_unboxed_integer_tag i32 (i32.const 245))

(type $boxed_unboxed_integer (struct (field $value i32)))
(type $value (struct (field $tag i32) (field $contents (ref $block_contents))))
(type $block_contents (array (mut anyref)))

(func $is_int (param $x (ref $value)) (result i32)
  (local.get $x)
  (struct.get $value $tag)
  (global.get $boxed_unboxed_integer_tag)
  (i32.eq)
)

(func $unbox_boxed_unboxed_integer (param $a (ref $value)) (result i32)
  (local.get $a)
  (struct.get $value $contents)
  (i32.const 0)
  (array.get $block_contents)
  (ref.cast (ref $boxed_unboxed_integer))
  (struct.get $boxed_unboxed_integer $value)
)

(func $box_boxed_unboxed_integer (param $x i32) (result (ref $value))
  (global.get $boxed_unboxed_integer_tag)

  ;; create the inner box
  (local.get $x)
  (struct.new $boxed_unboxed_integer)

  ;; make an array of length 1, initialized with the inner box
  ;; (array.new takes an initial value and a length, so no array.set is needed)
  (i32.const 1)
  (array.new $block_contents)

  ;; create the uniform value
  (struct.new $value)
)

Note that we need to double-box, because we do not know beforehand whether a given field of a heap block is a scalar or a reference. That's why we need a tagged block that we can do the Is_int check on. Maybe there is a less costly solution?
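Modelled in OCaml (hypothetical helper names; `Obj.t` stands in for anyref), the same encoding reads:

```ocaml
(* A uniform value is a tagged block whose contents are all references,
   so an unboxed integer needs an inner box before it can sit in the
   contents array -- hence the double boxing. *)
type uniform = { tag : int; contents : Obj.t array }

let boxed_unboxed_integer_tag = 245

let box_int (x : int) : uniform =
  { tag = boxed_unboxed_integer_tag; contents = [| Obj.repr (ref x) |] }

let is_int (v : uniform) : bool =
  v.tag = boxed_unboxed_integer_tag

let unbox_int (v : uniform) : int =
  let inner : int ref = Obj.obj v.contents.(0) in
  !inner
```

Every integer load thus costs two dereferences plus a tag check, which is what i31ref would collapse to a single bit test.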

sabine commented 4 years ago

@RossTate

I went through the top 50 in the TIOBE index. I might have missed some, but here's a rough survey:

  • (Parametric) polymorphic languages whose predominant implementation uses unboxed integers: OCaml, SML
  • (Parametric) polymorphic languages whose predominant implementation does not use unboxed integers: Java, C++ (although not impredicative), C#, Swift, Rust (impredicative?), Dart, Kotlin, Groovy, Scala, Haskell

What about Lisp and Ruby? If I am not mistaken, they belong to the former camp. Python belongs to the latter (everything is boxed).

There is no reason to limit this list to languages with strict typing and polymorphism, as the usefulness of i31ref is not necessarily tied to that.

It would be good if we don't include languages that are not, in their original implementations, using garbage collection as their overall memory management strategy (C++, Rust). I'll retract that suggestion if there is sufficient evidence that C++ or Rust would realistically be using the WASM GC to implement some of their memory management features.

@titzer

I generally think that i31ref is pretty-alright, not necessarily because it is great for compiling polymorphic statically-typed languages, but because it will be extremely useful for dynamically typed languages.

Maybe we should explore more on this point, how i31ref is going to be useful for dynamically typed languages, considering that they make up a significant share of the garbage collected languages.

@sjrd Thanks for the insights on Scala's effort of compiling to .NET. Entering into such a kind of heroic but ultimately futile effort is exactly what we as a compiler team worry about.

OCaml has stories related to compiling to LLVM (which, back in the days did not have the support for garbage collection that OCaml needed, but today might be a more suitable compilation target).

And, again, similar to Scala, compilers to JavaScript exist and are very successful, despite their shortcomings (e.g. exception handling).

rossberg commented 4 years ago

Oh shoot, I did forget about functors. That does complicate things. That's where more whole-program considerations would have to come in. 😕

FWIW, my example did not even contain a functor.

type t = A | B | C | D of int * int

You just have A, B, and C each be some singleton reference. For pattern matching, you could either compare with each singleton reference one by one, or those references could each have an i32 field indicating the case that you could switch on.

You could, but that's more complicated, especially for polymorphic variants, where you would now need a runtime system that maintains a centralised cache to canonicalise these singletons and achieve structural semantics.

And none of this is limited to OCaml. You'll have similar problems with most other polymorphic languages.

I went through the top 50 in the TIOBE index. I might have missed some, but here's a rough survey:

  • (Parametric) polymorphic languages whose predominant implementation uses unboxed integers: OCaml, SML
  • (Parametric) polymorphic languages whose predominant implementation does not use unboxed integers: Java, C++ (although not impredicative), C#, Swift, Rust (impredicative?), Dart, Kotlin, Groovy, Scala, Haskell

Okay, I never defined "polymorphic language", but it should be fairly obvious that I was referring to the breed of high-level languages with polymorphic type inference or something close to it, where polymorphism is ubiquitous and used heavily and implicitly. That would include ML, OCaml, F#, Haskell, Agda, Clean, and so on, but not Java or C#, etc, which have quite different use patterns. Scala is somewhere in-between. C++ and Rust don't even have GC, so are totally beside the point.

Among these, Haskell is a special case: due to laziness, it already has to pay the boxing price most of the time, so boxing ints by default is less painful (GHC also has unboxed ints, but they are not first-class). Nevertheless, other Haskell implementations have used tagged ints. For that reason, the language standard explicitly states that ints need only be 30 or 31 bit.

Languages like Groovy or Scala have specifically been built for the JVM, so didn't have a choice. Their performance also tends to be quite inferior to native functional language implementations, due to the inherent limitations and language bias of the JVM. One goal for Wasm was not to repeat such mistakes.

The Dart VM most definitely uses tagged integers. As do major JavaScript VMs, Smalltalk, Ruby, Scheme & Lisp, and other dynamic languages, calling them fixnums or smi's, among other names.
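For reference, the fixnum scheme these VMs (and OCaml) use can be sketched in a few lines (assuming OCaml-style 2n+1 tagging; V8's smis use the complementary convention of a clear low bit):

```ocaml
(* A scalar n is stored as 2n+1, so the low bit distinguishes it from a
   word-aligned pointer (whose low bit is always 0). *)
let tag n = (n lsl 1) lor 1
let untag v = v asr 1  (* arithmetic shift preserves the sign *)

(* Some arithmetic works directly on tagged values:
   (2a+1) + (2b+1) - 1 = 2(a+b) + 1 *)
let tagged_add a b = a + b - 1
```

This is the one-bit check a GC pays when walking the heap; the per-operation tagging overhead falls on the mutator's integer arithmetic instead.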

aardappel commented 4 years ago

@jakobkummerow (and @rossberg and @titzer who made a similar point):

I'd also like to point out that probably none of the browser Wasm engines are going to provide this data point: whether through pointer-tagging or nan-boxing, they all have unboxing techniques already in widespread use in their GC implementations, so when they add Wasm-GC support (with or without i31ref specified, and with or without i31ref actually occurring in the running Wasm module), they won't be able to turn that off.

I hadn't considered that, and that would indeed be a very pragmatic reason to support it. But I for one will be really sad that we're going to have our hand forced by JS implementation details on the design of a Wasm feature. Once we have i31ref, these "free" bit checks will never be able to be removed from any Wasm engine, ever, whether it also runs JS or not.

@rossberg

Are you assuming that everybody should be doing whole program compilation? That clearly is not what Wasm intends to impose,

It kinda does currently, if a single .wasm is to be the output. And if that is the case, languages might as well make use of it to compile their programs to more efficient code.

It has actually been observed that unboxing can make GC slower than tagging, not faster

I can totally believe that. I also still hope that GC is but a small slice of runtime cost, and accessing raw untagged values during the remainder is usually faster.

RossTate commented 4 years ago

@sabine

(type $boxed_unboxed_integer (struct (field $value i32)))
(type $value (struct (field $tag i32) (field $contents (ref $block_contents))))
(type $block_contents (array (mut anyref)))

Yeah, so there are a number of reasons for needing this triply boxed inefficient design in the current MVP.

  1. You don't want to use the given casting mechanism to distinguish between integers and other values, because it's very inefficient for such purposes. The casting mechanism has been designed with the expectation that everyone must share the same casting mechanism, and as such it is the worst of all worlds.
  2. The MVP does not provide a good way to mix scalar and referential data in a way that can be walked without knowing the precise type of the structure. Since OCaml has operations like structural equality/comparison/hashing, you're therefore forced to use an array of references as your $value type, and therefore you have to box integers to put them in that array.
  3. Arrays don't have headers in the current MVP because that's not strictly necessary; you can always do the boxing trick you're employing. Despite the fact that many languages would benefit from arrays with headers, and that arrays with headers would impose little-to-no cost, they have been excluded from the MVP based on the minimality principle. Yet somehow that same reasoning does not apply to i31ref.
  4. An i32ref was rejected for the MVP because it would be inconsistently boxed/unboxed across different architectures; consistently bad performance was preferred.

That said, you could optimize integers by making other operations slower: use anyref for $value and have your values be either boxed integers or arrays. Unfortunately this means you'll have to do a slow cast nearly every time you use a value, again because the MVP does not support case casts.

Note that i31ref would only fix this problem for OCaml int. An OCaml float would still be stuck with the above bad options.

sabine commented 4 years ago

@gasche

A reference on the use of a JIT to perform type-directed specialization at runtime would be the article "Design and Implementation of Generics for the .NET Common Language Runtime", by Andrew Kennedy and Don Syme, 2001.

This was very interesting. Though, it seems the described strategy would not work for OCaml. Early on, they state "We do not currently support the higher-order types and kinds that feature in Haskell and in encodings of the SML and Caml module systems,".

@aardappel regarding separate compilation: I don't think, at least for OCaml, there is a problem with separate compilation on WASM. While linking is not (yet) part of the official spec, it is entirely possible to do one of two things: (1) use the host environment of the WASM engine to link at runtime (in JavaScript, this is possible right now, I expect it has a noticeable impact on runtime performance, but shows that it can be done), or (2) implement a linker that links several .wasm files into a single one.

@RossTate

The MVP does not provide a good way to mix scalar and referential data in a way that can be walked without knowing the precise type of the structure.

I agree with this assessment. If we could look into the tag field of a struct (always an integer) and then, if it is a boxed integer, cast to a (type (struct (field $tag i32) (field $boxed_integer_value i32))), that would help. I estimate this to be a pretty likely optimization because it is probably useful for a lot of Tiobe-index languages.

Despite the fact that many languages would benefit from arrays with headers, and that arrays with headers would impose little-to-no cost, they have been excluded from the MVP based on the minimality principle. Yet somehow that same reasoning does not apply to i31ref

That's an interesting observation. I suppose this is because arrays with headers are an optimization that is highly likely to appear in the long run, and everyone is fairly confident that it will eventually be taken care of.