WebAssembly / binaryen

Optimizer and compiler/toolchain library for WebAssembly
Apache License 2.0
7.42k stars 735 forks source link

Recursive Types #3063

Closed tlively closed 2 years ago

tlively commented 4 years ago

The typed function reference proposal introduces directly recursive and mutually recursive types, and the possibilities for recursion are expanded in the original GC proposal, which @dcodeIO is currently working on prototyping in Binaryen. #3012 expands Binaryen's type system to support arbitrarily nested types introduced in the GC proposal, but it does not allow for the creation of recursive types. This issue sketches out my thinking on how we could build support for recursive types.

First of all, creating a recursive type requires that the type be available before its definition is constructed. That means we need a way to pre-allocate types, and because of mutual recursion, we will need to be able to pre-allocate an arbitrary number of types at once. The preallocated types should be in an uninitialized state that makes them unusable while their definitions are constructed. Then they should be "finalized" all at once, which should involve applying the type definitions then canonicalizing and de-duplicating the underlying interned type definitions, as happens today. Preallocated types that turn out to be duplicates should be freed and should not be leaked to the rest of the program.

Canonicalization of recursive types is going to be the tricky part. The way #3012 is currently written, only value_types as defined in the grammar of GC types are represented as Types, unlike in the WebAssembly text and binary formats, which give both value_types and def_types identifiers in the type section. That means we have to deduplicate slightly more complex structures, but doesn't change the fundamental problem much.

The state of the art for canonicalizing and checking equality of equi-recursive types like these seems to be described in https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.2276, which I have not yet had a chance to read. Should be fun, though!

tlively commented 4 years ago

Ok, so I read that paper. The gist is that it presents an algorithm for canonicalizing recursive types in O(n log n) time via a clever variable numbering scheme a la De Buijn indices. The downside is that the presentation in the paper is extremely dense and further complicated by the presence of universal types in the system the paper considers as well as discussion of many properties we don't care about. On the other hand, there are diagrams and examples and a complete low-level description of the algorithm, so I believe that with some work we would be able to translate that algorithm into code.

I also read another paper, https://open.bu.edu/handle/2144/1800, which solves the problem of canonicalizing, equating, and hash-consing recursive types by translating them to DFAs, minimizing the DFAs, then translating the DFAs back to recursive types. This paper was much easier to read, but that's mostly because all the complexity was offloaded onto DFA minimization algorithms presented elsewhere. There were also no diagrams or examples, so I'm not sure it would be any easier to implement this method. The algorithm this paper presents also requires a completely separate DFA implementation and has a time bound of O(n^2), so I think we should try to implement the method from the first paper first.

MaxGraey commented 4 years ago

What if represent equirecursive types as "infinite trees"? https://www.ps.uni-saarland.de/courses/seminar-ws02/RecursiveTypes.slides.pdf

So type equality of recursive types may be simplified to infinite tree equality. WDYT?

Another approach is using bisimulation and circular coinduction of streams: https://www.cs.ru.nl/bachelors-theses/2013/Robin_Munsterman___4070968___Equality_of_infinite_objects.pdf

dcodeIO commented 4 years ago

Wondering if we can get around a fancy algorithm due to binary and text format, by design, already giving us a canonicalized name or index upon reading, so we may just use a placeholder type for every name or index of an unknown type seen and canonicalize into our internal map once parsing is complete.

Directly recursive case:

(struct $a (field (ref $a))
;;       ^ $a = Struct()
;;                      ^ $a.fields += &$a

Indirectly recursive cases:

(struct $a (field (ref $b))
;;       ^ $a = Struct()
;;                      ^ $a.fields += &($b = Placeholder())
(struct $b (field (ref $a))
;;       ^ &&$b = Struct()
;;                      ^ $b.fields += &$a

I'm probably missing something? :)

tlively commented 4 years ago

@MaxGraey, yes, both those papers present concrete algorithms for checking equality on infinite trees. It's complicated because you can't actually calculate or store the infinite tree directly. My guess is that the bisimulation and coinduction approaches would be very similar.

@dcodeIO Unfortunately I think there are edge cases that will require more sophistication. For example, consider the following progressively more interesting cases. I use a slightly folded representation of the types for brevity.

(type $a (func (params (ref $a)) (results (ref $b))))
(type $b (func (params (ref $a)) (results (ref $b))))
(type $c (func (params (ref $c)) (results (ref $c))))

Here, $a and $b are syntactically identical, so it easy to see that $b is a duplicate and can be replaced with $a. This syntactic equivalence between types is also present in your example above. Performing the deduplication yields:

(type $a (func (params (ref $a)) (results (ref $a))))
(type $c (func (params (ref $c)) (results (ref $c))))

After that substitution, it is clear that $a and $c have the same structure and no references to additional types. An algorithm that represents self-references in a uniform manner would be able to figure this out.

But here's a more interesting example:

(type $a (func (params (ref $a)) (results (ref $b))))
(type $b (func (params (ref $b)) (results (ref $a))))

Here $a and $b are syntactically different. They still have the same structure, but you kind of have to squint at it and look at both types at the same time to figure it out. An algorithm that can represent self-references in a uniform manner would still have to look through the reference to the additional type and detect that there is a symmetry here, which is already getting complicated. But it can get arbitrarily worse!

(type $a (func (params (ref $b)) (results (ref $c))))
(type $b (func (params (ref $c)) (results (ref $a))))
(type $c (func (params (ref $a)) (results (ref $b))))

These types are also all equivalent, but they don't even include references to themselves that can be abstracted away by some canonicalization. I don't think there is a straightfoward algorithm that would be able to figure this out.

tlively commented 4 years ago

Oh, and all this is just for the initial hash-consing of the types that lets us check equality in constant time. I haven't even looked into calculating subtyping relations yet šŸ™ƒ

dcodeIO commented 4 years ago

Hmm, but looking at your examples, the only equality we need during parsing is of names/indexes, right? So after parsing, we have a set of potentially duplicate types with unique names/ids that we can deduplicate by first hashing the contents of the now complete types, and on collisions check if equal, and if equal deduplicate? Edit: Oh, just realized that one does not simply hash recursive types.

tlively commented 4 years ago

Yeah exactly, before you hash it you have to canonicalize it somehow. The papers I linked to both provide solutions for canonicalizing recursive types.

dcodeIO commented 4 years ago

That moment you realize that you need to collect potentially cyclic types in order to collect potentially cyclic garbage to ultimately overcome the cycle of prematurely assuming that there must be an easier way. Damn cycles. Cycles everywhere. Sorry for the noise :)

dcodeIO commented 4 years ago

Have been thinking about this a little while looking into other changes, and it appears that either some fundamental change to how we are canonicalizing types will be necessary or we'd need a separate TypeBuilder or similar doing some pointer magic just to be able to construct recursive types.

For instance, even if we'd have an API like NewStruct() creating a bare struct without any fields, and we'd then call AddStructField() to populate it, at some point we need to make the recursive reference involving a Type (pointer), which is not yet available as long as the full type has not yet been canonicalized, and we can't change an existing type because that'd change its hash that we already used to store it, I think?

In either case, i.e. fundamentally changing how types are represented internally or adding a builder doing the necessary C++ magic, this looks like something better suited to be done by someone with a clear vision of how all of this should fit together. Is there a chance that someone of you guys can pick this rather interesting task up? :)

tlively commented 4 years ago

Yes, I have some ideas about how to support creating recursive types. I can start on an implementation sometime this week :)

askeksa-google commented 3 years ago

Any progress on this? šŸ™‚

Canonicalization aside, it would be very useful for Binaryen to support reading and processing modules defining recursive struct types, as these arise naturally when translating many programs. It is currently blocking running Binaryen on the output from the Dart-to-Wasm prototype.

tlively commented 3 years ago

Thanks for the ping! There has been some progress, particularly the infrastructure committed in #3418. The remaining tasks are to incorporate that infrastructure into the binary and text parsers and to update various routines that traverse types to gracefully handle recursive types. It's good to know that this is blocking the Dart prototype from working with Binaryen; I will prioritize this work again.

tlively commented 3 years ago

Update: I have landed #3584 and #3588 that update the text and binary parsers to support use-before-definitions of types, so the "only thing" left is to figure out how represent and construct the recursive types themselves. Yes, that's the hard part, but at least all the code for actually using that functionality is done now. I don't see why we wouldn't have full recursive type support within a week.

tlively commented 3 years ago

Another update: Now that #3602, #3610, and #3624 have landed, recursive types should mostly work and be ready for experimentation. There is a known bug in which extra types are sometimes emitted into the output, but it shouldn't affect correctness and I have a fix pending in #3626. Please let me know about any other bugs you might find,

While the GC proposal currently uses equi-recursive structural types and the V8 prototype currently implements nominal types, Binaryen currently implements an ad-hoc hybrid system where self-referential type definitions (i.e. definitions of types that directly or indirectly refer contain themselves) are nominal and all other types definitions are structural. Despite the underlying weirdness here, the difference between nominal and structural types should not matter at all as long as you don't go out of your way to try detect it.

askeksa-google commented 3 years ago

Recursive types seem to be working, thanks! Binaryen can now successfully optimize some programs coming from dart2wasm.

I have now run into an issue with implicit upcasts from subtypes, though. This seems to be unrelated, so I have filed a separate issue.