WebAssembly / design

WebAssembly Design Documents
http://webassembly.org

AST and layer 0 binary format design considerations #494

Closed: kg closed this issue 8 years ago

kg commented 8 years ago

Going to document this as an issue so we can have an ongoing discussion about it.

I've seen a lot of flux in the shape of the AST and some misunderstandings about what needs to go into the layer 0 version of the binary format and compressed layers. I suspect this is due to a lack of clear communication on my part about how this needs to work, so I'll lay some of these principles out here and explain why they're important:

One-off mechanisms should be avoided whenever possible AKA don't be clever

Inventing one-off ways to represent nodes is a problem because it complicates decoders and creates more space for bugs to hide. A great example here is how attributes for loads/stores are currently bit-packed in the v8 native format. This is something we may want to do as a part of compression, but it should absolutely not be done in layer 0.
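To illustrate the contrast, a minimal sketch - the bit layout below is invented for illustration and is not the actual v8 packing - comparing a one-off bit-packed decode of load attributes against reading them as ordinary immediates:

```ts
// Hypothetical layout, for illustration only: alignment in the low 3
// bits of a packed byte, a sign-extension flag in bit 3.
function decodeLoadPacked(packed: number): { align: number; signed: boolean } {
  return {
    align: packed & 0x07,          // one-off: this node's private bit layout
    signed: (packed & 0x08) !== 0,
  };
}

// Regular form: each attribute is a plain immediate, read by the same
// generic routine used for every other node.
function decodeLoadRegular(readImmediate: () => number): { align: number; signed: boolean } {
  return { align: readImmediate(), signed: readImmediate() !== 0 };
}
```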

The structure of nodes should be completely regular

With the exception of block nodes (see below), the number of immediates and child nodes in a given node should be completely fixed. Any variation in the # of immediates or child nodes should be factored out into separate opcodes or block nodes. This is extremely important. This also applies to entries in tables like the signature table.
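A minimal sketch of what this regularity buys (the table and opcode values here are invented): when every opcode's immediate count and child count are fixed, one generic routine can walk any node.

```ts
interface OpcodeShape { immediates: number; children: number }

// Every opcode's shape comes from one table; any variation in arity
// would instead be a separate opcode.
const SHAPES: Record<number, OpcodeShape> = {
  0x01: { immediates: 1, children: 0 },  // e.g. a const with one immediate
  0x02: { immediates: 0, children: 2 },  // e.g. a binary operator
};

function decodeNode(read: () => number): void {
  const opcode = read();
  const shape = SHAPES[opcode];
  if (shape === undefined) throw new Error(`unknown opcode ${opcode}`);
  for (let i = 0; i < shape.immediates; i++) read();         // fixed immediate count
  for (let i = 0; i < shape.children; i++) decodeNode(read); // fixed child count
}
```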

Immediates should be completely regular as well

This is more obvious, but immediates for a given node should always be the same type and they should never be optional. This also applies to tables like the signature table. Being regular in shape also implies that immediates in layer 0 should not have variable size.

Block node considerations

Block nodes are an unavoidable exception compared to other nodes - they have a varying number of children, and some of them have a child count that varies based on external data (function calls). The way this works should be extremely consistent, and we want to constrain the number of these opcodes that exist. Not following these rules will make it extremely difficult to evolve the format (in addition to the general issues described above). For block nodes, with the exception of function calls all of them will have some sort of length counter in front of the children in the binary format, so keep that in mind from an efficiency perspective.
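A sketch of the length-prefix scheme just described, reusing the generic node decoder from the previous sketch (declared here to keep the snippet self-contained):

```ts
declare function decodeNode(read: () => number): void; // as in the sketch above

// The one sanctioned irregularity: a block-like node carries a child
// count in front of its children.
function decodeBlock(read: () => number): void {
  const childCount = read();      // length counter precedes the children
  for (let i = 0; i < childCount; i++) {
    decodeNode(read);             // each child is an ordinary, regular node
  }
}
```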

Nesting vs sequences

Nested nodes deduplicate better and in our model nodes with a fixed number of children will be more efficient than nodes with a variable number of children. This is important to consider when designing nodes like br_if or tableswitch - they change the degree to which the tree is nested which will really hurt us in terms of file size.

lukewagner commented 8 years ago

Being regular in shape also implies that immediates in layer 0 should not have variable size.

I disagree on this point. Variable-length encoding does not make immediates irregular; they can quite regularly be given a variable-length encoding. Moreover, even if equivalent over-the-wire payload-size savings could be gotten in layer 1, layer 1 must still be streamed into a layer 0 decoder and thus we must still consider performance implications. I don't advocate putting every tiny byte-saving feature in layer 0 using this reasoning, but variable-length immediates stand out as having both a significant overall impact and low complexity. As a strawman illustration: layer 0 could be XML and layer 1 could be the binary format; we wouldn't want to do this b/c even if layer 1 could achieve the same payload size, layer 0 would be significantly slower.
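For concreteness, one common variable-length scheme (assumed here as an example; the thread does not name one) is unsigned LEB128: seven payload bits per byte, with the high bit set on every byte except the last. A minimal decoder, ignoring values beyond 32 bits:

```ts
function decodeVarUint(bytes: Uint8Array, offset: number): { value: number; next: number } {
  let value = 0;
  let shift = 0;
  let i = offset;
  while (true) {
    const b = bytes[i++];
    value |= (b & 0x7f) << shift;   // accumulate 7 payload bits per byte
    if ((b & 0x80) === 0) break;    // high bit clear: this was the last byte
    shift += 7;
  }
  return { value, next: i };
}

// decodeVarUint(new Uint8Array([0xe5, 0x8e, 0x26]), 0).value === 624485,
// encoded in three bytes instead of a fixed four.
```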

FWIW, other than variable-length immediates/opcodes and one other feature I've already mentioned, which allows an opcode in the opcode table to fix zero or more immediates, I don't know of any weird, hacky, or irregular features which would have significant-enough savings to belong in layer 0. So I don't anticipate a slippery slope here.
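A sketch of that second feature as described (the entry structure and names are invented): an opcode-table entry can pre-bind immediate values, so a frequent (operation, immediate) pair costs a single opcode byte.

```ts
interface OpcodeEntry {
  baseOp: string;             // the underlying operation
  fixedImmediates: number[];  // immediates baked into this opcode
}

const OPCODE_TABLE: OpcodeEntry[] = [
  { baseOp: "get_local", fixedImmediates: [] },   // index read from the stream
  { baseOp: "get_local", fixedImmediates: [0] },  // index fixed to 0, zero stream bytes
];

// Fixed immediates are substituted first; any remaining ones are read
// from the stream as usual.
function readImmediates(entry: OpcodeEntry, arity: number, read: () => number): number[] {
  const rest = Array.from({ length: arity - entry.fixedImmediates.length }, () => read());
  return [...entry.fixedImmediates, ...rest];
}
```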

kg commented 8 years ago

Variable-length encoding does not make immediates irregular; they can quite regularly be given a variable-length encoding.

When I say regular in shape and size I mean that the # of bytes (excluding child nodes) is fixed.

Moreover, even if equivalent over-the-wire payload-size savings could be gotten in layer 1, layer 1 must still be streamed into a layer 0 decoder and thus we must still consider performance implications.

This isn't true; you just vary the implementation of the functions you use to read immediates. js-astcompressor-prototype demonstrates this. An efficient decoder can do everything at once, but keeping layer 0 simple means development/debugging tools don't need to do any of the work.
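A sketch of what "vary the implementation" means in practice (helper names invented): the decoder body is parameterized on a single immediate-reading function, so a fixed-size layer 0 and a varint-using layer share everything else.

```ts
type ImmediateReader = (bytes: Uint8Array, offset: number) => { value: number; next: number };

// Layer 0 flavor: fixed four-byte little-endian immediates.
const readFixed32: ImmediateReader = (bytes, offset) => ({
  value: bytes[offset] | (bytes[offset + 1] << 8) |
         (bytes[offset + 2] << 16) | (bytes[offset + 3] << 24),
  next: offset + 4,
});

// A varint flavor (e.g. an LEB128 reader) slots in here; nothing else
// in the decoder changes.
function readImmediates(read: ImmediateReader, bytes: Uint8Array, offset: number, count: number): number[] {
  const values: number[] = [];
  for (let i = 0; i < count; i++) {
    const r = read(bytes, offset);
    values.push(r.value);
    offset = r.next;
  }
  return values;
}
```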

variable-length immediates stand out as having both a significant overall impact and low complexity

I think this needs to be supported with data.

As a strawman illustration: layer 0 could be XML and layer 1 could be the binary format; we wouldn't want to do this b/c even if layer 1 could achieve the same payload size, layer 0 would be significantly slower.

Why not? This is an assumption I'm not clear on. Layer 0 as text and layer 1 as binary would be an interesting and sensible decision, but we've been treating the text and binary encodings as separate thus far.

I don't know of any weird, hacky, or irregular features which would have significant-enough savings to belong in layer 0. So I don't anticipate a slippery slope here.

We have a number of irregular things in the current spec, like opcodes with optional arguments and opcodes with attributes. I called some of them out already.

lukewagner commented 8 years ago

This isn't true, you just vary the implementation of the functions you use to read immediates.

At least in the MVP and for possibly quite a while (until a layer 1 stabilized enough to be standardized), layer 0 would be the only thing implemented natively by the engine.

I think this needs to be supported with data.

This was measured on wasm-polyfill-prototype and the difference was significant. I have to run but I'll have to go back into the git history of BinaryEncoding.md to dig out the exact figure. IIUC, the v8 native binary format also uses variable-length immediates so you should be able to do an independent verification now as well.

kg commented 8 years ago

At least in the MVP and for possibly quite a while (until a layer 1 stabilized enough to be standardized), layer 0 would be the only thing implemented natively by the engine.

Does that matter, though? I thought the proposal was for layer 1 to stream through using Module Loaders, at which point the native decoder is just reading bytes. Varints at layer 0 are not especially relevant for something that only lives in memory and doesn't necessarily even need to be retained in a single large buffer.

This was measured on wasm-polyfill-prototype and the difference was significant. I have to run but I'll have to go back into the git history of BinaryEncoding.md to dig out the exact figure.

You measured the file size in a fully-compressed file, which is different than the question of what layer 0 should be like. Did you measure decode speed and source complexity with/without varints? Do you have a measurement for the risk imposed by complicating layer 0?

IIRC varints were a size win in your prototype, but so were other things we pruned out like immediate deduplication.

js-astcompressor's layer 0 stores opcodes as int32 and immediates at full size. Applying varints to every integer in the entire file on bananabread yields a pre-stream-compression savings of only 7MB, and it doesn't make decoding any faster. Incidentally, most of the ints written in bananabread are opcodes, so most of the size reduction is from opcodes being encoded as one byte.

IIUC, the v8 native binary format also uses variable-length immediates so you should be able to do an independent verification now as well.

Yes, the fact that we're currently using some of these features is why I created this issue.

ghost commented 8 years ago
    One-off mechanisms should be avoided whenever possible AKA don't be clever

Good principle. My first thought was to have a lower layer that decodes from the binary stream to a representation of the AST - for example, from the binary stream to the s-exp. Then even sections could be encoded in this format. I gave it a try using polyfill-prototype-1 but deferred it, as there is a huge difference between polyfill-prototype-1 and the current design. This might also help with skipping over forms when dealing with future new operators.

Inventing one-off ways to represent nodes is a problem because it complicates decoders and creates more space for bugs to hide. A great example here is how attributes for loads/stores are currently bit-packed in the v8 native format. This is something we may want to do as a part of compression, but it should absolutely not be done in layer 0.

Good point about the real need to compress layer 0, and how much weight to put on this. I would like to explore making some attempt to encode efficiently at layer 0, for the use cases that actually work at this layer and need to store it or stream it over a data channel. Since it is also the only official specification layer for now, it might also serve as the encoded source code.

    The structure of nodes should be completely regular

With the exception of block nodes (see below), the number of immediates and child nodes in a given node should be completely fixed. Any variation in the # of immediates or child nodes should be factored out into separate opcodes or block nodes. This is extremely important. This also applies to entries in tables like the signature table.

Sounds good. It's not clear that 'completely regular' would need to exclude a variable number of child nodes, and it might also be possible to support some optional immediate arguments. You might be right, but I'd like to check this. For example, if there were good support for variable-length arguments, then the 'attributes for loads/stores' might be optional arguments with high-frequency defaults, and it might be possible to allow more operators to accept a variable number of statements, avoiding some block operators. I would like to explore some options here.

What I tried was introducing atomic numbers so that the const operators were not needed, and allowing these in the general operator argument positions.

      Immediates should be completely regular as well

This is more obvious, but immediates for a given node should always be the same type and they should never be optional. This also applies to tables like the signature table. Being regular in shape also implies that immediates in layer 0 should not have variable size.

Certainly agree to keep the fixed immediates 'completely regular'.

    Block node considerations

Block nodes are an unavoidable exception compared to other nodes - they have a varying number of children, and some of them have a child count that varies based on external data (function calls). The way this works should be extremely consistent, and we want to constrain the number of these opcodes that exist. Not following these rules will make it extremely difficult to evolve the format (in addition to the general issues described above).

It could also make it easier to evolve the format if there were a regular way that optional and variable-length arguments were encoded - a layer-0 binary packing/unpacking layer.

There were a good number of other objects of variable length in polyfill-prototype-1, such as the various sections. To make the format easier to evolve, it seems useful to encode the sections in the same packing/unpacking format rather than have them all use irregular binary packing formats.

    Nesting vs sequences

Nested nodes deduplicate better and in our model nodes with a fixed number of children will be more efficient than nodes with a variable number of children. This is important to consider when designing nodes like |br_if| or |tableswitch| - they change the degree to which the tree is nested which will really hurt us in terms of file size.

It would be interesting to me if the design can use this principle to optimize encoding and processing while also keeping the code readable - perhaps people reading the code are aided by similar patterns, and I think programmers in general accept the principle of encapsulation.

Should layer 0 be encoded to regular byte boundaries, or could it be a bit stream? Layer-1 compressors might require regular byte boundaries.

AndrewScheidecker commented 8 years ago

For block nodes, with the exception of function calls all of them will have some sort of length counter in front of the children in the binary format, so keep that in mind from an efficiency perspective.

An alternative to encoding the number of operations at the beginning of the block would be to use a special opcode to indicate the next operation will be the last in the block. This might be a little simpler than encoding a potentially large immediate integer.

kg commented 8 years ago

An alternative to encoding the number of operations at the beginning of the block would be to use a special opcode to indicate the next operation will be the last in the block. This might be a little simpler than encoding a potentially large immediate integer.

Yeah, a sentinel 'end-of-list' opcode is something I haven't tried yet. It's more questionable from a validation perspective, and it means you can't pre-reserve space for the child nodes, but it might end up being better.

In both cases the length value and the sentinel opcode will probably be the same size, though.
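The two encodings under discussion, side by side (a sketch; the END opcode value and helper signatures are invented):

```ts
declare function decodeNode(read: () => number): void; // generic node decoder
const END = 0x00; // assumed sentinel opcode value

// Length-prefixed: the count is known up front, so space for the
// children can be pre-reserved.
function decodeBlockCounted(read: () => number): void {
  const count = read();
  for (let i = 0; i < count; i++) decodeNode(read);
}

// Sentinel-terminated: children are read until the end-of-list opcode,
// so the count is only known afterwards.
function decodeBlockSentinel(read: () => number, peek: () => number): void {
  while (peek() !== END) decodeNode(read);
  read(); // consume the END opcode itself
}
```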

lukewagner commented 8 years ago

I thought the proposal was for layer 1 to stream through using Module Loaders, at which point the native decoder is just reading bytes.

Regardless of whether it's streamed or not, the memory traffic will still be 50% greater if the layer 0 binary is 50% greater. Also:

You measured the file size in a fully-compressed file, which is different than the question of what layer 0 should be like.

I measured before and after gzip. Looking back at the data I reported, it was 31% size reduction before compression (so that would correspond to layer 0), and 7% after gzip.

To be clear, these measurements were given 1-byte opcodes. Now, we must definitely anticipate >256 ops, which means that, for opcodes, we'll either need a fixed encoding wider than one byte or a variable-length encoding.

js-astcompressor's layer 0 stores opcodes as int32 and immediates at full size. Applying varints to every integer in the entire file on bananabread is only a pre-stream-compression savings of 7mb, and it doesn't make decoding any faster. Incidentally, most of the ints written in bananabread are opcodes, so most of the size reduction is from opcodes being encoded as one byte.

I'm not sure if that's an apples-to-apples comparison given that overall js-astcompressor had bigger file size and slower decode time. It'd be useful to measure instead on v8-native.

ghost commented 8 years ago
    An alternative to encoding the number of operations at the beginning of the block would be to use a special opcode to indicate the next operation will be the last in the block. This might be a little simpler than encoding a potentially large immediate integer.

Yeah, a sentinel 'end-of-list' opcode is something I haven't tried yet. It's more questionable from a validation perspective, and it means you can't pre-reserve space for the child nodes, but it might end up being better.

In both cases the length value and the sentinel opcode will probably be the same size, though.

Might get away with a single bit per opcode for most cases. This would decrease encoding efficiency in some areas, but might be compensated for in other ways. For example, if and if_unless could accept a variable number of statements, avoiding block. Separate opcodes might not be needed for the same operator with a different number of arguments; for example, if and if_else could share the same opcode but have two versus three arguments. The packing and unpacking might all be managed in a low-level layer that is not too complex. A sketch of the single-bit idea follows.
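The flag position here is invented for illustration: steal the top bit of each opcode byte to mark the final item of the enclosing argument list.

```ts
const LAST_FLAG = 0x80; // assumed: high bit marks the last list item

// Reads opcode bytes until one carries the flag; no length prefix and
// no separate sentinel opcode are needed.
function readVariadicOpcodes(readByte: () => number): number[] {
  const opcodes: number[] = [];
  while (true) {
    const b = readByte();
    opcodes.push(b & ~LAST_FLAG);      // low 7 bits: the opcode itself
    if ((b & LAST_FLAG) !== 0) break;  // flag set: last item in the list
  }
  return opcodes;
}
```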

kg commented 8 years ago

A necessary clarification: I've seen a lot of confusion about what the terms 'layer 0', 'layer 1', etc actually mean.

Layer 0 is effectively the uncompressed IR, something suitable for use as input to a JIT, object files produced from compiling source files, that sort of thing. A functioning implementation need only decode layer 0 binaries as long as you have a tool somewhere that can convert other representations into layer 0.

Layers 1 through N-1 are the IR with structural compression applied - changes to data layout (splitting individual data out into streams), changes to basic representation (e.g. varints vs fixed-size ints), and deduplication. In practice we'll probably only have a layer 1, but there might be a second level of granularity here... there are some obvious parallels in terms of making JS source small, etc.

Layer N is the IR with stream compression applied on top of everything else. You take something that's already been structurally compressed and feed it through a general-purpose compressor like GZIP or Brotli. You could also use a special-purpose compressor, or a specialized variant of an existing one.
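Put as a pipeline (stage names invented; each stage is independently replaceable and could eventually be fused into a single native pass):

```ts
declare function gunzip(wire: Uint8Array): Uint8Array;            // layer N: generic stream decompression
declare function undoStructural(layer1: Uint8Array): Uint8Array;  // layers 1..N-1: structural decompression
declare function decodeLayer0(layer0: Uint8Array): unknown;       // layer 0: plain uncompressed IR

// A naive implementation runs the stages in sequence; a production
// decoder could do all of them at once over the incoming stream.
function loadModule(wire: Uint8Array): unknown {
  return decodeLayer0(undoStructural(gunzip(wire)));
}
```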

None of the 'layers' in design terms are "user space", and compression techniques very explicitly do not go* in layer 0. A production-quality native implementation would likely implement all of the layers in native code for optimal decode & compile times, but layering lets us start simple.

My old proposal had four layers: Layer 0, 1 and 2 as above, with Layer 3 being stream compression. The main gap between 1 and 2 was tree deduplication (nullary macros).

For future discussions, perhaps we should use a simple convention:

* It's possible there are a couple specific cases where we need to optimize layer 0's representation so that things aren't a total nightmare - for example, varints make the file dramatically smaller - but these are exceptions and they involve tradeoffs.

kripken commented 8 years ago

I'm surprised by the middle layers. I thought that on the web, we would initially have only layer 0 in the VM (and N from the browser's network stack), and implement those middle layers in "userspace". And later, we'd consider standardizing those things depending on which turn out to be most effective. Or, is what you described for the far future?

kg commented 8 years ago

If structural and stream compression are implemented in userspace, odds are wasm will have dramatically worse startup time than gzip'd js as a result. Memory usage will be worse too.

You are correct that our starting point is layer 0 in the VM, everything else initially outside (or not implemented at all). But that's a stopgap like the polyfill. So it's dangerous to classify any of the layers as "user space", essentially, even if we happen to implement them there.

And yeah, layer N can live in the network stack - you could use content-encoding GZip or content-encoding Brotli (once it exists) there if you wanted.

I also think you're correct that we don't have to standardize all the middle layers at once. It's just important not to convince ourselves that they will always be written in JS or wasm, because that constrains our ability to actually make things fast in the future. It needs to be possible for us to arrive at a state a couple years down the road where a browser vendor can implement every single layer in native code, without intermediate stages or temporary storage. That will enable really important use cases.

ghost commented 8 years ago

Unless the browsers add a streaming interface for receiving and compiling wasm layer 0, implementing the structural organisation and compression in JS will not be useful for large code because of the latency. I am sceptical that they will want to add a streaming interface, so I am sceptical that any plan that has the stream going through user JS code is likely to be widely used. Rather, I expect we just need to build the flexibility to do the structural compression into the layer 0 binary format consumed by the runtimes, and the compression can just use the existing HTTP transfer-encoding support, or the web wasm runtimes might just be expected to integrate Brotli etc. I think wasm will be DOA if it does not reduce latency from the beginning, and it is wishful thinking that important aspects of the binary encoding design can be deferred.

kripken commented 8 years ago

@kg: thanks, now this is clear to me.

kg commented 8 years ago

@JSStats: arguably, design cannot be deferred, but layering lets us defer implementation, which is very useful for a process like this. And we still do have the awesome possibility that years in the future someone will come and take our Very Good layer 2 and add an even better layer 3 on top that can eventually be implemented in all the major browsers.

lukewagner commented 8 years ago

I am sceptical that they will want to add a streaming interface

It depends on the Streams API of course, but work is evidently underway. Given Streams, it is totally reasonable to expect a bunch of new stream sources and sinks to appear all over the web platform, and streams have already been specifically discussed in the context of both the loader API and the previously proposed FunctionPromise API, so extending WASM.eval (or whatever we call it) to accept a stream would fit right in.

ghost commented 8 years ago

@lukewagner Interesting, thank you. If the browser vendors are happy with a streaming API to consume an inflated layer 0 wasm binary then that would help.

There is also the matter of caching, which has all been discussed before with seemingly little progress. It will hurt restart performance if the user layer needs to re-run each time the code is restarted.

What about the blob that is transported and consumed by the upper user defined layers in JS?

The design hurdles just appear too great and the progress too slow to get something useful out soon, hence it seems that just bundling some structural organisation capability into the initial binary encoding and using Brotli is going to be the practical outcome. A stretch goal might be to add a specialized prediction model to Brotli for the AST.

But perhaps when people get the stack running their attention will change and some quick progress can be made here.

lukewagner commented 8 years ago

@JSStats Agreed that caching is another important story. One idea we've briefly discussed: in addition to WASM.eval, which takes an ArrayBuffer or Stream as input and produces a Promise resolving to an export object (whose fields are JS wrappers around the wasm exports), we could have another API (or a flag to WASM.eval) that instead produced a Promise resolving to an "uninstantiated module" object. This object could be defined to be structured cloneable, which means it could be (1) stored explicitly with the IndexedDB/Cache APIs, and (2) postMessaged between workers and windows (allowing cross-origin caching). While it would be opaque, the idea is that the engine representation of these "uninstantiated modules" would literally be the machine code and metadata. (Due to internal browser events, the uninstantiated module would likely also need to keep the wasm binary around so it could regenerate new machine code as needed.) Given an uninstantiated module, you could instantiate it by calling a method (given the imports), and this method would return a new Promise resolving to the export object. Altogether, this could give the developer quite a lot of control over compilation and caching.
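A sketch of that flow using the hypothetical names from this thread - WASM.eval, an uninstantiated-module flag, and an instantiate method; none of these signatures were settled at the time, and a simple key-value wrapper over IndexedDB is assumed:

```ts
declare const WASM: {
  eval(bytes: ArrayBuffer, opts?: { uninstantiated?: boolean }): Promise<any>;
};
declare const moduleStore: {
  get(key: string): Promise<any>;              // assumed IndexedDB wrapper
  put(key: string, value: any): Promise<void>;
};

async function loadCached(url: string, imports: object): Promise<any> {
  let module = await moduleStore.get(url);     // structured-clone read from the cache
  if (!module) {
    const bytes = await (await fetch(url)).arrayBuffer();
    module = await WASM.eval(bytes, { uninstantiated: true });
    await moduleStore.put(url, module);        // caches machine code + metadata
  }
  return module.instantiate(imports);          // Promise resolving to the exports
}
```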

ghost commented 8 years ago

@lukewagner Thank you. It sounds like a workable plan, but only with that "uninstantiated module" support. The key is that the runtime must not depend on the user layers deterministically generating the same layer 0 input for the same upstream blob - that it be entirely the user layer JS code's responsibility to choose an uninstantiated module object to reuse or to choose to decompress the source blob again.

jfbastien commented 8 years ago

@kg great summary of layers. Could you add it to BinaryEncoding.md? It's something we've discussed in person quite a bit and I'm afraid we've got lots of shared context that non-in-person folks have no way to page in. It may also be useful to add some "why?" to Rationale.md.

What you wrote is what I remembered, we seem to have consensus here (@lukewagner / @titzer?), so issue consensus → PR ↝ profit!

lukewagner commented 8 years ago

Yes, it'd be good to improve BinaryEncoding.md with some of this discussion.

flagxor commented 8 years ago

This goes in a bunch of directions and I believe has been subsumed by other more-fine-grain items. Closing.