WebAssembly / design


Consider local variable indexes popped from a function end section. #511

Closed. ghost closed this issue 8 years ago.

ghost commented 8 years ago

Assuming that a custom AST compression algorithm is not practical in the short term, there is still some simple binary-format support that could be considered to improve compression with typical web compression algorithms. One important factor is exploiting the copying of data in the ring buffer for LZ77-style algorithms (the other factor being coding efficiency), and one change that seems to really help is separating the local variable indexes into their own section. Since we seem to want to work at function-level granularity, these indexes could be emitted at the end of each function.

Here's an example using Brotli and the current v8 binary encoding. Notice that the copying (Cnnnnn) is broken up with literals (L) wherever the local variable indexes necessarily differ.

AST                  position code length literal/copy
block                   1374 01 1 C00697
| 5                     1375 05 1 C00697
| set_local             1376 0F 1 C00697
| | 43                  1377 2B 1 L     
| | get_local           1378 0E 1 C00707
| | | 8                 1379 08 1 C00707
| set_local             1380 0F 1 C00707
| | 42                  1381 2A 1 L     
| | get_local           1382 0E 1 C00711
| | | 11                1383 0B 1 C00711
| set_local             1384 0F 1 C00711
| | 41                  1385 29 1 L     
| | get_local           1386 0E 1 C00715
| | | 12                1387 0C 1 C00715
| set_local             1388 0F 1 C00715
| | 40                  1389 28 1 L     
| | get_local           1390 0E 1 C00719
| | | 14                1391 0E 1 C00719
| loop                  1392 02 1 C00719
| | 1                   1393 01 1 C00719
| | block               1394 01 1 C00719
| | | 7                 1395 07 1 C00719
| | | if                1396 03 1 C00719
| | | | i32.eq          1397 4D 1 C00719
| | | | | get_local     1398 0E 1 C00719
| | | | | | 42          1399 2A 1 L     
| | | | | i8.const      1400 09 1 C00729
| | | | | | 0           1401 00 1 C00729
| | | | block           1402 01 1 C00729
| | | | | 9             1403 09 1 C00729
| | | | | set_local     1404 0F 1 C00729
| | | | | | 15          1405 0F 1 C00729
| | | | | | get_local   1406 0E 1 C00729
| | | | | | | 13        1407 0D 1 C00729
| | | | | set_local     1408 0F 1 C00729
| | | | | | 16          1409 10 1 C00729
| | | | | | get_local   1410 0E 1 C00729
| | | | | | | 43        1411 2B 1 L     
...

Moving the get_local and set_local indexes into a function-end section allows this section of the AST to be copied without interleaving literals.

block                  1088 01 1 C00592
| 5                    1089 05 1 C00592
| set_local            1090 0F 1 C00592
| | get_local          1091 0E 1 C00592
| set_local            1092 0F 1 C00592
| | get_local          1093 0E 1 C00592
| set_local            1094 0F 1 C00592
| | get_local          1095 0E 1 C00592
| set_local            1096 0F 1 C00592
| | get_local          1097 0E 1 C00592
| loop                 1098 02 1 C00592
| | 1                  1099 01 1 C00592
| | block              1100 01 1 C00592
| | | 7                1101 07 1 C00592
| | | if               1102 03 1 C00592
| | | | i32.eq         1103 4D 1 C00592
| | | | | get_local    1104 0E 1 C00592
| | | | | i8.const     1105 09 1 C00592
| | | | | | 0          1106 00 1 C00592
| | | | block          1107 01 1 C00592
| | | | | 9            1108 09 1 C00592
| | | | | set_local    1109 0F 1 C00592
| | | | | | get_local  1110 0E 1 C00592
| | | | | set_local    1111 0F 1 C00592
| | | | | | get_local  1112 0E 1 C00592
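
To make the transform concrete, here is a minimal sketch, in Python, of how a producer might perform the split. It assumes a pre-decoded function body as a flat list of (opcode, immediate) pairs rather than raw bytes; the opcode values 0x0E (get_local) and 0x0F (set_local) are the ones from the v8-format trace above, and everything else is illustrative.

```python
GET_LOCAL = 0x0E  # opcode values taken from the v8 trace above
SET_LOCAL = 0x0F

def split_local_indexes(body):
    """Return (main_stream, index_section) for one function.

    Local variable index immediates are removed from the main stream and
    appended, in order, to a section emitted at the end of the function.
    The consumer pops them back in the same order, so no extra signaling
    is needed.
    """
    main, indexes = bytearray(), bytearray()
    for opcode, immediate in body:
        main.append(opcode)
        if opcode in (GET_LOCAL, SET_LOCAL):
            indexes.append(immediate)   # moved to the function-end section
        elif immediate is not None:
            main.append(immediate)      # all other immediates stay inline
    return bytes(main), bytes(indexes)
```

The encoded function is then the main stream followed by the index section, so LZ77-style matchers see identical main streams for structurally identical code even when the local indexes differ.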

Just applying this strategy to all local variable indexes seems to be a good 10% win in compressed size (and neutral on the uncompressed size), but there are good hints that it could do better if used selectively, with the producer emitting separate opcodes when the local variable references are to be popped from the function-end section. There are hints that doing something similar for integer constants would also help, but it is not a win on my test code when applied to all uses. This might also be viewed as similar to simple macros, with the compression algorithm doing the expansion by copying and the variation popped from a separate section by the consumer.

It also improved the literal coding efficiency a little on my test example. Brotli supports context switching and dynamic contexts based on the last two bytes, and separating these indexes might have helped the coding efficiency - still a lot to explore there though.

Still pondering how to search for useful patterns on the producer side to optimize this, probably hashing lots of combinations, but the burden on the consumer might not be very high. Food for thought.

kg commented 8 years ago

Partitioning data into sections by type is part of the JSZap paper and my existing binary compression research. We'll be doing it.

ghost commented 8 years ago

Cool, thanks for the JSZap reference. Here's the link for others: http://research.microsoft.com/en-us/projects/jszap/ It makes a good technical case for the AST, which would be very relevant for the rationale unless there are IP issues - and since a decision to use an AST has already been made, can we add a link to papers in the rationale? I saw some XML work that separates data too.

Perhaps the difference in what I am seeing is that making it an option for the producer to separate the indexes might help even more. In the example above some indexes vary but some do not, so perhaps the producer could separate only those that vary between copies. There is also an i8.const use in the above example that need not change, yet in other sequences it is not uncommon to see a constant change break up the copy sequence, and blindly separating all these constants degrades the compression ratio on my example code.

Hand-written macros do not generally parameterize all literals and variables, just those that need to vary, and this common sense might translate into better compression too. It might just need some clever code to search for useful reusable sequences and for how best to parameterize them, and this would be a burden on the producer. Admittedly, keeping some of this data inline might harm the coding efficiency - something to explore. An upstream translator with macro support might even be able to exploit explicit macros by separating just the variable indexes that are macro parameters.
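
As a rough illustration of that search, a producer might hash fixed-length opcode n-grams with the immediates masked out, then check which immediate positions actually vary between occurrences; those are the "macro parameters" worth moving out of line. This sketch reuses the (opcode, immediate) representation assumed earlier, and the window size and threshold are arbitrary.

```python
from collections import defaultdict

def frequent_shapes(body, n=8, min_count=4):
    """Map each frequent opcode n-gram to a per-slot does-it-vary list."""
    shapes = defaultdict(list)
    for i in range(len(body) - n + 1):
        window = body[i:i + n]
        shape = tuple(op for op, _ in window)            # immediates masked
        shapes[shape].append(tuple(imm for _, imm in window))
    result = {}
    for shape, occurrences in shapes.items():
        if len(occurrences) >= min_count:
            # A slot whose immediates differ between occurrences is a
            # candidate for out-of-line emission.
            result[shape] = [len(set(col)) > 1 for col in zip(*occurrences)]
    return result
```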

The other point being suggested here is to separate them per-function, so that consumers can still work per-function without waiting for the entire binary file. Is this a decision we can make now and land in the design documents?

A key assumption is that a custom AST compressor is not being used, rather that the uncompressed AST is being optimized to work better with existing web compressors, and Brotli seems the best. Have any decisions been made here, or are there working assumptions?

kg commented 8 years ago

Separating some values but not others means we now need a flag on every value to determine whether it is separated. We could do it on a per-function basis at least, because then the cost of the flags is negligible, but there would need to be data to support something that complicated. Compression efficiency is much better when you just have a bunch of fully homogeneous streams of values. It also becomes harder to reason about things when there is no particular logic to which stream a value is in.

As far as streaming decode goes, the way to handle that is to split each stream into chunks and interleave them. Then you get most of the compression advantage from splitting things into streams without paying a signaling cost or complicating the decoder too much. The known size up front for the chunks makes things much simpler to deal with.
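
A minimal sketch of that chunked interleaving, with an assumed (stream id, length, payload) framing that is purely illustrative:

```python
import struct

CHUNK = 4096  # assumed chunk size

def interleave(streams):
    """Round-robin the streams into one file as (id, length, payload) chunks."""
    out, offsets = bytearray(), [0] * len(streams)
    while any(off < len(s) for off, s in zip(offsets, streams)):
        for sid, s in enumerate(streams):
            chunk = s[offsets[sid]:offsets[sid] + CHUNK]
            if chunk:
                out += struct.pack("<BI", sid, len(chunk)) + chunk
                offsets[sid] += len(chunk)
    return bytes(out)

def deinterleave(data, nstreams):
    """A streaming decoder can route each chunk as it arrives, since the
    chunk size is known up front from the length prefix."""
    streams, pos = [bytearray() for _ in range(nstreams)], 0
    while pos < len(data):
        sid, length = struct.unpack_from("<BI", data, pos)
        pos += 5
        streams[sid] += data[pos:pos + length]
        pos += length
    return [bytes(s) for s in streams]
```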

There are no firm decisions on what stream compressor to use, but it's likely (in my opinion) that common use will be either LZHAM or Brotli, as they are the two best-performing choices and are easy to integrate. Which stream compressor you use is largely outside the spec, though. I suspect many people may just use gzip. I suppose it's possible that the fully compressed layer could specify a standard stream compressor - it's still unresolved how those bytes actually get into the native decoder, and at one point @lukewagner wanted everything to go through ES Module Loaders.

sunfishcode commented 8 years ago

For context, this technique is called "split-stream" compression, e.g. here.

ghost commented 8 years ago

Brotli specification: https://tools.ietf.org/html/draft-alakuijala-brotli-07

Brotli outline: https://lists.w3.org/Archives/Public/public-webfonts-wg/2013Oct/att-0008/Brotli_Compression_Algorithm_-_outline_of_a_specification.pdf

Firefox Brotli support: https://bugzilla.mozilla.org/show_bug.cgi?id=366559

@sunfishcode Thank you for the reference. Note that separating the data into separate streams might not be essential. Two factors stand out: increasing repetition, and coding efficiency.

Just separating the local variable indexes seems to significantly increase repetition in the stream and this benefit does not depend entirely on separate streams.

Just focusing on Brotli: Huffman codes are used to efficiently encode various categories of symbols, and importantly many different codes are supported in the one stream based on the context, and switching contexts seems very lightweight. There is also dynamic context switching of literal codes based on the last two decoded bytes - also pondering whether the opcode numbering could be optimized to better exploit this.

Practically I see a good improvement just from moving the variable indexes to the end of functions, but have not yet found other categories whose separation improves the compression ratio.

jfbastien commented 8 years ago

Regarding Brotli, also see https://github.com/WebAssembly/design/pull/206

kg commented 8 years ago

> Practically I see a good improvement just from moving the variable indexes to the end of functions, but have not yet found other categories whose separation improves the compression ratio.

The question isn't whether it helps (it does); the point is more that doing these things as one-offs is an unnecessarily expensive way of optimizing file size and decode performance. If we do it right from the beginning, every part of the file format will benefit. Otherwise we end up incrementally changing the underlying binary representation over and over, moving one thing into a stream and then another. And some encoders won't split one thing out or another, so decoders grow a huge amount of complexity over time.

> Just focusing on Brotli: Huffman codes are used to efficiently encode various categories of symbols, and importantly many different codes are supported in the one stream based on the context, and switching contexts seems very lightweight. There is also dynamic context switching of literal codes based on the last two decoded bytes - also pondering whether the opcode numbering could be optimized to better exploit this.

Context modeling is a big part of what makes LZHAM and Brotli outperform gzip. If the 'context' of our data changes constantly by being interleaved on a per-opcode or per-function basis, that hinders the effectiveness of context modeling (and naturally, compressors without it won't have a good time either).

In my brief discussions with some of the Brotli people so far, they strongly encouraged us to split streams and make some other smaller decisions that will improve the effectiveness of Brotli and similar encoders.

ghost commented 8 years ago

@kg It would help to be able to reproduce any results.

Let's see, but perhaps a range of elements should have the option of being emitted out-of-line (ool), and the burden might not be great. What are you proposing to put in separate streams?

We already need an opcode table, and extending it to also flag which arguments are to be moved ool is a small burden. I am experimenting with an opcode table that allows opcode variations with immediate constants and labels, which avoids a layer of opcodes, with good results too, and it would be a small matter to add another flag.
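
For illustration, such a table entry might carry one out-of-line flag per immediate argument; the field names here are assumptions, and the opcode values are the ones from the trace above.

```python
from dataclasses import dataclass

@dataclass
class OpcodeEntry:
    name: str
    immediates: tuple    # immediate argument kinds, e.g. ("local_index",)
    out_of_line: tuple   # parallel flags: pop this argument from the
                         # function-end section instead of reading it inline

TABLE = {
    0x0E: OpcodeEntry("get_local", ("local_index",), (True,)),
    0x0F: OpcodeEntry("set_local", ("local_index",), (True,)),
    0x09: OpcodeEntry("i8.const",  ("value",),       (False,)),
}
```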

When you separate data into separate streams, does each have a separate decoder instance, all needing to run in parallel? If so then measuring the resource requirements might be important.

If you are seeing different results then it might be due to using separate streams with more total resources, and I would be interested to explore and reproduce this.

I would be interested to hear input from the Brotli developers. I see some issues with the dynamic context modelling not fitting well with a tree structure - the last two bytes might not be very relevant when the tree depth has stepped; rather, the parent opcode byte might be a useful predictor, etc. Also, the dynamic context is limited to 6 bits of the last two bytes - we need to make sure that it can see variation in these bits where it usefully affects the code differences.

Thankfully, these do seem largely technical matters that can be settled on technical merit.

kg commented 8 years ago

> @kg It would help to be able to reproduce any results.

All the necessary code and configuration is in https://github.com/WebAssembly/js-astcompressor-prototype . It parses and compresses JS parse trees (we didn't have sexprs then) but you can modify it to compress other trees if you want.

> Let's see, but perhaps a range of elements should have the option of being emitted out-of-line (ool), and the burden might not be great. What are you proposing to put in separate streams?

Everything :-)

> We already need an opcode table, and extending it to also flag which arguments are to be moved ool is a small burden. I am experimenting with an opcode table that allows opcode variations with immediate constants and labels, which avoids a layer of opcodes, with good results too, and it would be a small matter to add another flag.

We're already considering multiple other flags in practice. Every flag we add halves the number of one-byte opcodes available, and in practice that is a pretty big sacrifice. Alternatively you reserve a single opcode as a flag to access a space of 'special' opcodes or encodings, but now all of those encodings cost an extra byte or two.

Every flag or trick we add to optimize a specific thing increases the complexity of every encoder, decoder, and tool in the toolchain. It's important for the design to be general.

> When you separate data into separate streams, does each have a separate decoder instance, all needing to run in parallel? If so then measuring the resource requirements might be important.

Stream splitting is a very simple thing you do at your basic I/O layer. If you think about how you'd write a decoder for the v8-native format, there are some really obvious seams where you can introduce the ability to split streams. A naive decoder just reads bytes out of the file one or two at a time; to stream-split, you just maintain a small table of offsets into the file, one per stream. There are a few general ways to do this that work well - for example, splitting I32, I64, F32, F64, and I8 immediates into their own streams; splitting opcode immediates into their own streams; and splitting opcodes themselves into a stream.
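
A minimal sketch of that offset-table idea; the stream names and layout are assumptions for illustration only:

```python
class SplitStreamReader:
    """Naive split-stream decoder: one cursor per logical stream."""
    def __init__(self, data, offsets):
        # offsets: e.g. {"opcodes": 0, "i32": 14880, "f64": 22100, ...},
        # byte positions into `data` where each stream begins.
        self.data = data
        self.cursor = dict(offsets)

    def read(self, stream, n=1):
        pos = self.cursor[stream]
        self.cursor[stream] = pos + n
        return self.data[pos:pos + n]

# When the opcode stream yields i32.const, the decoder reads the immediate
# from wherever the i32 immediates live:
#   op = reader.read("opcodes")
#   if op == I32_CONST:
#       imm = reader.read("i32", 4)
```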

If you look at the section-oriented design of the format as-is, we're already doing a basic form of stream splitting there, with function calls referring back to entries in the function table. The point is that if you generalize your approach, all of it is automatic and unambiguous: if 'i32.const' reads an i32 immediate, you read it from wherever the i32 immediates are.

It is possible to achieve more compression gains by being fancier about your stream splitting, though. It's possible to generalize that if you build more information into the schema so that decoders know how to pick the appropriate stream to read things from. I'd prefer to avoid anything like that lest we end up designing another Collada :-)

ghost commented 8 years ago

@kg I've made a small start at reproducing your results with the v8 binary format, and it seems to support your finding that putting data in separate streams gives the best compression. Will explore some realistic side files. These results are for the small zlib benchmark.

asm.js polyfill-prototype-1: uncompressed 56339, Brotli 30153

Standard V8 encoding: uncompressed 97297, Brotli 31984

Local variable indexes and i8.const values emitted at the end of each function, in the same stream: uncompressed 97297, Brotli 29557

Local variable indexes and i8.const values emitted at the end of the file, in the same stream: uncompressed 97297, Brotli 28857

Separate streams (files) for local variable indexes and i8.const values:
  main file: uncompressed 70104, Brotli 14403
  locals: uncompressed 19696, Brotli 10895
  u8: uncompressed 7497, Brotli 3464
  totals: uncompressed 97297, Brotli 28762

The code to reproduce this is now at https://github.com/JSStats/wll
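
For anyone wanting to replicate the comparison, here is a minimal sketch of the measurement itself, assuming the brotli Python bindings; the stream contents would come from a transform like the one sketched earlier.

```python
import brotli

def compare(streams):
    """Compressed size of the combined stream vs. the sum of the
    separately compressed streams."""
    combined = len(brotli.compress(b"".join(streams)))
    split = sum(len(brotli.compress(s)) for s in streams)
    return combined, split

# main, locals_, u8 = ...   # e.g. the three files measured above
# combined, split = compare([main, locals_, u8])
```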