Layer 1 encoding proposal

kg commented 6 years ago

I have a loose proposal for a layer 1 encoding. The basic idea is to split the current sequential stream-of-structs encoding into a set of smaller data streams - integer constants, function indexes, opcodes, etc. Doing this dramatically increases the compression ratio when compressing with weaker codecs like gzip (and is a slight size reduction for stronger codecs like brotli). This also creates opportunities for parallel decoding and better icache/dcache occupancy while decoding (in particular for varints).

I spent a couple days prototyping this a while ago and got it to the point where it can round-trip the Tanks Unity demo and one of the UE4 demos (with the minor caveat that you can't fully 100% round-trip wasm modules without using the same encoder, because encoders like binaryen seem to insert filler bytes and do other weird stuff to simplify the encoding process).

In my simple prototype (1.5k lines of C# to decode & encode wasm) it takes around 2 seconds to convert the UE4 demo (38.5mb) to my proposed format, and 1.4 seconds to convert it back to the .wasm format.

I believe my proposed format is compatible with streaming compilation and can trivially be integrated into existing decoders - you essentially just maintain separate read pointers for each data type, and when reading from layer 0 files all of those pointers are the same (and increment in sync). This works because the overall ordering of the data is not altered, the data is just sliced up and redistributed. Because this format is compatible with streaming that means you can also layer gzip or brotli transport compression on top of it without having to manually buffer it into memory before decompressing or compiling.

The current proposal produces respectable gains and I think there are ways to improve the file size further - I only tried a couple dozen things. Choosing what to split and how to encode it can have a pretty significant impact - for example splitting the 'flags' and 'offset' elements of the memory immediate into their own streams makes the post-gzip file slightly larger, and I'm not totally sure why that is. The bright side is that you can experimentally verify whether a change is a file size improvement very easily. I suspect some conversations with the designers of codecs like Brotli might lead to ideas on how to improve this further.

I'm including a basic illustration of the proposal below along with data from my tests.

msaw illustration

msaw graph

jfbastien commented 6 years ago

Very cool!

I don't think round-trip needs to be a goal here, especially for padding purpose. It's easier for most tools to use extra padding on varint but I don't see why a compressor would want to keep that. Are there other things that get modified in a round-trip? There's plenty of things we can reorder, but IIUC your tool doesn't have sufficient semantic understanding to do this, right?

When you split things into segments, where do those boundaries fall? Would it make sense to always have them fall at function boundaries, and pack together functions up to a certain size? My intuition is that the decoded output wants to then hand off to the compiler, so having breaks at function boundaries helps keep the pipeline simple. Even reorder function ordering if we wanted to help parallel compilation?

My guess it that the function definitions, followed by data sections, are the biggest things. Does that match what you get? It seems like each section type probably wants its own type of compression, and the most format-aware stuff will come from the function definitions.

Where do people think the best place to discuss this would be? It seems like a short intro in a CG video call would be a good start, and might help get compression-interested people to get on board with a feature.

kg commented 6 years ago

I don't think round-trip needs to be a goal here, especially for padding purpose. It's easier for most tools to use extra padding on varint but I don't see why a compressor would want to keep that. Are there other things that get modified in a round-trip?

Right, round-trip is mostly for testing purposes, since it lets me verify that everything is passing through correctly. It also makes sense to have a way to down-convert from layer 1 to layer 0 when working with older layer-0-only software, I think, so having a standalone implementation makes sense. I assume eventually you'd just handle all this in relevant parts of the existing tooling.

From what I saw round-trips were mostly modifying the encoding of varints, since there are multiple ways to represent them and binaryen seems to reserve extra space. I did find the UE4 demo loses like 400kb of stray bytes but I'm not sure what section they're from - probably a custom one, since it's not easy to verify that I implemented custom sections correctly.

There's plenty of things we can reorder, but IIUC your tool doesn't have sufficient semantic understanding to do this, right?

The prototype has basic semantic understanding of wasm (definitions of all the opcodes and their operand types, etc) but probably not enough for advanced optimizations. I did test some basic reorderings possible with this information but they had no positive impact. It's possible intelligent reordering would allow you to make the most common function/type indices smaller, this was the case in some of my older compression prototypes.

When you split things into segments, where do those boundaries fall? Would it make sense to always have them fall at function boundaries, and pack together functions up to a certain size? My intuition is that the decoded output wants to then hand off to the compiler, so having breaks at function boundaries helps keep the pipeline simple. Even reorder function ordering if we wanted to help parallel compilation?

For segmentation the approach I used in the prototype is to attempt to split segments every 512 functions when encoding the function table. Segment splitting only occurs if the current segment is over a size threshold, so you end up with a stream of reasonably-sized segments that aren't necessarily at consistent intervals. You might want to split segments for the other tables, but the function table is the most important one. Splitting between function bodies seems like the right approach because then the process of decoding a function will never span segments, so you can just bump pointers when reading. The segments act as a good unit if you want to compile in parallel for sure.

My guess it that the function definitions, followed by data sections, are the biggest things. Does that match what you get? It seems like each section type probably wants its own type of compression, and the most format-aware stuff will come from the function definitions.

The code and data sections are the biggest part of the test modules I used, yeah. With the exception of the UE4 test, which for some reason got uploaded to the net with a 9mb names section. Not sure why they didn't strip that.

UE4Game-HTML5-Shipping.wasm ...
Type: Wrote 8110 byte(s) (from 9401b of wasm)
Import: Wrote 12550 byte(s) (from 12546b of wasm)
Function: Wrote 113798 byte(s) (from 113794b of wasm)
Global: Wrote 43 byte(s) (from 51b of wasm)
Export: Wrote 15661 byte(s) (from 15658b of wasm)
Element: Wrote 376334 byte(s) (from 376331b of wasm)
Code: Wrote 23455629 byte(s) (from 23891164b of wasm)
Data: Wrote 6241551 byte(s) (from 6249699b of wasm)
name: Wrote 8841706 byte(s) (from 8841692b of wasm)
-19: Wrote 7 byte(s) (from 111b of wasm)
Elapsed: 02065.00ms
UE4Game-HTML5-Shipping.msaw

You can definitely come up with a unique approach to encoding and compressing each section. For most of the sections I just broke the table entries down into primitive types which produces a few streams that compress a bit better:

public struct export_entry {
    public string field;
    public external_kind kind;
    public uint index;
}

// ...

public export_entryEncoder (ModuleEncoder moduleEncoder) : base(moduleEncoder) {
    fields = GetStream("field");
    kinds = GetStream("external_kind");
    indices = GetStream("table_index");
}

public override void Encode (ref export_entry value) {
    Builder.Write(value.field, fields);
    kinds.Write((byte)value.kind);
    Builder.Write(value.index, indices);
}

In many cases you could compress those primitive streams better - you could potentially bitpack them or use huffman or something.

The code section is where most of the format-aware code is, since the opcode determines how many values are being read and what types they are. The opcode/expression specific stuff is about 1k lines of C# for the full parse-wasm/encode/decode implementation.

WebAssembly / design

Layer 1 encoding proposal #1180