Investigate TreeRePair tree compression

Yoric commented 6 years ago

See http://www.eti.uni-siegen.de/ti/veroeffentlichungen/12-repair.pdf

Yoric commented 6 years ago

According to https://arxiv.org/pdf/1506.04499v1.pdf, TreeRePair seems to have the best tree compression around.

dominiccooney commented 6 years ago

The graphs on p. 10 and 11 seem to indicate that TreeRePair produces smaller output.

Yoric commented 6 years ago

You're right, I meant TreeRePair, rather than RePair. Fixing title and comment.

Yoric commented 6 years ago

This is being explored in branch treerepair.

Yoric commented 6 years ago

I have an experimental version, which gives us the following results:

While the format does reduce the file size before gzip compression, it's still not competitive.

dominiccooney commented 6 years ago

Not sure if this is useful, but I've done some analysis of how effective Brotli is on top of TreeRePair. Some knowledge of Brotli is assumed for brevity but please ask if something is unclear:

Methodology:

binjs-ref at d94ef3c08e70c5f87419fa9c03e34238491ef242
source is fb_bench rsrc.php/v3iufH3/yb/l/en_US/C4ZhzGqwQ96.js
cargo run --bin binjs_encode -- -i $THAT --format=trp -o trp-out
Compress the original source, and trp-out, with Brotli (admittedly an old binary) -q 11 -w 20.
Consider the whole file. (I can do analyses on parts of the file but in order to do that it would be super useful to have the encoder dump offsets into the uncompressed stream delineating sections.)
Note, the tool I used was designed for ASCII text, so the following says "char" but read "byte".

Uncompressed binast is smaller (640K versus 1.44M) but Brotli compresses it much less effectively (0.53x versus 0.15x) so the resulting compressed size is larger (339K versus 218K.)

In particular, binast/trp produces lots of literal symbol insertion and few copies. By source 59% of the binast chars are copied versus 95% of the JavaScript source. By compressed output, 63% of the bits in the .binjs.br file are the coded symbols alone (that is, not including the Huffman tables or symbol insertion lengths etc.)

The Huffman tables for literal character insertions indicate they're high entropy; in contrast the JavaScript source tables are around 4-5 bits/char:

Literal code 0: 28678 chars, 203145 bits, 7.0837 bits/char
Literal code 1: 39827 chars, 284134 bits, 7.1342 bits/char
Literal code 2: 20653 chars, 148846 bits, 7.2070 bits/char
Literal code 3: 2962 chars, 17744 bits, 5.9905 bits/char
Literal code 4: 10147 chars, 65507 bits, 6.4558 bits/char
Literal code 5: 34969 chars, 256076 bits, 7.3229 bits/char
Literal code 6: 21594 chars, 152188 bits, 7.0477 bits/char
Literal code 7: 21909 chars, 157201 bits, 7.1752 bits/char
Literal code 8: 11255 chars, 82166 bits, 7.3004 bits/char
Literal code 9: 39297 chars, 287906 bits, 7.3264 bits/char
Literal code 10: 1982 chars, 14024 bits, 7.0757 bits/char
Literal code 11: 1793 chars, 9705 bits, 5.4127 bits/char
Literal code 12: 614 chars, 3425 bits, 5.5782 bits/char
Literal code 13: 290 chars, 314 bits, 1.0828 bits/char
Literal code 14: 389 chars, 1624 bits, 4.1748 bits/char
Literal code 15: 2426 chars, 14868 bits, 6.1286 bits/char
Literal code 16: 801 chars, 5425 bits, 6.7728 bits/char
Literal code 17: 3981 chars, 23340 bits, 5.8628 bits/char
Literal code 18: 653 chars, 3292 bits, 5.0413 bits/char
Literal code 19: 82 chars, 82 bits, 1.0000 bits/char
Literal code 20: 13 chars, 13 bits, 1.0000 bits/char

Lastly there is a preponderance of nulls bloating n-grams up to 4 bytes; for example the null,null,null 3-gram occupies 12K and occurs 15,517 times (although there may be overcounting here because of overlap.) Note how frequently nulls pop up in n-grams by size but not by occurrence:

3-grams                                                                          
Size (bytes)  Count  Top 25 by size:                                             
   11830.36   15517  ␀␀␀                                                         
    1072.23    1593  get                                                         
     998.05    1984  ent                                                         
     971.89    1625  ion                                                         
     774.78    1259  tio                                                         
     758.46    1012  ␀␀g                                                         
     704.80     590  ␋␀␀                                                         
     692.36     570  ␐␀␀                                                         
     691.82     936  ␀ge                                                         
     684.33     589  ␌␀␀                                                         
     680.51    1725  ost                                                         
     659.29     564  ␎␀␀                                                         
     651.73     560  ␍␀␀                                                         
     649.03     589  ␏␀␀                                                         
     630.07    1581  ted                                                         
     624.28     476  ␉␀␀                                                         
     618.66     473  ␊␀␀                                                         
     604.97     534  ␅␀␀                                                         
     590.12     802  ing                                                         
     587.26     916  ␀␀A                                                         
     585.10     562  ␑␀␀                                                         
     570.87     529  ␒␀␀                                                         
     561.03    1218  tor                                                         
     542.20     556  ␀␀s                                                         
     536.05     678  age                                                         
Size (bytes)  Count  Top 25 by count:                                            
   11830.36   15517  ␀␀␀                                                         
     325.09    3905  ␄␄␄                                                         
     998.05    1984  ent                                                         
     680.51    1725  ost                                                         
     971.89    1625  ion                                                         
    1072.23    1593  get                                                         
     630.07    1581  ted                                                         
     454.15    1371  oos                                                         
     449.48    1334  ste                                                         
     774.78    1259  tio                                                         
     561.03    1218  tor                                                         
     388.23    1215  pon                                                         
     383.97    1147  omp                                                         
     404.64    1138  one                                                         
     361.54    1128  Boo                                                         
     425.46    1090  ␀␀$                                                         
     326.01    1089  mpo                                                         
     299.90    1052  nen                                                         
     317.65    1036  Com                                                         
     758.46    1012  ␀␀g                                                         
     284.14     996  edC                                                         
     285.32     994  dCo                                                         
     691.82     936  ␀ge                                                         
     587.26     916  ␀␀A                                                         
     473.40     834  ati

That Brotli compresses binast/trp less effectively is unsurprising; trp is itself a compression format. Designing trp as a complementary compressor based on tree-based patterns to Brotli's sequence-based ones can close the size gap.

I'm still familiarizing myself with the encoded format and TreeRePair, but here are some ideas based on these observations:

Use varint encoding, variable-sized structs, etc. to reduce the prevalence of null bytes.
Reduce the entropy of the Huffman tables. For example:
- If AST nodes are numbered, then make that context dependent with an LRU buffer or something.
- Use offsets from an implied point which depends on context.
- Slightly wacky but consider mapping small value ranges into ASCII with frequencies which match the frequencies of the string table; for example if there's some range of values 0-4, instead of encoding them as bytes 0x0 .. 0x4 map them to 'e', 'o', 't', '_' ..

Yoric commented 6 years ago

Thanks!

Yoric commented 6 years ago

A few quick notes:

to get more details about what the implementation of trp is doing, you may prefix your command-line call with RUST_LOG=repair;
the version you're using has hardcoded a number of strategies, I'll try and make them customizable today or tomorrow.

I can do analyses on parts of the file but in order to do that it would be super useful to have the encoder dump offsets into the uncompressed stream delineating sections.

This version inlines the dictionary, i.e. the first instance of an unknown symbol is immediately followed by its definition, so there are no useful offsets. I'll try and move the dictionary (back) into the header in the next version.

Use varint encoding, variable-sized structs, etc. to reduce the prevalence of null bytes.

As far as I can tell, we're using varint encoding everywhere (except for floats).
I have no clue why we have so many null bytes, I'll investigate. It might be because of the hardcoded rank 2 limit on lists, which forces me to split and re-split lists. Or it might be because of the dumb encoding of user-level numbers as 64-bit floats.
What do you mean by variable-sized structs? Prefixed by their length?

If AST nodes are numbered, then make that context dependent with an LRU buffer or something.

Good idea. AST nodes themselves are not numbered, but AST labels, including digrams (our unit of substitution) are. I'll try and do that in the next version.

Use offsets from an implied point which depends on context.

For the moment, the only offsets are relative (and they're pretty much between 0 and 6 most of the time). I haven't implemented lazy nodes yet, so that might change.

Slightly wacky but consider mapping small value ranges into ASCII with frequencies which match the frequencies of the string table; for example if there's some range of values 0-4, instead of encoding them as bytes 0x0 .. 0x4 map them to 'e', 'o', 't', '_' ..

That is deliciously evil. I'll definitely work on that :)

Yoric commented 6 years ago

As a side-note, I've been testing the official implementation of TreeRePair, by first converting samples to XML. Unfortunately, the official implementation segfaults on me :/

Yoric commented 6 years ago

the version you're using has hardcoded a number of strategies, I'll try and make them customizable today or tomorrow

Ok, I have just updated a version that supports a new flag trp-rank. Having a small-ish trp-rank should increase the initial size of the tree, but also the ability to find similarities. The authors of the paper recommend a rank of 4.

Yoric commented 6 years ago

Switching to LRU numbering scheme

Yoric commented 6 years ago

Added a new flag numbering, which we may set to mru or frequency.

Yoric commented 6 years ago

(I can do analyses on parts of the file but in order to do that it would be super useful to have the encoder dump offsets into the uncompressed stream delineating sections.)

I just added a new logger. Set environment variable RUST_LOG=repair-io to have it dump the location of everything as it writes it.

Yoric commented 6 years ago

Flag numbering may now also be set to parent to use parent-local numbering. This actually increases the size of files quite a lot :/

dominiccooney commented 6 years ago

I'm starting to do some similar analyses on these now. I was imagining something different for "MRU", more like the explicit labeler but reserving 0..n (probably with pretty small n) to mean "x items ago" and just shifting the explicit labels along to start at n+1, ...

dominiccooney commented 6 years ago

I played with trp-rank and there's exponentially diminishing returns and compression time increases. I see the counter productiveness of MRU but I haven't dug into explaining it.

I tried doing repair-io for a detailed breakdown by type but got panics and haven't dug into it.

In the meantime I took a look at multistream compression, admittedly with some local hacks, and the caveat that I'm just looking at that one big Facebook file as an example.

Here's some suggestions:

Most (~60%) of the compressed file is in strings and the references to them, so it's worth focusing some energy there.
Sorting the string table is good for Brotli copies and that is worth more than anything we gain by having frequent string refs be smaller. Multistream with a sorted string table has 83% of the source bytes come from copies and they're "close" (eyeballing it, about 20% of the copies use zero extra distance bits.)
We possibly could get more copies by moving the string table up, closer to the Brotli dictionary, because only an additional 2% source bytes came from dictionary copies. Although I haven't measured if that's because the entries are absent or merely expensive.
We can possibly have our cake and eat it too by putting the top n% (small n) strings first to give them low numbered IDs and sort the rest. (This should be under the control of the encoder and not the format.)
Possibly there's something to do with string lengths, too--maybe they should be out of line so the string copies get closer together, and maybe encoded as deltas. Alternatively we could bucket strings by length and just have count,length,count*lengthstringswithoutseparators,count,length,... etc.
Strings with similar prefixes tend to be mentioned together a bunch, so while MRU doesn't work well (the strings are different) the IDs are close together if the string table is sorted. I want to try an encoding for string references which devotes some bits (2?) to switching between an MRU delta and literals. Combined with the putting frequent strings first (out of order) maybe something like:

Flags	Meaning
00	6-bit two's complement delta of MRU 0
01	6-bit two's complement delta of MRU 1
10	6-bit literal
11	Multi-byte literal, 6 l.o. bits follow then DWARF encoding in the subsequent bits

We'd need to make the varnums signed, ie have the encoder skip all high bits/decoder sign extend the h.o. bit.

Depending on how the frequency idea works out maybe code 10 should be devoted to a delta of MRU2, or a multi-byte delta, I just need to measure it.

I should measure TRP with a sorted string table; it's possible TRP with a sorted string table and some ideas from multistream will get us to parity.

Yoric commented 6 years ago

In the meantime I took a look at multistream compression, admittedly with some local hacks, and the caveat that I'm just looking at that one big Facebook file as an example.

Yeah, I'm currently focusing on multistream. Well, to be fair, I'm preparing a second version of repair that I hope should be able to work nicely with multistream e.g. to diminish the number of string references.

By the way, have you noticed that argument --export-sections exports each stream to a different file?

Most (~60%) of the compressed file is in strings and the references to them, so it's worth focusing some energy there.

Yep. For string themselves, I'll see if a Burrows-Wheeler Transform can improve things. If it does, I'll explore a variant of BWT that works by blocks, so as not to hurt the ability to perform streaming decompression (also, maybe streaming compression, later). Also, I have recently rewritten the management of variable names (which are scoped, making things simpler) from other strings. On my sample, references to variables names are compressed to 38% of their size, and I believe we can do better with a good predictor. I'm thinking that there may be also something to do with property names, which are also fairly predictable.

Sorting the string table is good for Brotli copies and that is worth more than anything we gain by having frequent string refs be smaller. Multistream with a sorted string table has 83% of the source bytes come from copies and they're "close" (eyeballing it, about 20% of the copies use zero extra distance bits.)

Sorting how? By alphanumerical order, by length, something else?

We possibly could get more copies by moving the string table up, closer to the Brotli dictionary, because only an additional 2% source bytes came from dictionary copies. Although I haven't measured if that's because the entries are absent or merely expensive.

I'm not sure I understand this one.

What you're suggesting is about compressing everything with a single brotli dictionary, right?

While I have not documented this yet, my idea with multistream is to eventually make this a streaming decompression format by introducing packets. Each packet would contain a number of bytes from the primary stream (the "tree" stream, compressed with its own dictionary, and flushed), as well as the bytes from the other streams needed to parse that primary stream – each stream being compressed with its own dictionary.

We can possibly have our cake and eat it too by putting the top n% (small n) strings first to give them low numbered IDs and sort the rest. (This should be under the control of the encoder and not the format.)

Agreed.

Possibly there's something to do with string lengths, too--maybe they should be out of line so the string copies get closer together, and maybe encoded as deltas. Alternatively we could bucket strings by length and just have count,length,count*lengthstringswithoutseparators,count,length,... etc.

That's for the start of the string dictionary, right? I haven't checked how much space that takes.

Strings with similar prefixes tend to be mentioned together a bunch, so while MRU doesn't work well (the strings are different) the IDs are close together if the string table is sorted. I want to try an encoding for string references which devotes some bits (2?) to switching between an MRU delta and literals.

Interesting.

I should measure TRP with a sorted string table; it's possible TRP with a sorted string table and some ideas from multistream will get us to parity.

Are you familiar enough with the code to do this quickly? I won't be able to help overmuch, as I'll be on training next week.

dominiccooney commented 6 years ago

OK, I've sketched some ideas in the commits here. I'm new to Rust and have no idea what I'm doing ;)

Using compare_compression -c identity --dictionary header --format multistream on that 1.44MB file, here are the size ratios of BinAST+Brotli : source+Brotli:

What	Size Ratio
Baseline (7923aae588af4408ae46ecadeede888c175a5cd6)	1.357
Sort the string table, implicit labels	1.259
+MRU delta encoding	~1.15
+Frequent strings in slot 0..31, and these skip MRU	1.142

Big caveat with this is I haven't implemented a decoder (is there a working multistream decoder? How do I run it?) and I'm just looking at that one big file and might have overfit it.

I think MRU deltas might be useful for other id streams. Last week I was TreeRePair blow up ratios of 1.23; I think if we should try TreeRePair with an out-of-line string table like this.

Some replies to the earlier thread, sorry for the slow reply:

I'm preparing a second version of repair that I hope should be able to work nicely with multistream e.g. to diminish the number of string references.

Super! Please give me pointers when there's something to try.

By the way, have you noticed that argument --export-sections exports each stream to a different file?

Yup! I'm finding them quite handy.

Sorting the string table is good for Brotli copies ...

Sorting how? By alphanumerical order, by length, something else?

Yes, in alphanumerical order.

What you're suggesting is about compressing everything with a single brotli dictionary, right?

Background: Brotli has a dictionary baked into it. References to that dictionary are actually out-of-bounds references to the decompressed content. (There's a rolling window over the decompressed content, and incidentally everyone with large static resources sets that window to ~1MB in order to get copies.) Distances cost roughly log number of bits.

The idea: If the content comes towards the start of the file, you can quickly go out of bounds and refer to the static dictionary with a short distance.

The Brotli team are working on an update which lets applications use a custom dictionary, however I doubt the way they are planning to do the integration will pan out for static resources. In either case the idea is similar.

dominiccooney commented 6 years ago

Brief update, I've tried implementing Burrows-Wheeler (commit) but it seems counterproductive for Brotli. With the sorted string table + BWT the size ratio regresses from 1.142 → 1.209. Guess there's just not enough entropy to go around ;)

binast / binjs-ref

Investigate TreeRePair tree compression #126