According to https://arxiv.org/pdf/1506.04499v1.pdf, TreeRePair seems to have the best tree compression around.
The graphs on p. 10 and 11 seem to indicate that TreeRePair produces smaller output.
You're right, I meant TreeRePair, rather than RePair. Fixing title and comment.
This is being explored in branch treerepair.
I have an experimental version, which gives us the following results:
While the format does reduce the file size before gzip compression, it's still not competitive.
Not sure if this is useful, but I've done some analysis of how effective Brotli is on top of TreeRePair. Some knowledge of Brotli is assumed for brevity but please ask if something is unclear:
Methodology:
cargo run --bin binjs_encode -- -i $THAT --format=trp -o trp-out
Uncompressed binast is smaller (640K versus 1.44M) but Brotli compresses it much less effectively (0.53x versus 0.15x) so the resulting compressed size is larger (339K versus 218K.)
In particular, binast/trp produces lots of literal symbol insertion and few copies. By source 59% of the binast chars are copied versus 95% of the JavaScript source. By compressed output, 63% of the bits in the .binjs.br file are the coded symbols alone (that is, not including the Huffman tables or symbol insertion lengths etc.)
The Huffman tables for literal character insertions indicate they're high entropy; in contrast the JavaScript source tables are around 4-5 bits/char:
Literal code 0: 28678 chars, 203145 bits, 7.0837 bits/char
Literal code 1: 39827 chars, 284134 bits, 7.1342 bits/char
Literal code 2: 20653 chars, 148846 bits, 7.2070 bits/char
Literal code 3: 2962 chars, 17744 bits, 5.9905 bits/char
Literal code 4: 10147 chars, 65507 bits, 6.4558 bits/char
Literal code 5: 34969 chars, 256076 bits, 7.3229 bits/char
Literal code 6: 21594 chars, 152188 bits, 7.0477 bits/char
Literal code 7: 21909 chars, 157201 bits, 7.1752 bits/char
Literal code 8: 11255 chars, 82166 bits, 7.3004 bits/char
Literal code 9: 39297 chars, 287906 bits, 7.3264 bits/char
Literal code 10: 1982 chars, 14024 bits, 7.0757 bits/char
Literal code 11: 1793 chars, 9705 bits, 5.4127 bits/char
Literal code 12: 614 chars, 3425 bits, 5.5782 bits/char
Literal code 13: 290 chars, 314 bits, 1.0828 bits/char
Literal code 14: 389 chars, 1624 bits, 4.1748 bits/char
Literal code 15: 2426 chars, 14868 bits, 6.1286 bits/char
Literal code 16: 801 chars, 5425 bits, 6.7728 bits/char
Literal code 17: 3981 chars, 23340 bits, 5.8628 bits/char
Literal code 18: 653 chars, 3292 bits, 5.0413 bits/char
Literal code 19: 82 chars, 82 bits, 1.0000 bits/char
Literal code 20: 13 chars, 13 bits, 1.0000 bits/char
Lastly there is a preponderance of nulls bloating n-grams up to 4 bytes; for example the null,null,null 3-gram occupies 12K and occurs 15,517 times (although there may be overcounting here because of overlap.) Note how frequently nulls pop up in n-grams by size but not by occurrence:
3-grams
Top 25 by size:

Size (bytes) | Count | 3-gram |
---|---|---|
11830.36 | 15517 | ␀␀␀ |
1072.23 | 1593 | get |
998.05 | 1984 | ent |
971.89 | 1625 | ion |
774.78 | 1259 | tio |
758.46 | 1012 | ␀␀g |
704.80 | 590 | ␋␀␀ |
692.36 | 570 | ␐␀␀ |
691.82 | 936 | ␀ge |
684.33 | 589 | ␌␀␀ |
680.51 | 1725 | ost |
659.29 | 564 | ␎␀␀ |
651.73 | 560 | ␍␀␀ |
649.03 | 589 | ␏␀␀ |
630.07 | 1581 | ted |
624.28 | 476 | ␉␀␀ |
618.66 | 473 | ␊␀␀ |
604.97 | 534 | ␅␀␀ |
590.12 | 802 | ing |
587.26 | 916 | ␀␀A |
585.10 | 562 | ␑␀␀ |
570.87 | 529 | ␒␀␀ |
561.03 | 1218 | tor |
542.20 | 556 | ␀␀s |
536.05 | 678 | age |
Top 25 by count:

Size (bytes) | Count | 3-gram |
---|---|---|
11830.36 | 15517 | ␀␀␀ |
325.09 | 3905 | ␄␄␄ |
998.05 | 1984 | ent |
680.51 | 1725 | ost |
971.89 | 1625 | ion |
1072.23 | 1593 | get |
630.07 | 1581 | ted |
454.15 | 1371 | oos |
449.48 | 1334 | ste |
774.78 | 1259 | tio |
561.03 | 1218 | tor |
388.23 | 1215 | pon |
383.97 | 1147 | omp |
404.64 | 1138 | one |
361.54 | 1128 | Boo |
425.46 | 1090 | ␀␀$ |
326.01 | 1089 | mpo |
299.90 | 1052 | nen |
317.65 | 1036 | Com |
758.46 | 1012 | ␀␀g |
284.14 | 996 | edC |
285.32 | 994 | dCo |
691.82 | 936 | ␀ge |
587.26 | 916 | ␀␀A |
473.40 | 834 | ati |
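(For reference, a minimal sketch of the kind of sliding-window count behind the occurrence column above; overlapping occurrences are counted, which is the overcounting caveat mentioned earlier. The "Size (bytes)" column comes from attributing Brotli's coded bits back to the source bytes, which this sketch does not attempt to reproduce.)

```rust
use std::collections::HashMap;

// Sketch only: counts occurrences of every 3-gram in the uncompressed stream.
fn count_3grams(data: &[u8]) -> HashMap<[u8; 3], usize> {
    let mut counts = HashMap::new();
    for window in data.windows(3) {
        *counts.entry([window[0], window[1], window[2]]).or_insert(0) += 1;
    }
    counts
}
```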
That Brotli compresses binast/trp less effectively is unsurprising; trp is itself a compression format. Designing trp as a compressor that exploits tree-based patterns, complementary to Brotli's sequence-based ones, could close the size gap.
I'm still familiarizing myself with the encoded format and TreeRePair, but here are some ideas based on these observations:
Thanks!
A few quick notes:
You can set the environment variable `RUST_LOG=repair` to get logging from the encoder.
I can do analyses on parts of the file but in order to do that it would be super useful to have the encoder dump offsets into the uncompressed stream delineating sections.
This version inlines the dictionary, i.e. the first instance of an unknown symbol is immediately followed by its definition, so there are no useful offsets. I'll try and move the dictionary (back) into the header in the next version.
Use varint encoding, variable-sized structs, etc. to reduce the prevalence of null bytes.
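(For illustration, a minimal sketch of what's meant here, assuming a LEB128-style varint; small values then occupy a single non-null byte instead of a zero-padded fixed-width word. This isn't the actual encoder code.)

```rust
// Sketch of a LEB128-style unsigned varint (illustration only, not the binjs encoder).
fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);        // high bit clear: this is the last byte
            break;
        }
        out.push(byte | 0x80);     // high bit set: more bytes follow
    }
}

fn main() {
    let mut out = Vec::new();
    write_varint(5, &mut out);     // one byte instead of e.g. `05 00 00 00`
    write_varint(300, &mut out);   // two bytes
    assert_eq!(out, vec![0x05, 0xAC, 0x02]);
}
```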
If AST nodes are numbered, then make that context dependent with an LRU buffer or something.
Good idea. AST nodes themselves are not numbered, but AST labels, including digrams (our unit of substitution) are. I'll try and do that in the next version.
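For illustration, a move-to-front numbering of label IDs could look something like the sketch below (hypothetical; not the encoder's actual data structures). Recently used labels come out as small indices, which the byte coder can keep short.

```rust
// Result of recoding one label reference.
enum Code {
    Recent(usize), // index into the recency list; 0 = most recently used label
    Fresh(u32),    // first occurrence of this label; emit it literally
}

struct MoveToFront {
    order: Vec<u32>, // labels, most recently used first
}

impl MoveToFront {
    fn new() -> Self {
        MoveToFront { order: Vec::new() }
    }

    fn encode(&mut self, label: u32) -> Code {
        match self.order.iter().position(|&l| l == label) {
            Some(i) => {
                let l = self.order.remove(i);
                self.order.insert(0, l);
                Code::Recent(i)
            }
            None => {
                self.order.insert(0, label);
                Code::Fresh(label)
            }
        }
    }
}
```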
Use offsets from an implied point which depends on context.
For the moment, the only offsets are relative (and they're pretty much between 0 and 6 most of the time). I haven't implemented lazy nodes yet, so that might change.
Slightly wacky, but consider mapping small value ranges onto ASCII bytes whose frequencies match those of the string table; for example, if there's some range of values 0-4, instead of encoding them as bytes 0x0 .. 0x4, map them to 'e', 'o', 't', '_' ..
That is deliciously evil. I'll definitely work on that :)
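To make the idea concrete, here's a hypothetical sketch; the remap table below is invented for illustration, and a real encoder would derive it from the byte frequencies actually observed in the string table.

```rust
// Hypothetical remap of a tiny value range onto "text-like" bytes (sketch only).
const REMAP: [u8; 5] = [b'e', b'o', b't', b'_', b'a'];

fn remap_small_value(v: u8) -> u8 {
    // Only values in 0..REMAP.len() are remapped; anything else keeps its encoding.
    REMAP.get(v as usize).copied().unwrap_or(v)
}

fn main() {
    // 0x00..0x04 become 'e', 'o', 't', '_', 'a', so this stream's byte histogram
    // looks like text rather than a run of nulls.
    let values = [0u8, 1, 2, 0, 4, 3];
    let bytes: Vec<u8> = values.iter().map(|&v| remap_small_value(v)).collect();
    assert_eq!(bytes, b"eotea_".to_vec());
}
```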
As a side-note, I've been testing the official implementation of TreeRePair, by first converting samples to XML. Unfortunately, the official implementation segfaults on me :/
The version you're using has a number of strategies hardcoded; I'll try to make them customizable today or tomorrow.
Ok, I have just updated a version that supports a new flag `trp-rank`. Having a small-ish `trp-rank` should increase the initial size of the tree, but also the ability to find similarities. The authors of the paper recommend a rank of 4.
Added a new flag `numbering`, which we may set to `mru` or `frequency`.
(I can do analyses on parts of the file but in order to do that it would be super useful to have the encoder dump offsets into the uncompressed stream delineating sections.)
I just added a new logger. Set environment variable `RUST_LOG=repair-io` to have it dump the location of everything as it writes it.
Flag `numbering` may now also be set to `parent` to use parent-local numbering. This actually increases the size of files quite a lot :/
I'm starting to do some similar analyses on these now. I was imagining something different for "MRU", more like the explicit labeler but reserving 0..n (probably with pretty small n) to mean "x items ago" and just shifting the explicit labels along to start at n+1, ...
I played with `trp-rank`; there are exponentially diminishing returns while compression time increases. I can see that MRU is counterproductive, but I haven't dug into explaining why.
I tried `repair-io` for a detailed breakdown by type but got panics and haven't dug into it.
In the meantime I took a look at multistream compression, admittedly with some local hacks, and the caveat that I'm just looking at that one big Facebook file as an example.
Here are some suggestions:
Flags | Meaning |
---|---|
00 | 6-bit two's complement delta of MRU 0 |
01 | 6-bit two's complement delta of MRU 1 |
10 | 6-bit literal |
11 | Multi-byte literal: 6 low-order bits follow, then DWARF-style (LEB128) encoding in the subsequent bits |
We'd need to make the varnums signed, i.e. have the encoder drop the redundant high bits and the decoder sign-extend from the high-order bit.
Depending on how the frequency idea works out maybe code 10 should be devoted to a delta of MRU2, or a multi-byte delta, I just need to measure it.
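For concreteness, here's a sketch of an encoder for the flag scheme in the table above (hypothetical; the MRU update policy and the exact treatment of codes 10/11 are precisely the things still to be measured):

```rust
// Each reference starts with one byte: the two high bits select the mode,
// the six low bits carry the payload; mode 11 continues with LEB128-style bytes.
fn encode_ref(id: u32, mru: &mut [u32; 2], out: &mut Vec<u8>) {
    let fits6 = |d: i64| (-32..32).contains(&d);
    let d0 = id as i64 - mru[0] as i64;
    let d1 = id as i64 - mru[1] as i64;
    if fits6(d0) {
        out.push((0b00u8 << 6) | (d0 as u8 & 0x3f));      // 00: delta of MRU 0
    } else if fits6(d1) {
        out.push((0b01u8 << 6) | (d1 as u8 & 0x3f));      // 01: delta of MRU 1
    } else if id < 64 {
        out.push((0b10u8 << 6) | id as u8);               // 10: 6-bit literal
    } else {
        out.push((0b11u8 << 6) | (id & 0x3f) as u8);      // 11: multi-byte literal
        let mut rest = id >> 6;                           // remaining bits, LEB128-style
        loop {
            let byte = (rest & 0x7f) as u8;
            rest >>= 7;
            if rest == 0 { out.push(byte); break; }
            out.push(byte | 0x80);
        }
    }
    // Naive MRU update; the real policy is one of the things to measure.
    mru[1] = mru[0];
    mru[0] = id;
}
```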
I should measure TRP with a sorted string table; it's possible TRP with a sorted string table and some ideas from multistream will get us to parity.
In the meantime I took a look at multistream compression, admittedly with some local hacks, and the caveat that I'm just looking at that one big Facebook file as an example.
Yeah, I'm currently focusing on multistream. Well, to be fair, I'm preparing a second version of repair that I hope will work nicely with multistream, e.g. to diminish the number of string references.
By the way, have you noticed that argument `--export-sections` exports each stream to a different file?
Most (~60%) of the compressed file is in strings and the references to them, so it's worth focusing some energy there.
Yep. For strings themselves, I'll see if a Burrows-Wheeler Transform can improve things. If it does, I'll explore a variant of BWT that works by blocks, so as not to hurt the ability to perform streaming decompression (also, maybe streaming compression, later). Also, I have recently separated the management of variable names (which are scoped, making things simpler) from other strings. On my sample, references to variable names are compressed to 38% of their size, and I believe we can do better with a good predictor. I'm thinking that there may also be something to do with property names, which are also fairly predictable.
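(For reference, the classic whole-buffer BWT is simple enough to sketch below; a block-based variant as described above would run something like this per block, and a real implementation would use suffix sorting rather than this quadratic rotation sort.)

```rust
use std::cmp::Ordering;

// Naive Burrows-Wheeler Transform, illustration only; assumes non-empty input.
// Returns the last column plus the index of the original rotation, needed to invert.
fn bwt(input: &[u8]) -> (Vec<u8>, usize) {
    let n = input.len();
    let mut rotations: Vec<usize> = (0..n).collect();
    // Sort rotation start offsets by comparing the rotated byte sequences.
    rotations.sort_by(|&a, &b| {
        (0..n)
            .map(|i| input[(a + i) % n].cmp(&input[(b + i) % n]))
            .find(|o| o.is_ne())
            .unwrap_or(Ordering::Equal)
    });
    let primary = rotations.iter().position(|&r| r == 0).unwrap();
    let last_column: Vec<u8> = rotations.iter().map(|&r| input[(r + n - 1) % n]).collect();
    (last_column, primary)
}

fn main() {
    let (out, idx) = bwt(b"banana");
    // Equal bytes cluster into runs, which is what should help the later entropy coding.
    assert_eq!(out, b"nnbaaa".to_vec());
    assert_eq!(idx, 3);
}
```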
Sorting the string table is good for Brotli copies and that is worth more than anything we gain by having frequent string refs be smaller. Multistream with a sorted string table has 83% of the source bytes come from copies and they're "close" (eyeballing it, about 20% of the copies use zero extra distance bits.)
Sorting how? By alphanumerical order, by length, something else?
We could possibly get more copies by moving the string table up, closer to the Brotli dictionary, because only an additional 2% of source bytes came from dictionary copies. Although I haven't measured whether that's because the entries are absent or merely expensive.
I'm not sure I understand this one.
What you're suggesting is about compressing everything with a single brotli dictionary, right?
While I have not documented this yet, my idea with multistream is to eventually make this a streaming decompression format by introducing packets. Each packet would contain a number of bytes from the primary stream (the "tree" stream, compressed with its own dictionary, and flushed), as well as the bytes from the other streams needed to parse that primary stream – each stream being compressed with its own dictionary.
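A rough sketch of what such a packet could carry, purely to pin the idea down (the names are invented for illustration; nothing like this is implemented yet):

```rust
// Hypothetical packet layout for the streaming idea described above.
type StreamId = u8;

struct Packet {
    // A flushed slice of the primary ("tree") stream, compressed with its own dictionary.
    tree_bytes: Vec<u8>,
    // For each other stream, the bytes needed to parse this slice of the tree stream,
    // each compressed with that stream's own dictionary.
    side_bytes: Vec<(StreamId, Vec<u8>)>,
}
```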
We can possibly have our cake and eat it too by putting the top n% (small n) strings first, to give them low-numbered IDs, and sorting the rest. (This should be under the control of the encoder and not the format.)
Agreed.
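A sketch of what that encoder-side table layout could look like (hypothetical; `top_n` is a knob the encoder would tune, precisely because it's not part of the format):

```rust
// The most frequent strings come first and get the small IDs; the remainder is
// sorted so neighbouring entries share prefixes and Brotli finds nearby copies.
fn layout_string_table(strings: &[(String, usize)], top_n: usize) -> Vec<String> {
    let mut by_freq: Vec<&(String, usize)> = strings.iter().collect();
    by_freq.sort_by(|a, b| b.1.cmp(&a.1)); // most frequent first

    let (head, tail) = by_freq.split_at(top_n.min(by_freq.len()));
    let mut table: Vec<String> = head.iter().map(|e| e.0.clone()).collect();

    let mut rest: Vec<String> = tail.iter().map(|e| e.0.clone()).collect();
    rest.sort(); // alphanumerical, for copy locality
    table.extend(rest);
    table // the index into this Vec is the string's ID
}
```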
Possibly there's something to do with string lengths, too--maybe they should be out of line so the string copies get closer together, and maybe encoded as deltas. Alternatively we could bucket strings by length and just have count, length, count*length strings without separators, count, length, ... etc.
That's for the start of the string dictionary, right? I haven't checked how much space that takes.
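To pin down the bucketed layout, here's a sketch (hypothetical; a real encoder would presumably use varints rather than the fixed-width u32s used here for readability):

```rust
use std::collections::BTreeMap;

// Per length bucket: a count and a length, then count*length string bytes
// with no separators, so the string copies stay close together.
fn write_buckets(strings: &[&str], out: &mut Vec<u8>) {
    let mut buckets: BTreeMap<usize, Vec<&str>> = BTreeMap::new();
    for &s in strings {
        buckets.entry(s.len()).or_default().push(s);
    }
    for (len, group) in &buckets {
        out.extend_from_slice(&(group.len() as u32).to_le_bytes()); // count
        out.extend_from_slice(&(*len as u32).to_le_bytes());        // length
        for s in group {
            out.extend_from_slice(s.as_bytes());                    // count*length bytes
        }
    }
}
```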
Strings with similar prefixes tend to be mentioned together a bunch, so while MRU doesn't work well (the strings are different) the IDs are close together if the string table is sorted. I want to try an encoding for string references which devotes some bits (2?) to switching between an MRU delta and literals.
Interesting.
I should measure TRP with a sorted string table; it's possible TRP with a sorted string table and some ideas from multistream will get us to parity.
Are you familiar enough with the code to do this quickly? I won't be able to help overmuch, as I'll be on training next week.
OK, I've sketched some ideas in the commits here. I'm new to Rust and have no idea what I'm doing ;)
Using `compare_compression -c identity --dictionary header --format multistream` on that 1.44MB file, here are the size ratios of BinAST+Brotli : source+Brotli:
What | Size Ratio |
---|---|
Baseline (7923aae588af4408ae46ecadeede888c175a5cd6) | 1.357 |
Sort the string table, implicit labels | 1.259 |
+MRU delta encoding | ~1.15 |
+Frequent strings in slot 0..31, and these skip MRU | 1.142 |
Big caveat with this is I haven't implemented a decoder (is there a working multistream decoder? How do I run it?) and I'm just looking at that one big file and might have overfit it.
I think MRU deltas might be useful for other id streams. Last week I was seeing TreeRePair blow-up ratios of 1.23; I think we should try TreeRePair with an out-of-line string table like this.
Some replies to the earlier thread, sorry for the slow reply:
I'm preparing a second version of repair that I hope should be able to work nicely with multistream e.g. to diminish the number of string references.
Super! Please give me pointers when there's something to try.
By the way, have you noticed that argument --export-sections exports each stream to a different file?
Yup! I'm finding them quite handy.
Sorting the string table is good for Brotli copies ...
Sorting how? By alphanumerical order, by length, something else?
Yes, in alphanumerical order.
What you're suggesting is about compressing everything with a single brotli dictionary, right?
Background: Brotli has a dictionary baked into it. References to that dictionary are actually out-of-bounds references to the decompressed content. (There's a rolling window over the decompressed content, and incidentally everyone with large static resources sets that window to ~1MB in order to get copies.) Distances cost a number of bits roughly logarithmic in the distance.
The idea: If the content comes towards the start of the file, you can quickly go out of bounds and refer to the static dictionary with a short distance.
The Brotli team are working on an update which lets applications use a custom dictionary; however, I doubt the way they are planning to do the integration will pan out for static resources. In either case the idea is similar.
Brief update, I've tried implementing Burrows-Wheeler (commit) but it seems counterproductive for Brotli. With the sorted string table + BWT the size ratio regresses from 1.142 → 1.209. Guess there's just not enough entropy to go around ;)
See http://www.eti.uni-siegen.de/ti/veroeffentlichungen/12-repair.pdf