Yoric opened this issue 5 years ago (status: Open)
Quick test with https://github.com/Yoric/binjs-ref/tree/entropy-0.4-embed and dictionary depth = 4
This is untested code.
File | raw | brotli | size vs master |
---|---|---|---|
js | 43134534 | 8016723 | 1 |
binjs | 10492698 | 10390535 | 1.1541068761235 |
floats.content | 336910 | 110130 | 1 |
floats.prelude | 126094 | 72487 | 1 |
identifier_names.content | 1136755 | 109247 | 0.109511689754136 |
identifier_names.prelude | 82185 | 51915 | 0.98087932435241 |
identifier_names_len.prelude | 25953 | 15604 | 0.987220043021637 |
interface_names.content | 583439 | 187726 | |
interface_names.prelude | 382683 | 125190 | |
interface_names_len.prelude | 20871 | 22379 | |
list_lengths.content | 1985830 | 550497 | 1 |
list_lengths.prelude | 10284 | 12132 | 1 |
main.entropy | 4638825 | 4641116 | 2.66085928528388 |
probabilities.prelude | 9157740 | 627889 | |
probabilities_len.prelude | 208540 | 122010 | |
property_keys.content | 2756485 | 230093 | 0.233311532592108 |
property_keys.prelude | 2899115 | 894790 | 0.885219887793069 |
property_keys_len.prelude | 201946 | 128929 | 0.891989124193136 |
string_enums.content | 28186 | 15093 | |
string_enums.prelude | 11260 | 12777 | |
string_enums_len.prelude | 5448 | 6273 | |
string_literals.content | 5468049 | 153240 | 0.174870421975885 |
string_literals.prelude | 5978408 | 1834340 | 0.928292334607095 |
string_literals_len.prelude | 364302 | 233533 | 0.931780186808495 |
unsigned_longs.content | 449125 | 91089 | 1 |
unsigned_longs.prelude | 4489 | 6337 | 1 |
I'm tracking a bug that greatly increases the amount of data we write to `*.prelude`.
Latest version
File | raw | brotli | size vs master |
---|---|---|---|
js | 43134534 | 8016723 | 1 |
binjs | 8073786 | 8026568 | 0.891534201123702 |
floats.content | 363023 | 154625 | 1.40402251884137 |
floats.prelude | 126094 | 72487 | 1 |
identifier_names.content | 2524124 | 997583 | 1 |
identifier_names.prelude | 86304 | 52927 | 1 |
identifier_names_len.prelude | 26637 | 15806 | 1 |
interface_names.content | 770395 | 254665 | |
interface_names.prelude | 388429 | 126610 | |
interface_names_len.prelude | 21193 | 22707 | |
list_lengths.content | 1986159 | 549604 | 0.998377829488626 |
list_lengths.prelude | 10284 | 12132 | 1 |
main.entropy | 1678455 | 1680352 | 0.963384716465898 |
probabilities.prelude | 946975 | 292452 | |
probabilities_len.prelude | 175284 | 76101 | |
property_keys.content | 1630992 | 986205 | 1 |
property_keys.prelude | 3150015 | 1010811 | 1 |
property_keys_len.prelude | 220936 | 144541 | 1 |
string_enums.content | 35874 | 26162 | |
string_enums.prelude | 11977 | 13416 | |
string_enums_len.prelude | 5691 | 6459 | |
string_literals.content | 2461767 | 1580965 | 1.80412435838623 |
string_literals.prelude | 6205428 | 1976037 | 1 |
string_literals_len.prelude | 380515 | 250631 | 1 |
unsigned_longs.content | 449125 | 96130 | 1.05534147921264 |
unsigned_longs.prelude | 4489 | 6337 | 1 |
We still embed a lot of data that I'm pretty sure we don't need, but we're now within 1% of brotli. Roundtrip testing is still pending.
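For reference, the "within 1%" figure can be checked against the `js` and `binjs` rows of the two tables above. The sketch below is only arithmetic on those numbers; the helper `size_ratio` is a hypothetical name, and since the comment does not say whether the raw or the brotli column of `binjs` is meant, both are computed.

```rust
/// Ratio of a measured size to a baseline size (hypothetical helper,
/// not part of binjs-ref).
fn size_ratio(size: u64, baseline: u64) -> f64 {
    size as f64 / baseline as f64
}

fn main() {
    // Byte counts copied from the tables in this thread.
    let js_brotli: u64 = 8_016_723; // js row, brotli column (both tables)
    let binjs_first_brotli: u64 = 10_390_535; // binjs row, first table
    let binjs_latest_raw: u64 = 8_073_786; // binjs row, "Latest version"
    let binjs_latest_brotli: u64 = 8_026_568; // binjs row, "Latest version"

    println!(
        "first run,  binjs (brotli) / js (brotli): {:.4}",
        size_ratio(binjs_first_brotli, js_brotli)
    );
    println!(
        "latest run, binjs (raw)    / js (brotli): {:.4}",
        size_ratio(binjs_latest_raw, js_brotli)
    );
    println!(
        "latest run, binjs (brotli) / js (brotli): {:.4}",
        size_ratio(binjs_latest_brotli, js_brotli)
    );
}
```

Whichever column is used, the latest run stays below a ratio of 1.01, while the first run was roughly 30% larger than brotli-compressed JavaScript.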
Latest version, depth 1, following as closely as possible the same protocol as https://github.com/binast/binjs-fbssdc/issues/2.
$ cargo run --release --example sample_directory -- --in tests/data/facebook/single/ --sampling 0.2 --depth 1 --follow-symlinks false --min-size 0 --dictionary-threshold 0
binjs/brotli: 1.05
$ cargo run --release --example sample_directory -- --in ~/Downloads/scrap/ --sampling 0.2 --depth 1 --follow-symlinks false --min-size 0 --dictionary-threshold 0
binjs/brotli: 1.03
In #293, we have a mechanism to embed probability tables in the prelude. The results suggest that probability tables don't take that much space.
Could we improve our compression results by giving up on the idea of shared probability tables and instead embedding the probability tables in the file?
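As a rough sanity check of the "not that much space" observation, the sketch below computes the share of the probability streams in the "Latest version" table. It assumes, without confirmation from this thread, that `probabilities.prelude` and `probabilities_len.prelude` are both counted inside the `binjs` total and that the brotli column is the relevant one; `share_of_total` is a hypothetical helper.

```rust
/// Fraction of `total` taken up by the sum of `parts` (hypothetical helper).
fn share_of_total(parts: &[u64], total: u64) -> f64 {
    let sum: u64 = parts.iter().sum();
    sum as f64 / total as f64
}

fn main() {
    // Brotli-column byte counts copied from the "Latest version" table.
    let probabilities_prelude: u64 = 292_452;
    let probabilities_len_prelude: u64 = 76_101;
    let binjs_total: u64 = 8_026_568;

    // Assumption: both streams are part of the binjs total.
    let share = share_of_total(
        &[probabilities_prelude, probabilities_len_prelude],
        binjs_total,
    );
    println!("probability tables: {:.1}% of binjs", share * 100.0);
}
```

Under those assumptions the embedded tables account for a few percent of the file, so embedding would pay off whenever per-file tables shrink the remaining streams by more than that.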