binast / binjs-ref

Reference implementation for the JavaScript Binary AST format
https://binast.github.io/binjs-ref/binjs/index.html

What happens if we embed the probability tables? #399

Open Yoric opened 5 years ago

Yoric commented 5 years ago

In #293, we have a mechanism to embed probability tables in the prelude. The results there seem to indicate that probability tables don't take that much space.

Could we improve our compression results by giving up on the idea of shared probability tables and instead embedding the probability tables in each file?
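To make the trade-off concrete, here is a toy sketch and not binjs-ref code. It compares the ideal entropy-coded size of a file under a shared table, trained on some other corpus, with the size under a table built from the file itself plus the cost of storing that table. The function names, the sample byte strings, the 16 bits per table entry, and the 1/256 fallback probability are all assumptions made up for illustration.

```rust
use std::collections::HashMap;

/// Build a symbol -> probability table from observed symbols.
fn probability_table(symbols: &[u8]) -> HashMap<u8, f64> {
    let mut counts: HashMap<u8, f64> = HashMap::new();
    for &s in symbols {
        *counts.entry(s).or_insert(0.0) += 1.0;
    }
    let total = symbols.len() as f64;
    counts.into_iter().map(|(s, c)| (s, c / total)).collect()
}

/// Ideal entropy-coded size of `symbols` under `table`, in bits.
/// Symbols missing from the table use `fallback` (an assumption of this sketch).
fn coded_bits(symbols: &[u8], table: &HashMap<u8, f64>, fallback: f64) -> f64 {
    symbols
        .iter()
        .map(|s| -table.get(s).copied().unwrap_or(fallback).log2())
        .sum()
}

/// Hypothetical cost of storing a table in the file, at a flat rate per entry.
fn table_bits(table: &HashMap<u8, f64>, bits_per_entry: f64) -> f64 {
    table.len() as f64 * bits_per_entry
}

fn main() {
    let corpus = b"aaaaaaaabbbbccdd"; // what the shared table was trained on
    let file = b"ccccccccddddddab"; // a file with a different distribution

    let shared = probability_table(corpus);
    let embedded = probability_table(file);

    let shared_cost = coded_bits(file, &shared, 1.0 / 256.0);
    let payload = coded_bits(file, &embedded, 1.0 / 256.0);
    let overhead = table_bits(&embedded, 16.0);

    println!("shared table:   {:.1} bits", shared_cost);
    println!("embedded table: {:.1} bits payload + {:.1} bits table", payload, overhead);
}
```

On an input this small the embedded table makes the payload smaller but costs more to store than it saves. Embedding only pays off once the payload savings exceed the table size, which is what the measurements in this issue are meant to find out.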

Yoric commented 5 years ago

Quick test with https://github.com/Yoric/binjs-ref/tree/entropy-0.4-embed and dictionary depth = 4

This is untested code.

| File | raw | brotli | size vs master |
|---|---|---|---|
| js | 43134534 | 8016723 | 1 |
| binjs | 10492698 | 10390535 | 1.1541068761235 |
| | | | |
| floats.content | 336910 | 110130 | 1 |
| floats.prelude | 126094 | 72487 | 1 |
| identifier_names.content | 1136755 | 109247 | 0.109511689754136 |
| identifier_names.prelude | 82185 | 51915 | 0.98087932435241 |
| identifier_names_len.prelude | 25953 | 15604 | 0.987220043021637 |
| interface_names.content | 583439 | 187726 | |
| interface_names.prelude | 382683 | 125190 | |
| interface_names_len.prelude | 20871 | 22379 | |
| list_lengths.content | 1985830 | 550497 | 1 |
| list_lengths.prelude | 10284 | 12132 | 1 |
| main.entropy | 4638825 | 4641116 | 2.66085928528388 |
| probabilities.prelude | 9157740 | 627889 | |
| probabilities_len.prelude | 208540 | 122010 | |
| property_keys.content | 2756485 | 230093 | 0.233311532592108 |
| property_keys.prelude | 2899115 | 894790 | 0.885219887793069 |
| property_keys_len.prelude | 201946 | 128929 | 0.891989124193136 |
| string_enums.content | 28186 | 15093 | |
| string_enums.prelude | 11260 | 12777 | |
| string_enums_len.prelude | 5448 | 6273 | |
| string_literals.content | 5468049 | 153240 | 0.174870421975885 |
| string_literals.prelude | 5978408 | 1834340 | 0.928292334607095 |
| string_literals_len.prelude | 364302 | 233533 | 0.931780186808495 |
| unsigned_longs.content | 449125 | 91089 | 1 |
| unsigned_longs.prelude | 4489 | 6337 | 1 |

Yoric commented 5 years ago

I'm tracking a bug that greatly increases the amount of data we write to *.prelude.

Yoric commented 5 years ago

Latest version

| File | raw | brotli | size vs master |
|---|---|---|---|
| js | 43134534 | 8016723 | 1 |
| binjs | 8073786 | 8026568 | 0.891534201123702 |
| | | | |
| floats.content | 363023 | 154625 | 1.40402251884137 |
| floats.prelude | 126094 | 72487 | 1 |
| identifier_names.content | 2524124 | 997583 | 1 |
| identifier_names.prelude | 86304 | 52927 | 1 |
| identifier_names_len.prelude | 26637 | 15806 | 1 |
| interface_names.content | 770395 | 254665 | |
| interface_names.prelude | 388429 | 126610 | |
| interface_names_len.prelude | 21193 | 22707 | |
| list_lengths.content | 1986159 | 549604 | 0.998377829488626 |
| list_lengths.prelude | 10284 | 12132 | 1 |
| main.entropy | 1678455 | 1680352 | 0.963384716465898 |
| probabilities.prelude | 946975 | 292452 | |
| probabilities_len.prelude | 175284 | 76101 | |
| property_keys.content | 1630992 | 986205 | 1 |
| property_keys.prelude | 3150015 | 1010811 | 1 |
| property_keys_len.prelude | 220936 | 144541 | 1 |
| string_enums.content | 35874 | 26162 | |
| string_enums.prelude | 11977 | 13416 | |
| string_enums_len.prelude | 5691 | 6459 | |
| string_literals.content | 2461767 | 1580965 | 1.80412435838623 |
| string_literals.prelude | 6205428 | 1976037 | 1 |
| string_literals_len.prelude | 380515 | 250631 | 1 |
| unsigned_longs.content | 449125 | 96130 | 1.05534147921264 |
| unsigned_longs.prelude | 4489 | 6337 | 1 |

We still embed a lot of data that I'm pretty sure we don't need, but we're now within 1% of brotli. Roundtrip testing is still pending.
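The "within 1%" figure can be checked against the "Latest version" numbers above. This is a minimal sketch: `size_ratio` is a hypothetical helper, and since the thread does not say which binjs column the claim refers to, both are compared against the brotli-compressed js size.

```rust
/// Size of `candidate` relative to `baseline` (1.0 means equal size).
fn size_ratio(candidate: u64, baseline: u64) -> f64 {
    candidate as f64 / baseline as f64
}

fn main() {
    // Figures copied from the "Latest version" results.
    let js_brotli: u64 = 8_016_723; // js, brotli column
    let binjs_raw: u64 = 8_073_786; // binjs, raw column
    let binjs_brotli: u64 = 8_026_568; // binjs, brotli column

    for (label, size) in [("binjs raw", binjs_raw), ("binjs brotli", binjs_brotli)] {
        let ratio = size_ratio(size, js_brotli);
        println!("{} / js brotli: {:.4} ({:+.2}%)", label, ratio, (ratio - 1.0) * 100.0);
    }
}
```

Either way the result stays under 1%: roughly +0.71% for the raw binjs size and +0.12% for the brotli-compressed one.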

Yoric commented 5 years ago

Latest version, depth 1, following the protocol of https://github.com/binast/binjs-fbssdc/issues/2 as closely as possible.

Facebook sample set

$ cargo run --release --example sample_directory -- --in tests/data/facebook/single/ --sampling 0.2 --depth 1 --follow-symlinks false --min-size 0 --dictionary-threshold 0

binjs/brotli: 1.05

Real js samples

$ cargo run --release --example sample_directory -- --in ~/Downloads/scrap/ --sampling 0.2 --depth 1 --follow-symlinks false --min-size 0 --dictionary-threshold 0

binjs/brotli: 1.03