Yoric opened this issue 5 years ago
Not working on it at the moment.
I should add that the current work on improving content streams compression using Brotli seems pretty encouraging, so we might not need this work. Time will tell.
So, I have resumed work on this. I currently run two passes: one to collect the probabilities and one to use them.
Early results with a context depth of 4 (an arbitrary number) for user-extensible values:
Code at https://github.com/Yoric/binjs-ref/tree/entropy-0.4-context.
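For context, a minimal sketch of what the two-pass approach could look like; the `Context`, `collect_probabilities` and `entropy_encode` names below are purely illustrative, not the actual binjs-ref API:

```rust
use std::collections::HashMap;

/// Hypothetical context key: the kinds of the last few ancestors of the
/// value being encoded, truncated to the chosen depth (4 above).
type Context = Vec<String>;

/// Pass 1: walk the file once and count how often each value shows up
/// in each context.
fn collect_probabilities(
    values: &[(Context, String)],
) -> HashMap<Context, HashMap<String, u32>> {
    let mut table: HashMap<Context, HashMap<String, u32>> = HashMap::new();
    for (context, value) in values {
        *table
            .entry(context.clone())
            .or_default()
            .entry(value.clone())
            .or_insert(0) += 1;
    }
    table
}

/// Pass 2: walk the file again and entropy-code each value against the
/// distribution collected for its context.
fn encode_all(values: &[(Context, String)], table: &HashMap<Context, HashMap<String, u32>>) {
    for (context, value) in values {
        let distribution = &table[context];
        entropy_encode(value, distribution);
    }
}

/// Placeholder for the actual entropy coder.
fn entropy_encode(_value: &str, _distribution: &HashMap<String, u32>) {}
```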
File | raw (bytes) | brotli (bytes) | size vs. master |
---|---|---|---|
js | 43134534 | 8016723 | 1 |
binjs | 10124059 | 10045870 | 1.11582393434436 |
floats.content | 336910 | 110130 | 1 |
floats.prelude | 126094 | 72487 | 1 |
identifier_names.content | 512784 | 307798 | 0.308543750244341 |
identifier_names.prelude | 86304 | 52784 | 0.997298165397623 |
identifier_names_len.prelude | 26637 | 16132 | 1.02062507908389 |
list_lengths.content | 1985830 | 550497 | 1 |
list_lengths.prelude | 10284 | 12132 | 1 |
main.entropy | 4092510 | 4084316 | 2.34163295048724 |
probabilities.prelude | 1567235 | 373366 | |
probabilities_len.prelude | 259162 | 112534 | |
property_keys.content | 795048 | 523319 | 0.530639167313084 |
property_keys.prelude | 3150015 | 999774 | 0.989081044824403 |
property_keys_len.prelude | 220936 | 140985 | 0.975397983963028 |
string_literals.content | 666541 | 298437 | 0.340562543221204 |
string_literals.prelude | 6205428 | 1939120 | 0.981317657513498 |
string_literals_len.prelude | 380515 | 241581 | 0.963891138765755 |
unsigned_longs.content | 449125 | 91089 | 1 |
unsigned_longs.prelude | 4489 | 6337 | 1 |
It decreases `{identifier_names, property_keys, string_literals}.content` considerably, it adds the `probabilities*.prelude` streams (which are reasonably small), but it increases `main.entropy` considerably.
It's not quite clear to me why this changes the prelude size of anything.
The result is negative so far.
I have tested with larger depths.
Interestingly, the size of the dictionary barely increases, while the size of the main stream decreases.
With a depth of 14:
File | raw (bytes) | brotli (bytes) | size vs. master |
---|---|---|---|
js | 43134534 | 8016723 | 1 |
binjs | 10044953 | 9967490 | 1.10711804028303 |
floats.content | 336910 | 110130 | 1 |
floats.prelude | 126094 | 72487 | 1 |
identifier_names.content | 950947 | 544427 | 0.545746068246953 |
identifier_names.prelude | 86304 | 52598 | 0.993783891019706 |
identifier_names_len.prelude | 26637 | 15466 | 0.978489181323548 |
list_lengths.content | 1985830 | 550497 | 1 |
list_lengths.prelude | 10284 | 12132 | 1 |
main.entropy | 3271490 | 3264202 | 1.8714426014653 |
probabilities.prelude | 2274654 | 389747 | |
probabilities_len.prelude | 1044650 | 228258 | |
property_keys.content | 1072149 | 671845 | 0.681242743648633 |
property_keys.prelude | 3150015 | 1002817 | 0.992091498806404 |
property_keys_len.prelude | 220936 | 140865 | 0.974567769698563 |
string_literals.content | 947770 | 507361 | 0.57897697836144 |
string_literals.prelude | 6205428 | 1950923 | 0.987290723807297 |
string_literals_len.prelude | 380515 | 243366 | 0.97101316277715 |
unsigned_longs.content | 449125 | 91089 | 1 |
unsigned_longs.prelude | 4489 | 6337 | 1 |
It's still a regression, but it is getting better.
We know that we cannot simply use context-based entropy predictions for user-extensible values, as there are always going to be values that do not fit in the dictionary. So, for the moment, we give up on entropy coding for them entirely and use brotli-compressed content streams instead.
I believe that there is a way to take the best of both worlds.
### Decompression
The probability manipulation is O(1) as we're only ever adjusting the last two elements.
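One possible reading of that constant-time update, assuming each context keeps a small table of (value, remaining instances) pairs and an exhausted value is dropped by swapping it with the last entry; all names here are mine, not the actual decoder's:

```rust
/// Hypothetical per-context probability table used during decompression.
struct ContextTable {
    /// (value, number of instances still expected in this context).
    entries: Vec<(String, u32)>,
}

impl ContextTable {
    /// After decoding the value at `index`, decrement its remaining count;
    /// once it is exhausted, swap it with the last entry and pop it, so
    /// the removal only ever touches that slot and the last one, keeping
    /// the update O(1).
    fn after_decode(&mut self, index: usize) {
        self.entries[index].1 -= 1;
        if self.entries[index].1 == 0 {
            self.entries.swap_remove(index);
        }
    }
}
```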
### Data structures
We replace the brotli-compressed content streams (one index per instance in the file) with:
Intuitively, if the value appears at least twice in the same context, we should decrease the file size. At this stage, it is unclear how this will translate post-Brotli.
### Relationship with the dictionaries
We can decide that the dictionaries (both the prelude and the built-in one) also store, for each value, the number of instances. This replaces the stream of instance numbers with something less precise, but guaranteed to appear only once per value (rather than once per context).
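A minimal sketch of what such a dictionary entry could look like; field and type names are hypothetical, not the actual binjs-ref structures:

```rust
/// Hypothetical dictionary entry: alongside the value itself, the
/// dictionary records how many instances of the value the file contains.
/// The count is stored once per value, not once per context, which is
/// the loss of precision mentioned above.
struct DictionaryEntry {
    value: String,
    instances: u32,
}

/// Both the prelude dictionary and the built-in dictionary would be
/// plain lists of such entries.
type Dictionary = Vec<DictionaryEntry>;
```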