binast / binjs-ref

Reference implementation for the JavaScript Binary AST format
https://binast.github.io/binjs-ref/binjs/index.html

Mechanism to take advantage of entropy for user-extensible values #293

Open Yoric opened 5 years ago

Yoric commented 5 years ago

We know that we cannot simply use context-based entropy predictions for user-extensible values, since there will always be values that do not appear in the dictionary. So, for the moment, we give up on entropy coding entirely for these values and instead use brotli-compressed content streams.

I believe that there is a way to take the best of both worlds.

Decompression

  1. For non-user-extensible values, proceed as usual.
  2. Whenever we meet a user-extensible slot in context C
    • We have a probability distribution for C, as usual.
    • This probability distribution contains
      • the symbols already encountered so far;
      • an unknown symbol (initial probability: 1).
    • Read the symbol.
    • If its index is that of a symbol encountered so far, we're done.
    • If its index is that of the unknown symbol
      • Read a symbol from the reference stream.
      • If this is the first symbol read for C
        • Read from the probability stream the total number of instances of C, which we use to initialize the probability of the unknown symbol.
      • Otherwise
        • Remove the unknown symbol from the probability distribution.
      • Read a number of instances from the probability stream. (Note that we can probably bucket these instance counts to simplify compression of the probability stream.)
      • Extend the probability distribution with the new known symbol, in last position.
      • Extend the probability distribution with a new unknown symbol, in last position.

The probability manipulation is O(1) as we're only ever adjusting the last two elements.
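The escape-and-extend bookkeeping above can be sketched as follows. This is a minimal illustration, not code from binjs-ref; the type and method names are invented for the example.

```rust
/// Instance counts for one context's distribution: `counts[i]` belongs to
/// known symbol i, and the final entry is always the `unknown` escape
/// symbol (hypothetical layout, sketching the scheme described above).
struct AdaptiveDistribution {
    counts: Vec<u32>,
}

impl AdaptiveDistribution {
    /// Fresh distribution: only the unknown symbol, with probability 1.
    fn new() -> Self {
        AdaptiveDistribution { counts: vec![1] }
    }

    /// Called when the decoder reads the `unknown` escape: the old escape
    /// slot becomes a known symbol with `instances` (read from the
    /// probability stream), and a new escape is appended carrying whatever
    /// probability mass is still unaccounted for. Only the last two
    /// entries change, so the update is O(1).
    fn extend(&mut self, instances: u32, remaining_unknown: u32) {
        let last = self.counts.len() - 1;
        self.counts[last] = instances; // escape slot -> new known symbol
        if remaining_unknown > 0 {
            self.counts.push(remaining_unknown); // fresh escape, last position
        }
    }

    fn total(&self) -> u32 {
        self.counts.iter().sum()
    }
}

fn main() {
    let mut dist = AdaptiveDistribution::new();
    // First escape in this context: say the probability stream announces
    // 10 total instances, and the new symbol accounts for 4 of them.
    dist.extend(4, 6);
    // Second escape: a symbol with 5 instances; 1 unknown instance left.
    dist.extend(5, 1);
    println!("counts = {:?}, total = {}", dist.counts, dist.total());
}
```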

Data structures

We replace the brotli-compressed content streams (one index per instance in the file) with the reference stream and the probability stream described above.

Intuitively, if a value appears at least twice in the same context, this should decrease the file size. At this stage, it is unclear how this will translate post-Brotli.
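The "appears at least twice" intuition can be made concrete with a quick Shannon-cost estimate (illustrative numbers, not measurements): the first occurrence of a value pays for the escape plus its reference-stream entry, but every later occurrence pays only -log2(p) bits for its entropy-coded index, versus at least one full byte per index in a content stream.

```rust
/// Shannon cost, in bits, of a symbol with `count` instances out of `total`.
fn bits(count: u32, total: u32) -> f64 {
    -((count as f64 / total as f64).log2())
}

fn main() {
    // Illustrative context: 64 instances, one value occurring 8 times.
    // Each occurrence after the first costs -log2(8/64) = 3 bits...
    let repeat = bits(8, 64);
    // ...versus >= 8 bits for a content-stream index.
    println!("repeat cost: {} bits vs >= 8 bits per content-stream index", repeat);
}
```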

Relationship with the dictionaries

We can decide that the dictionaries (both the prelude and built-in ones) also store, for each value, a number of instances. This would replace the stream of instance counts with something less precise, but guaranteed to appear only once per value (rather than once per context).
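A sketch of that alternative, with invented names: each dictionary entry carries its file-wide instance count, and a context's distribution is seeded from the entries rather than from a per-context probability stream.

```rust
/// Hypothetical prelude/built-in dictionary entry: the value plus its
/// file-wide instance count (stored once per value, not once per context).
struct DictEntry {
    value: String,
    instances: u32,
}

/// Seed a context's initial counts from dictionary entries. The counts are
/// file-wide, hence less precise than per-context counts, but they cost
/// nothing extra per context.
fn seed_counts(entries: &[DictEntry]) -> Vec<u32> {
    entries.iter().map(|e| e.instances).collect()
}

fn main() {
    let dict = vec![
        DictEntry { value: "foo".to_string(), instances: 12 },
        DictEntry { value: "bar".to_string(), instances: 3 },
    ];
    println!("seeded counts = {:?}", seed_counts(&dict));
}
```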

Yoric commented 5 years ago

Not working on it at the moment.

Yoric commented 5 years ago

I should add that the current work on improving content streams compression using Brotli seems pretty encouraging, so we might not need this work. Time will tell.

Yoric commented 5 years ago

So, I have resumed work on this. I currently run two passes, one to collect the probabilities, one to use them.

Early results with a depth of 4 (an arbitrary choice) for user-extensible values:

Code at https://github.com/Yoric/binjs-ref/tree/entropy-0.4-context.

| File | raw (bytes) | brotli (bytes) | size vs master |
|------|------------:|---------------:|---------------:|
| js | 43134534 | 8016723 | 1 |
| binjs | 10124059 | 10045870 | 1.11582393434436 |
| floats.content | 336910 | 110130 | 1 |
| floats.prelude | 126094 | 72487 | 1 |
| identifier_names.content | 512784 | 307798 | 0.308543750244341 |
| identifier_names.prelude | 86304 | 52784 | 0.997298165397623 |
| identifier_names_len.prelude | 26637 | 16132 | 1.02062507908389 |
| list_lengths.content | 1985830 | 550497 | 1 |
| list_lengths.prelude | 10284 | 12132 | 1 |
| main.entropy | 4092510 | 4084316 | 2.34163295048724 |
| probabilities.prelude | 1567235 | 373366 | |
| probabilities_len.prelude | 259162 | 112534 | |
| property_keys.content | 795048 | 523319 | 0.530639167313084 |
| property_keys.prelude | 3150015 | 999774 | 0.989081044824403 |
| property_keys_len.prelude | 220936 | 140985 | 0.975397983963028 |
| string_literals.content | 666541 | 298437 | 0.340562543221204 |
| string_literals.prelude | 6205428 | 1939120 | 0.981317657513498 |
| string_literals_len.prelude | 380515 | 241581 | 0.963891138765755 |
| unsigned_longs.content | 449125 | 91089 | 1 |
| unsigned_longs.prelude | 4489 | 6337 | 1 |

Interpretation

It greatly decreases {identifier_names, property_keys, string_literals}.content and adds probabilities*.prelude, which are reasonably small, but it greatly increases main.entropy.

It's not quite clear to me why this changes the prelude size of anything.

The result is negative so far.

Yoric commented 5 years ago

I have tested with larger depths.

Interestingly, the size of the dictionary barely increases, while the size of the main stream decreases.

With a depth of 14:

| File | raw (bytes) | brotli (bytes) | size vs master |
|------|------------:|---------------:|---------------:|
| js | 43134534 | 8016723 | 1 |
| binjs | 10044953 | 9967490 | 1.10711804028303 |
| floats.content | 336910 | 110130 | 1 |
| floats.prelude | 126094 | 72487 | 1 |
| identifier_names.content | 950947 | 544427 | 0.545746068246953 |
| identifier_names.prelude | 86304 | 52598 | 0.993783891019706 |
| identifier_names_len.prelude | 26637 | 15466 | 0.978489181323548 |
| list_lengths.content | 1985830 | 550497 | 1 |
| list_lengths.prelude | 10284 | 12132 | 1 |
| main.entropy | 3271490 | 3264202 | 1.8714426014653 |
| probabilities.prelude | 2274654 | 389747 | |
| probabilities_len.prelude | 1044650 | 228258 | |
| property_keys.content | 1072149 | 671845 | 0.681242743648633 |
| property_keys.prelude | 3150015 | 1002817 | 0.992091498806404 |
| property_keys_len.prelude | 220936 | 140865 | 0.974567769698563 |
| string_literals.content | 947770 | 507361 | 0.57897697836144 |
| string_literals.prelude | 6205428 | 1950923 | 0.987290723807297 |
| string_literals_len.prelude | 380515 | 243366 | 0.97101316277715 |
| unsigned_longs.content | 449125 | 91089 | 1 |
| unsigned_longs.prelude | 4489 | 6337 | 1 |

It's still a regression, but it is getting better.