Yoric opened this issue 5 years ago
Not working on it at the moment.
I should add that the current work on improving content streams compression using Brotli seems pretty encouraging, so we might not need this work. Time will tell.
So, I have resumed work on this. I currently run two passes: one to collect the probabilities and one to use them.
Early results with a context depth of 4 (an arbitrary number) for user-extensible values:
Code at https://github.com/Yoric/binjs-ref/tree/entropy-0.4-context.
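For context, a minimal sketch of what the two-pass approach could look like; the `Context`, `collect_probabilities` and `entropy_encode` names below are purely illustrative, not the actual binjs-ref API:

```rust
use std::collections::HashMap;

/// Hypothetical context key: the kinds of the last few ancestors of the
/// value being encoded, truncated to the chosen depth (4 above).
type Context = Vec<String>;

/// Pass 1: walk the file once and count how often each value shows up
/// in each context.
fn collect_probabilities(
    values: &[(Context, String)],
) -> HashMap<Context, HashMap<String, u32>> {
    let mut table: HashMap<Context, HashMap<String, u32>> = HashMap::new();
    for (context, value) in values {
        *table
            .entry(context.clone())
            .or_default()
            .entry(value.clone())
            .or_insert(0) += 1;
    }
    table
}

/// Pass 2: walk the file again and entropy-code each value against the
/// distribution collected for its context.
fn encode_all(values: &[(Context, String)], table: &HashMap<Context, HashMap<String, u32>>) {
    for (context, value) in values {
        let distribution = &table[context];
        entropy_encode(value, distribution);
    }
}

/// Placeholder for the actual entropy coder.
fn entropy_encode(_value: &str, _distribution: &HashMap<String, u32>) {}
```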
File | raw (bytes) | brotli (bytes) | size vs. master |
---|---|---|---|
js | 43134534 | 8016723 | 1 |
binjs | 10124059 | 10045870 | 1.11582393434436 |
floats.content | 336910 | 110130 | 1 |
floats.prelude | 126094 | 72487 | 1 |
identifier_names.content | 512784 | 307798 | 0.308543750244341 |
identifier_names.prelude | 86304 | 52784 | 0.997298165397623 |
identifier_names_len.prelude | 26637 | 16132 | 1.02062507908389 |
list_lengths.content | 1985830 | 550497 | 1 |
list_lengths.prelude | 10284 | 12132 | 1 |
main.entropy | 4092510 | 4084316 | 2.34163295048724 |
probabilities.prelude | 1567235 | 373366 | |
probabilities_len.prelude | 259162 | 112534 | |
property_keys.content | 795048 | 523319 | 0.530639167313084 |
property_keys.prelude | 3150015 | 999774 | 0.989081044824403 |
property_keys_len.prelude | 220936 | 140985 | 0.975397983963028 |
string_literals.content | 666541 | 298437 | 0.340562543221204 |
string_literals.prelude | 6205428 | 1939120 | 0.981317657513498 |
string_literals_len.prelude | 380515 | 241581 | 0.963891138765755 |
unsigned_longs.content | 449125 | 91089 | 1 |
unsigned_longs.prelude | 4489 | 6337 | 1 |
It decreases `{identifier_names, property_keys, string_literals}.content` considerably, it adds the `probabilities*.prelude` streams (which are reasonably small), but it increases `main.entropy` considerably.
It's not quite clear to me why this changes the prelude size of anything.
The result is negative so far.
I have tested with larger depths.
Interestingly, the size of the dictionary barely increases, while the size of the main stream decreases.
With a depth of 14:
File | raw (bytes) | brotli (bytes) | size vs. master |
---|---|---|---|
js | 43134534 | 8016723 | 1 |
binjs | 10044953 | 9967490 | 1.10711804028303 |
floats.content | 336910 | 110130 | 1 |
floats.prelude | 126094 | 72487 | 1 |
identifier_names.content | 950947 | 544427 | 0.545746068246953 |
identifier_names.prelude | 86304 | 52598 | 0.993783891019706 |
identifier_names_len.prelude | 26637 | 15466 | 0.978489181323548 |
list_lengths.content | 1985830 | 550497 | 1 |
list_lengths.prelude | 10284 | 12132 | 1 |
main.entropy | 3271490 | 3264202 | 1.8714426014653 |
probabilities.prelude | 2274654 | 389747 | |
probabilities_len.prelude | 1044650 | 228258 | |
property_keys.content | 1072149 | 671845 | 0.681242743648633 |
property_keys.prelude | 3150015 | 1002817 | 0.992091498806404 |
property_keys_len.prelude | 220936 | 140865 | 0.974567769698563 |
string_literals.content | 947770 | 507361 | 0.57897697836144 |
string_literals.prelude | 6205428 | 1950923 | 0.987290723807297 |
string_literals_len.prelude | 380515 | 243366 | 0.97101316277715 |
unsigned_longs.content | 449125 | 91089 | 1 |
unsigned_longs.prelude | 4489 | 6337 | 1 |
It's still a regression, but it is getting better.
We know that we cannot simply use context-based entropy predictions for user-extensible values, as there are always going to be values that do not fit in the dictionary. So, for the moment, we give up on entropy coding for them entirely and use brotli-compressed content streams instead.
I believe that there is a way to take the best of both worlds.
### Decompression
The probability manipulation is O(1) as we're only ever adjusting the last two elements.
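One possible reading of that constant-time update, assuming each context keeps a small table of (value, remaining instances) pairs and an exhausted value is dropped by swapping it with the last entry; all names here are mine, not the actual decoder's:

```rust
/// Hypothetical per-context probability table used during decompression.
struct ContextTable {
    /// (value, number of instances still expected in this context).
    entries: Vec<(String, u32)>,
}

impl ContextTable {
    /// After decoding the value at `index`, decrement its remaining count;
    /// once it is exhausted, swap it with the last entry and pop it, so
    /// the removal only ever touches that slot and the last one, keeping
    /// the update O(1).
    fn after_decode(&mut self, index: usize) {
        self.entries[index].1 -= 1;
        if self.entries[index].1 == 0 {
            self.entries.swap_remove(index);
        }
    }
}
```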
### Data structures
We replace the brotli-compressed content streams (one index per instance in the file) with:
Intuitively, if the value appears at least twice in the same context, we should decrease the file size. At this stage, it is unclear how this will translate post-Brotli.
### Relationship with the dictionaries
We can decide that the dictionaries (both the prelude and the built-in one) also store, for each value, the number of instances. This replaces the stream of instance numbers with something less precise, but guaranteed to appear only once per value (rather than once per context).
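A minimal sketch of what such a dictionary entry could look like; field and type names are hypothetical, not the actual binjs-ref structures:

```rust
/// Hypothetical dictionary entry: alongside the value itself, the
/// dictionary records how many instances of the value the file contains.
/// The count is stored once per value, not once per context, which is
/// the loss of precision mentioned above.
struct DictionaryEntry {
    value: String,
    instances: u32,
}

/// Both the prelude dictionary and the built-in dictionary would be
/// plain lists of such entries.
type Dictionary = Vec<DictionaryEntry>;
```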