Yoric opened this issue 5 years ago
I did some model development, based on these assumptions:
For now I have been estimating parameters this way: the size of `binjs_encode advanced entropy -i X` etc., with a dictionary built from exactly X (excluding the dictionary size, naturally), divided by the size of `brotli -q 11 -w 20 X`.
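To make that measurement concrete, here is a minimal Python sketch of the ratio. It assumes you have already measured the size of the `binjs_encode` output with the dictionary file excluded (how to obtain that number depends on the output layout, so it is left as an input here); the Brotli side shells out to the real `brotli` CLI with the flags above.

```python
import subprocess

def brotli_size(path: str) -> int:
    """Size in bytes of `brotli -q 11 -w 20` output for `path`,
    compressed to stdout so no .br file is written."""
    result = subprocess.run(
        ["brotli", "-q", "11", "-w", "20", "-c", path],
        capture_output=True, check=True,
    )
    return len(result.stdout)

def compression_ratio(binjs_payload_bytes: int, source_path: str) -> float:
    """Ratio described above: BinAST payload size (dictionary
    excluded) divided by the Brotli-compressed size of the same
    source. A value below 1.0 means we beat Brotli on the payload."""
    return binjs_payload_bytes / brotli_size(source_path)
```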
Some obvious ways to elaborate the model:
Here are some takeaways from even this basic model:
> Focus on producing better compression than Brotli when hitting a dictionary of any size.
I have difficulty parsing this sentence. Does this mean that we shouldn't care about dictionary size yet, just about producing better compression than Brotli?
On the upside, there are a number of levers that we haven't used yet to improve compression.
"Success/defeat" may be very site specific and depend on site-specific factors like repeat visit rate and code churn.
Good point.
Sorry for the slow reply, I missed this earlier.
> > Focus on producing better compression than Brotli when hitting a dictionary of any size.
>
> I have difficulty parsing this sentence. Does this mean that we shouldn't care about dictionary size yet, just about producing better compression than Brotli?
For now, yes.
It would be ideal if dictionaries + data were smaller than Brotli. If that is not possible, the data alone has to be smaller than Brotli; then we have a chance to amortize the extra dictionary cost over time. (Whether that works out depends on how often people repeat visits and how dictionaries age; however, the situation where the data, even without the dictionary, is bigger than Brotli is definitely infeasible.)
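To make the amortization argument concrete, here's a rough back-of-the-envelope sketch. The numbers in it are purely illustrative, not measurements; the break-even point depends on exactly the site-specific factors (repeat visit rate, dictionary churn) mentioned above.

```python
import math

def break_even_visits(dict_bytes: int, binjs_bytes: int, brotli_bytes: int) -> float:
    """Number of visits after which dictionary + BinAST data is
    cheaper than shipping Brotli every time.

    Total transfer after n visits:
        BinAST:  dict_bytes + n * binjs_bytes
        Brotli:  n * brotli_bytes
    Break-even: n = dict_bytes / (brotli_bytes - binjs_bytes).
    """
    if binjs_bytes >= brotli_bytes:
        # Data alone is not smaller than Brotli: the dictionary
        # cost never amortizes (the infeasible case above).
        return math.inf
    return dict_bytes / (brotli_bytes - binjs_bytes)

# Illustrative only: a 100 KB dictionary amortizes after 20 visits
# if each visit saves 5 KB relative to Brotli.
assert break_even_visits(100_000, 45_000, 50_000) == 20.0
```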
We need compression criteria to determine what is good enough. This will help us:
Assigning @dominiccooney, as discussed.