binast / container-format-spec

A repository for working on the specifications of the low-level binary format (very much a Work In Progress)

Binary format for the static prediction tables #7

Open Yoric opened 5 years ago

Yoric commented 5 years ago

Since we want to ship dictionaries, we need to specify their format, too.

I'm pretty sure that we can use efficient entropy coding also for the dictionaries, fwiw.

Yoric commented 5 years ago

I got some data with the latest dictionary builder. We'll need an efficient format to actually ship dictionaries.

Yoric commented 5 years ago

A few things to recall:

In turn, shipping dictionary updates as diffs suggests that we want to ship numbers of occurrences (easy to add to without precision loss) rather than actual probabilities (floating-point values, which are harder to manipulate).
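To illustrate the point above, here is a minimal sketch (names and corpora are invented, not from the spec) showing why occurrence counts merge exactly while probabilities would need lossy re-normalization:

```python
# Hypothetical sketch: merging two sets of occurrence counts is exact
# integer addition; probabilities can be derived on demand afterwards.
from collections import Counter

def merge_counts(a, b):
    """Combine occurrence counts from two corpora without precision loss."""
    merged = Counter(a)
    merged.update(b)
    return dict(merged)

# Illustrative corpora (invented numbers):
corpus_a = {"ExpressionStatement": 120, "IfStatement": 30}
corpus_b = {"ExpressionStatement": 80, "IfStatement": 10, "ForStatement": 5}
merged = merge_counts(corpus_a, corpus_b)

# Probabilities are derived only when needed, from exact counts:
total = sum(merged.values())
probabilities = {k: v / total for k, v in merged.items()}
```

Had the tables stored floating-point probabilities instead, combining two of them would require knowing each table's original corpus size, and repeated merges would accumulate rounding error.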

kannanvijayan-zz commented 5 years ago

Recapping some conversations from IRC here.

Firstly, I feel we should settle on some consistent terminology. "Dictionary" is a very overloaded term and suggests words or character-sequence tables of some sort. I would suggest using "Prediction Tables" to talk about this. They are mappings from various contexts (e.g. Tree-path-suffix, or move-to-front cache of strings) to a table which acts as a predictor that provides a probability distribution for elements showing up at some location in the data stream.
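As a rough data-structure sketch of the terminology above (the layout and context keys here are illustrative, not the spec's actual format), a prediction table maps a context such as a tree-path suffix to occurrence counts for the symbols that may appear there:

```python
# Illustrative sketch: a prediction table maps a context (here, a
# tree-path suffix of (parent kind, field name)) to occurrence counts
# for the symbols that may appear in that context. Counts are invented.
table = {
    ("IfStatement", "test"): {"BinaryExpression": 50, "Identifier": 30},
    ("IfStatement", "consequent"): {"Block": 70, "ExpressionStatement": 10},
}

def distribution(table, context):
    """Derive a probability distribution for a context from its counts."""
    counts = table[context]
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()}
```

The counts act as the predictor: an entropy coder consults `distribution(table, context)` to assign short codes to likely symbols at each position in the stream.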

That said, the key thing that's relevant here is that we have decided that prediction tables will not be shipped in-line with the content, and that we will allow compressed content to reference a prediction table (e.g. with a URL + verification hash). This effectively offloads the problem entirely.

Initially, we can produce a "standard" prediction table derived from a "general corpus" of javascript. This is the default table which tools can use. This table can be made available at a standard "well-known" location (or shipped with the browser, as an implementation detail).

In the general case, an individual binast file may reference a custom prediction table. However, we can expect that these tables will not change very frequently, and will effectively be cached locally for every request made to content that references them (except for the first).

From the browser's side, the logic for decoding an incoming binast file A referencing a table T would be:

  1. Receive binast file A.
  2. Header of binast file A identifies probability table T by URL + content-hash.
  3. Check local cache for T (using content-hash as key) - highly likely to hit. Retrieve T and cache if necessary.
  4. Proceed to decoding binast file A using probability table T.
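The steps above can be sketched as follows. The cache API and function names are hypothetical, and SHA-256 is assumed purely for illustration; only the control flow mirrors the steps:

```python
# Hypothetical sketch of steps 1-3: look up table T by content hash,
# fetching and verifying it only on a cache miss (the first visit).
import hashlib

def load_prediction_table(url, expected_hash, cache, fetch):
    """Return the bytes of table T, using the local cache when possible."""
    # Step 3: check the local cache, keyed by content hash - highly
    # likely to hit on every visit after the first.
    if expected_hash in cache:
        return cache[expected_hash]
    # Cache miss: fetch T and verify it against the header's hash.
    data = fetch(url)
    if hashlib.sha256(data).hexdigest() != expected_hash:
        raise ValueError("prediction table failed verification")
    cache[expected_hash] = data
    return data
```

Keying the cache on the content hash rather than the URL means a table shared by many sites (such as the default table) is stored and verified once, regardless of where it is hosted.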

In this scenario, the "default table" is simply another table that may be referenced. Whether it's shipped with the browser or not is an implementation detail that doesn't matter from a performance perspective.

kannanvijayan-zz commented 5 years ago

Pinging @dominiccooney for discussion alert.

Yoric commented 5 years ago

> From the browser's side, the logic for decoding an incoming binast file A referencing a table T would be:
>
> 1. Receive binast file A.
> 2. Header of binast file A identifies probability table T by URL + content-hash.
> 3. Check local cache for T (using content-hash as key) - highly likely to hit. Retrieve T and cache if necessary.
> 4. Proceed to decoding binast file A using probability table T.
>
> In this scenario, the "default table" is simply another table that may be referenced. Whether it's shipped with the browser or not is an implementation detail that doesn't matter from a performance perspective.

Note that it's a bit more complicated than that, as:

We should move this conversation to the relevant bug, though: https://github.com/binast/container-format-spec/issues/4 . Consequently, we can already imagine that SpiderMonkey will need to expose an API for Necko to inspect A and determine its association with T.
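A sketch of what such an inspection API might look like. Everything here is hypothetical, including the header layout (a magic string followed by newline-terminated URL and hash lines), since the actual container format is exactly what these issues are still specifying:

```python
# Hypothetical sketch: given the raw header bytes of file A, return the
# (url, content_hash) of the referenced table T, so the networking layer
# can start fetching T before full decoding begins. The layout below
# (magic line, then url line, then hash line) is invented for illustration.
def table_reference(header_bytes):
    lines = header_bytes.decode("utf-8").splitlines()
    if len(lines) < 3 or lines[0] != "BINAST":
        raise ValueError("not a recognized binast header")
    url, content_hash = lines[1], lines[2]
    return url, content_hash
```

The point of the sketch is only that the reference to T must be decodable from a small, early portion of A, without parsing the compressed payload.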