binast / container-format-spec

A repository for working on the specifications of the low-level binary format (very much a Work In Progress)

Binary format for the static prediction tables #7

Open Yoric opened 5 years ago

Yoric commented 5 years ago

Since we want to ship dictionaries, we need to specify their format, too.

I'm pretty sure that we can use efficient entropy coding also for the dictionaries, fwiw.

Yoric commented 5 years ago

I got some data with the latest dictionary builder. We'll need an efficient format to actually ship dictionaries.

Yoric commented 5 years ago

A few things to recall:

In turn, shipping dictionary updates as diffs suggests that we want to ship numbers of occurrences (easy to add to without precision loss) rather than actual probabilities (floating-point values, which are harder to manipulate).
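To illustrate the point above, here is a minimal sketch (names and corpora are invented, not from the spec) showing why occurrence counts merge exactly while probabilities would need lossy re-normalization:

```python
# Hypothetical sketch: merging two sets of occurrence counts is exact
# integer addition; probabilities can be derived on demand afterwards.
from collections import Counter

def merge_counts(a, b):
    """Combine occurrence counts from two corpora without precision loss."""
    merged = Counter(a)
    merged.update(b)
    return dict(merged)

# Illustrative corpora (invented numbers):
corpus_a = {"ExpressionStatement": 120, "IfStatement": 30}
corpus_b = {"ExpressionStatement": 80, "IfStatement": 10, "ForStatement": 5}
merged = merge_counts(corpus_a, corpus_b)

# Probabilities are derived only when needed, from exact counts:
total = sum(merged.values())
probabilities = {k: v / total for k, v in merged.items()}
```

Had the tables stored floating-point probabilities instead, combining two of them would require knowing each table's original corpus size, and repeated merges would accumulate rounding error.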

kannanvijayan-zz commented 5 years ago

Recapping some conversations from IRC here.

Firstly, I feel we should settle on some consistent terminology. "Dictionary" is a very overloaded term and suggests words or character-sequence tables of some sort. I would suggest using "Prediction Tables" to talk about this. They are mappings from various contexts (e.g. Tree-path-suffix, or move-to-front cache of strings) to a table which acts as a predictor that provides a probability distribution for elements showing up at some location in the data stream.
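As a rough data-structure sketch of the terminology above (the layout and context keys here are illustrative, not the spec's actual format), a prediction table maps a context such as a tree-path suffix to occurrence counts for the symbols that may appear there:

```python
# Illustrative sketch: a prediction table maps a context (here, a
# tree-path suffix of (parent kind, field name)) to occurrence counts
# for the symbols that may appear in that context. Counts are invented.
table = {
    ("IfStatement", "test"): {"BinaryExpression": 50, "Identifier": 30},
    ("IfStatement", "consequent"): {"Block": 70, "ExpressionStatement": 10},
}

def distribution(table, context):
    """Derive a probability distribution for a context from its counts."""
    counts = table[context]
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()}
```

The counts act as the predictor: an entropy coder consults `distribution(table, context)` to assign short codes to likely symbols at each position in the stream.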

That said, the key thing that's relevant here is that we have decided that prediction tables will not be shipped in-line with the content, and that we will allow compressed content to reference a prediction table (e.g. with a URL + verification hash). This effectively offloads the problem entirely.

Initially, we can produce a "standard" prediction table derived from a "general corpus" of javascript. This is the default table which tools can use. This table can be made available at a standard "well-known" location (or shipped with the browser, as an implementation detail).

In the general case, an individual binast file may reference a custom prediction table. However, we can expect that these tables will not change very frequently, and will effectively be cached locally for every request made to content that references them (except for the first).

From the browser's side, the logic for decoding an incoming binast file A referencing a table T would be:

  1. Receive binast file A.
  2. Header of binast file A identifies probability table T by URL + content-hash.
  3. Check local cache for T (using content-hash as key) - highly likely to hit. Retrieve T and cache if necessary.
  4. Proceed to decoding binast file A using probability table T.
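The steps above can be sketched as follows. The cache API and function names are hypothetical, and SHA-256 is assumed purely for illustration; only the control flow mirrors the steps:

```python
# Hypothetical sketch of steps 1-3: look up table T by content hash,
# fetching and verifying it only on a cache miss (the first visit).
import hashlib

def load_prediction_table(url, expected_hash, cache, fetch):
    """Return the bytes of table T, using the local cache when possible."""
    # Step 3: check the local cache, keyed by content hash - highly
    # likely to hit on every visit after the first.
    if expected_hash in cache:
        return cache[expected_hash]
    # Cache miss: fetch T and verify it against the header's hash.
    data = fetch(url)
    if hashlib.sha256(data).hexdigest() != expected_hash:
        raise ValueError("prediction table failed verification")
    cache[expected_hash] = data
    return data
```

Keying the cache on the content hash rather than the URL means a table shared by many sites (such as the default table) is stored and verified once, regardless of where it is hosted.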

In this scenario, the "default table" is simply another table that may be referenced. Whether it's shipped with the browser or not is an implementation detail that doesn't matter from a performance perspective.

kannanvijayan-zz commented 5 years ago

Pinging @dominiccooney for discussion alert.

Yoric commented 5 years ago

> From the browser's side, the logic for decoding an incoming binast file A referencing a table T would be:
>
> 1. Receive binast file A.
> 2. Header of binast file A identifies probability table T by URL + content-hash.
> 3. Check local cache for T (using content-hash as key) - highly likely to hit. Retrieve T and cache if necessary.
> 4. Proceed to decoding binast file A using probability table T.
>
> In this scenario, the "default table" is simply another table that may be referenced. Whether it's shipped with the browser or not is an implementation detail that doesn't matter from a performance perspective.

Note that it's a bit more complicated than that, as:

We should move this conversation to the relevant bug, though: https://github.com/binast/container-format-spec/issues/4 . Consequently, we can already imagine that SpiderMonkey will need to expose an API for Necko to inspect A and determine its association with T.
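A sketch of what such an inspection API might look like. Everything here is hypothetical, including the header layout (a magic string followed by newline-terminated URL and hash lines), since the actual container format is exactly what these issues are still specifying:

```python
# Hypothetical sketch: given the raw header bytes of file A, return the
# (url, content_hash) of the referenced table T, so the networking layer
# can start fetching T before full decoding begins. The layout below
# (magic line, then url line, then hash line) is invented for illustration.
def table_reference(header_bytes):
    lines = header_bytes.decode("utf-8").splitlines()
    if len(lines) < 3 or lines[0] != "BINAST":
        raise ValueError("not a recognized binast header")
    url, content_hash = lines[1], lines[2]
    return url, content_hash
```

The point of the sketch is only that the reference to T must be decodable from a small, early portion of A, without parsing the compressed payload.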