kannanvijayan-zz opened this issue 6 years ago
For an actual encoder, you should probably fork crates/binjs_io/src/simple.rs.
Adding a little bibliography:
(one of the papers claims that they managed to improve on a 10x compression ratio, but that was not on a binary AST)
On my side, investigating #126.
See also http://www.iaeng.org/publication/IMECS2009/IMECS2009_pp580-585.pdf (aka Lempel Ziv for trees).
I have my (extremely hacky) js code here: https://github.com/kannanvijayan/ast_compress
Overall I just wanted to work from scratch to get an idea of the problem domain, a feel for the kinds of trees we see, and a sense of what encoding "feels" like. I'm using an implicit grammar table, but with untyped fields (so every field value has to self-document its type or optionality status via null values, etc.).
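To make that concrete, the node shape I'm working with is roughly the following (a minimal sketch; names like `FieldValue` are made up here, not what the js code actually uses):

```rust
/// Because fields carry no types from the grammar table, every field value
/// self-describes its type, and `Null` doubles as "absent" for optional fields.
enum FieldValue {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Node(Box<Node>),
    List(Vec<FieldValue>),
}

/// A node is an entry in the implicit grammar table plus its untyped fields.
struct Node {
    kind: String,
    fields: Vec<FieldValue>,
}
```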
My dictionary model is very simple: instead of a single dictionary, I split it up across all possible depths. Each per-depth dictionary uses a move-to-front LRU. Subtrees are added to the cache at their associated depth, on a post-order basis (i.e. we add a subtree after we have generated it). Back-references are produced by searching the dictionary at the "current" encode/decode depth, plus a small window of relative depths around that depth.
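Roughly, the dictionary looks like this (a minimal Rust sketch of the idea, not the actual ast_compress code; `DepthDictionary` and its methods are names I'm making up here):

```rust
use std::collections::VecDeque;

/// One move-to-front LRU list of subtrees per depth of the main tree.
struct DepthDictionary<Subtree: PartialEq> {
    levels: Vec<VecDeque<Subtree>>,
    capacity_per_level: usize,
    /// How many depths above/below the current one to search.
    depth_window: usize,
}

impl<Subtree: PartialEq> DepthDictionary<Subtree> {
    fn new(capacity_per_level: usize, depth_window: usize) -> Self {
        DepthDictionary { levels: Vec::new(), capacity_per_level, depth_window }
    }

    /// Post-order insertion: a subtree is added after it has been generated.
    fn insert(&mut self, depth: usize, subtree: Subtree) {
        if self.levels.len() <= depth {
            self.levels.resize_with(depth + 1, || VecDeque::new());
        }
        let level = &mut self.levels[depth];
        level.push_front(subtree);
        if level.len() > self.capacity_per_level {
            level.pop_back();
        }
    }

    /// Search the current depth plus a small window of neighbouring depths;
    /// on a hit, move the entry to the front (move-to-front LRU) and return
    /// (depth, index) as the back-reference.
    fn find(&mut self, depth: usize, needle: &Subtree) -> Option<(usize, usize)> {
        if self.levels.is_empty() {
            return None;
        }
        let lo = depth.saturating_sub(self.depth_window);
        let hi = (depth + self.depth_window).min(self.levels.len() - 1);
        for d in lo..=hi {
            if let Some(i) = self.levels[d].iter().position(|s| s == needle) {
                let hit = self.levels[d].remove(i).unwrap();
                self.levels[d].push_front(hit);
                return Some((d, i));
            }
        }
        None
    }
}
```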
Also, one crude aspect of the implementation is that I treat any difference in field values as causing a cut on the whole subtree (up to the type of the root node of the tree, which can still be referenced).
My results show compression relative to the raw minified JS source, but the post-gzip story is still about 20-30% worse on average.
I also think there are bugs in my cost model: when I expand the dictionary sizes, some files actually generate larger pre-gzip compressed output. That should definitely not happen, and it indicates that my cost model is picking templates that take more space to encode than the optimal choice (and enlarging the caches creates more opportunities for that to happen).
One key result I noticed is that on files with large object literals, my approach improves SIGNIFICANTLY over the reference encoder that Yoric has put up. I still haven't written up the run-length template encoding, and I expect this to give even better results on those structures.
Given that I'm somewhat familiar with the domain now, I'm switching to working directly on @Yoric's rust codebase. I'll implement the run-length template compression first on that and I hope that we can pick up a significant win that way.
Figured I should record this work while I'm doing it.
A number of the spot size optimization suggestions that have come up amount to tree summarizations: noticing that a particular pattern is common in the tree and coming up with a smaller representation for it in the encoding.
Some of these seem to yield benefits in the fractional percent range - in aggregate they may be worthwhile but individually they are quite small.
It seems to me that there's a generalized method for capturing these summarizations, and the approach analogizes well to dictionary-based sequence compression. When dictionary-compressing a sequential stream, we reproduce a prefix of the remaining plaintext by referencing a dictionary entry that matches that prefix.
When generating a tree (an AST in our case) instead of a sequence, we are instead concerned with reproducing some subtree of the remaining normalized tree. The analogue of a "back-reference" in sequential decoding is a "relative back-path" from the node currently being generated. The analogue of the "sequence prefix" in sequential encoding is a "tree prefix": a tree with a series of cut-points which identify where it can be extended.
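Concretely, I mean something like this (a rough Rust sketch; the names are mine, not anything from binjs_io):

```rust
/// A "tree prefix": a tree whose leaves may be cut-points marking where the
/// template can be extended when it is instantiated.
enum TreePrefix {
    Node { kind: String, children: Vec<TreePrefix> },
    Cut,
}

/// A "relative back-path": walk `up` levels from the node currently being
/// generated, then follow `down` child indices to reach the referenced subtree.
struct RelativeBackPath {
    up: usize,
    down: Vec<usize>,
}
```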
So the idea is this: the encoder starts encoding the tree, keeping track of the most recently encountered subtrees at various depths of the main tree, as well as an LRU set of templates (initially empty). When encoding a new node, it searches those subtrees and templates for a match. If none is found, it encodes the node plainly and moves on. If a matching template is found, it emits a reference to the template, followed by encodings of the tree extensions at the template's cut-points. If a matching subtree is found, it emits a reference to that subtree, a sequence of cut-points, and then the encodings of the tree extensions at those cuts; the template produced by that cut is added to the LRU set of templates.
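In rough, self-contained Rust, the encode step looks something like this (a toy sketch: `Node`, `Output`, `TemplateSet` and `SubtreeDict` are made-up placeholder types, and matching by root kind plus cutting at every child stand in for the real matching and cut computation):

```rust
#[derive(Clone)]
struct Node {
    kind: String,
    children: Vec<Node>,
}

/// Output tokens; strings just for illustration.
struct Output(Vec<String>);
/// LRU set of templates; cut points are marked with the placeholder kind "<cut>".
struct TemplateSet(Vec<Node>);
/// Recently generated subtrees, available for back-references.
struct SubtreeDict(Vec<Node>);

/// Does `node` match `template`, treating "<cut>" as a wildcard?
fn matches(node: &Node, template: &Node) -> bool {
    template.kind == "<cut>"
        || (node.kind == template.kind
            && node.children.len() == template.children.len()
            && node.children.iter().zip(&template.children).all(|(n, t)| matches(n, t)))
}

/// Collect the subtrees of `node` that sit under the template's cut points.
fn cut_fills<'a>(node: &'a Node, template: &Node, out: &mut Vec<&'a Node>) {
    if template.kind == "<cut>" {
        out.push(node);
    } else {
        for (n, t) in node.children.iter().zip(&template.children) {
            cut_fills(n, t, out);
        }
    }
}

fn encode(node: &Node, out: &mut Output, templates: &mut TemplateSet, recent: &mut SubtreeDict) {
    if let Some(i) = templates.0.iter().position(|t| matches(node, t)) {
        // Template hit: emit the template reference, then encode the
        // subtrees that fill its cut points.
        out.0.push(format!("template-ref {}", i));
        let template = templates.0[i].clone();
        let mut fills = Vec::new();
        cut_fills(node, &template, &mut fills);
        for fill in fills {
            encode(fill, out, templates, recent);
        }
    } else if let Some(i) = recent.0.iter().position(|s| s.kind == node.kind) {
        // Subtree hit (crudely: same root kind): emit a back-reference and the
        // cut points (crudely: every child is a cut), encode the extensions at
        // those cuts, and remember the resulting template for later reuse.
        let cuts: Vec<usize> = (0..node.children.len()).collect();
        out.0.push(format!("subtree-ref {} cuts {:?}", i, cuts));
        for child in &node.children {
            encode(child, out, templates, recent);
        }
        templates.0.push(Node {
            kind: node.kind.clone(),
            children: node.children.iter()
                .map(|_| Node { kind: "<cut>".into(), children: vec![] })
                .collect(),
        });
    } else {
        // No match: encode the node plainly and recurse into its children.
        out.0.push(format!("node {}", node.kind));
        for child in &node.children {
            encode(child, out, templates, recent);
        }
    }
    // Post-order: the finished subtree becomes available for back-references.
    recent.0.push(node.clone());
}
```

The important part is the order of the checks and what gets emitted in each case, not the details of matching or cut selection.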
There are two reasons I think this approach has value.
The first is domain-specific locality. A tree-based locality model is closer to the domain's data structures than a sequence-based model. Consider the general case of method lists: say, an object literal or class that declares a series of named methods, each with its own (possibly long) function body.
In these cases, the AST subtrees associated with the object-declaration-entry, named-identifier, and function-declaration nodes are "close together" in a tree sense (siblings), but far apart sequentially because of the code in their bodies.
A tree-based locality model would allow an encoder to succinctly identify highly similar sibling nodes that may be far apart in the text (and thus harder for a sequential encoder to correlate).
The second reason I think this is valuable is the data transform done by the encoder: it turns subtree similarity into prefix similarity. Consider two subtrees, say
Add(Shift(Id(x), 5), Id(y))
Add(Shift(Id(a), 5), Id(b))
When encoded, assuming that both of these subtrees reference the same template of the form
Add(Shift(Id(V), 5), Id(W))
this takes the text-level differences that were embedded within a contextual template (the x/y and a/b leaves buried inside each expression) and extrudes the common portions to the beginning: each subtree is emitted as a reference to the shared template followed by its own cut fills, roughly
TemplateRef(T) Id(x) Id(y)
TemplateRef(T) Id(a) Id(b)
Intuitively, this should help a sequential compressor pick up those tree commonalities: all of the differences between subtrees move towards the end (with tree cut-points identifying which subtrees are being relocated) and the similarities move towards the beginning, which a sequence-prefix based compressor should hopefully handle better.
I have a good chunk of a prototype implementation of this - just enough to measure how often tree-based redundancy is actually captured by the approach. If that looks promising, then we can write a real encoder for it and see how it does on real numbers.
I don't anticipate this work being very heavy, and I hope to have at least some analysis output ready by the end of this week.