binast / binjs-ref

Reference implementation for the JavaScript Binary AST format
https://binast.github.io/binjs-ref/binjs/index.html

Embedded sublanguages #296

Open Yoric opened 5 years ago

Yoric commented 5 years ago

JavaScript has subsets that we can probably encode more efficiently, either with different dictionaries or different strategies.

For instance, if we know from earlier analysis that an expression is a JSON-style literal declaration, we could switch, for this subtree, to a dictionary that assigns non-zero probabilities only to literals. This would make compression more efficient on that subtree.

Similarly, if we know from earlier analysis that an expression is an arithmetic expression, we could switch to an arithmetic-expression dictionary.
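A minimal Rust sketch of the kind of analysis this would require. All types and names here are hypothetical illustrations, not binjs-ref APIs: the idea is just to classify an expression subtree as literal-only, arithmetic-only, or general, so the encoder could pick a specialized dictionary/probability table for it.

```rust
/// Hypothetical, simplified expression AST (not the binjs-ref AST).
#[derive(Debug, Clone)]
enum Expr {
    NumberLiteral(f64),
    StringLiteral(String),
    BoolLiteral(bool),
    Null,
    Array(Vec<Expr>),
    Object(Vec<(String, Expr)>),
    Binary { op: char, left: Box<Expr>, right: Box<Expr> },
    Identifier(String),
    Call { callee: Box<Expr>, args: Vec<Expr> },
}

/// Sublanguage a subtree belongs to; each one would map to its own
/// probability table in the entropy coder.
#[derive(Debug, PartialEq)]
enum Sublanguage {
    /// JSON-style literal: only literals, plus arrays/objects of literals.
    Literal,
    /// Arithmetic over numbers and identifiers.
    Arithmetic,
    /// Anything else: fall back to the general table.
    General,
}

fn classify(expr: &Expr) -> Sublanguage {
    if is_literal(expr) {
        Sublanguage::Literal
    } else if is_arithmetic(expr) {
        Sublanguage::Arithmetic
    } else {
        Sublanguage::General
    }
}

/// True if the subtree only contains JSON-style literal constructs.
fn is_literal(expr: &Expr) -> bool {
    match expr {
        Expr::NumberLiteral(_) | Expr::StringLiteral(_) | Expr::BoolLiteral(_) | Expr::Null => true,
        Expr::Array(items) => items.iter().all(is_literal),
        Expr::Object(props) => props.iter().all(|(_, v)| is_literal(v)),
        _ => false,
    }
}

/// True if the subtree is a pure arithmetic expression over numbers and names.
fn is_arithmetic(expr: &Expr) -> bool {
    match expr {
        Expr::NumberLiteral(_) | Expr::Identifier(_) => true,
        Expr::Binary { op, left, right } => {
            matches!(*op, '+' | '-' | '*' | '/') && is_arithmetic(left) && is_arithmetic(right)
        }
        _ => false,
    }
}
```

With something along these lines, the encoder could consult `classify` at each Expression position and pick the matching table; the open question is when the switch actually pays for itself.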

Yoric commented 5 years ago

One way to do this would be to:

1. somehow patch our WebIDL to expand Expression and add @@LiteralExpression and @@ArithmeticExpression;
2. have the Shift=>ast conversion analyze the values it converts and, if useful, produce instances of @@LiteralExpression or @@ArithmeticExpression instead of the more general variants (sketched below);
3. similarly, convert back from @@LiteralExpression or @@ArithmeticExpression into their respective general variants when converting ast=>Shift.
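Reusing the hypothetical `Expr`, `Sublanguage`, and `classify` definitions from the sketch above, steps 2 and 3 could look roughly like this; the variant names mirror the @@-variants proposed here, not actual binjs-ref types.

```rust
/// Hypothetical tagged form produced by the Shift=>ast direction.
enum TaggedExpr {
    /// Corresponds to the proposed @@LiteralExpression.
    Literal(Expr),
    /// Corresponds to the proposed @@ArithmeticExpression.
    Arithmetic(Expr),
    /// The unchanged, general Expression.
    General(Expr),
}

/// Step 2: Shift=>ast specializes when the analysis says it is safe.
fn specialize(expr: Expr) -> TaggedExpr {
    match classify(&expr) {
        Sublanguage::Literal => TaggedExpr::Literal(expr),
        Sublanguage::Arithmetic => TaggedExpr::Arithmetic(expr),
        Sublanguage::General => TaggedExpr::General(expr),
    }
}

/// Step 3: ast=>Shift drops the tag again, so the round-trip is lossless.
fn generalize(tagged: TaggedExpr) -> Expr {
    match tagged {
        TaggedExpr::Literal(e) | TaggedExpr::Arithmetic(e) | TaggedExpr::General(e) => e,
    }
}
```

The important property is that `generalize(specialize(e))` returns `e` unchanged, so the specialization is purely an encoding-side hint.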

Yoric commented 5 years ago

Perhaps we could somehow inject a node

```webidl
interface BinASTExpressionWithProbabilityTable {
   expression: Expression;
   table: string;
}
```

into Expression, and then rely on heuristics to decide when to replace an expression `foo` with `BinASTExpressionWithProbabilityTable { foo, table }`.
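One plausible heuristic, sketched below with the same hypothetical types as above: since the wrapper node and its table name also cost bits to encode, only wrap subtrees that are large enough for the specialized table to pay off. The threshold and the table names are arbitrary placeholders, not measured values or real binjs-ref identifiers.

```rust
/// Arbitrary illustration threshold: below this many nodes, the wrapper
/// is assumed to cost more than the specialized table saves.
const MIN_SUBTREE_SIZE: usize = 8;

/// Count the nodes in a subtree of the hypothetical `Expr` defined above.
fn subtree_size(expr: &Expr) -> usize {
    match expr {
        Expr::NumberLiteral(_) | Expr::StringLiteral(_) | Expr::BoolLiteral(_)
        | Expr::Null | Expr::Identifier(_) => 1,
        Expr::Array(items) => 1 + items.iter().map(subtree_size).sum::<usize>(),
        Expr::Object(props) => 1 + props.iter().map(|(_, v)| subtree_size(v)).sum::<usize>(),
        Expr::Binary { left, right, .. } => 1 + subtree_size(left) + subtree_size(right),
        Expr::Call { callee, args } => {
            1 + subtree_size(callee) + args.iter().map(subtree_size).sum::<usize>()
        }
    }
}

/// Decide whether to wrap `expr` in BinASTExpressionWithProbabilityTable,
/// and if so, which table name to record.
fn table_for(expr: &Expr) -> Option<&'static str> {
    if subtree_size(expr) < MIN_SUBTREE_SIZE {
        return None;
    }
    match classify(expr) {
        Sublanguage::Literal => Some("literal"),
        Sublanguage::Arithmetic => Some("arithmetic"),
        Sublanguage::General => None,
    }
}
```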

Yoric commented 5 years ago

For the time being, I'm going to assume that we can switch the probability tables, but not the content dictionaries.
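A rough sketch of what that restriction could mean for the encoder state, again with hypothetical names rather than the actual binjs-ref structures: several probability tables that can be swapped per subtree, all of them indexing into the same shared content dictionaries.

```rust
use std::collections::HashMap;

/// Hypothetical per-sublanguage probability table for the entropy coder.
struct ProbabilityTable {
    /// Symbol index => frequency.
    frequencies: HashMap<u32, u32>,
}

/// Hypothetical encoder context.
struct EncoderContext {
    /// One table per sublanguage ("general", "literal", "arithmetic", ...).
    tables: HashMap<String, ProbabilityTable>,
    /// Name of the table currently in use.
    current: String,
    /// Shared content dictionary (e.g. string literals); never switched.
    strings: Vec<String>,
}

impl EncoderContext {
    /// Switch probability tables for a subtree, returning the previous
    /// table name so the caller can restore it when leaving the subtree.
    fn switch_table(&mut self, name: &str) -> String {
        std::mem::replace(&mut self.current, name.to_string())
    }
}
```

The point is only that `strings` (and the other content dictionaries) stay shared across tables, so switching tables does not duplicate any prelude data.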

Yoric commented 5 years ago

On the Facebook test set, I see a very, very slight improvement (less than 1/1000).

Yoric commented 5 years ago

Testing with https://github.com/binast/binjs-ref/pull/372

Data

| File | raw (bytes) | brotli (bytes) | ratio vs. master |
|---|---:|---:|---:|
| js | 43134534 | 8016723 | 1 |
| binjs | 8930118 | 8860979 | 1.00058085996408 |
| floats.content | 336910 | 110130 | 1 |
| floats.prelude | 126094 | 72487 | 1 |
| identifier_names.content | 2524124 | 997583 | 1 |
| identifier_names.prelude | 86304 | 52927 | 1 |
| identifier_names_len.prelude | 26637 | 15806 | 1 |
| list_lengths.content | 1985830 | 550497 | 1 |
| list_lengths.prelude | 10284 | 12132 | 1 |
| main.entropy | 1598083 | 1590968 | 0.993700399861591 |
| property_keys.content | 1630992 | 986205 | 1 |
| property_keys.prelude | 3150015 | 1010811 | 1 |
| property_keys_len.prelude | 220936 | 144541 | 1 |
| string_literals.content | 1813042 | 889629 | 1.01520359326537 |
| string_literals.prelude | 6207076 | 1977802 | 1.00089320189855 |
| string_literals_len.prelude | 380927 | 250932 | 1.00120096875486 |
| unsigned_longs.content | 449125 | 91089 | 1 |
| unsigned_longs.prelude | 4489 | 6337 | 1 |

Note: It's not clear to me why there is any impact on `string_literals*.prelude`.

Result

~0.7% improvement on the main stream (main.entropy), ~1.5% regression on string_literals.content. Overall, a slight regression (~0.06% on the total binjs size). I suspect that we can improve on this, though.