binast / binjs-ref

Reference implementation for the JavaScript Binary AST format
https://binast.github.io/binjs-ref/binjs/index.html

Embedded sublanguages #296

Open Yoric opened 5 years ago

Yoric commented 5 years ago

JavaScript has subsets that we can probably encode more efficiently, either with different dictionaries or different strategies.

For instance, if we know from earlier analysis that an expression is a JSON-style literal declaration, we could switch, for this subtree, to a dictionary that assigns non-zero probabilities only to literals. This would make compression more efficient on that subtree.

Similarly, if we know from earlier analysis that an expression is an arithmetic expression, we could switch to an arithmetic-expression dictionary.
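A minimal Rust sketch of the kind of analysis this would require. All types and names here are hypothetical illustrations, not binjs-ref APIs: the idea is just to classify an expression subtree as literal-only, arithmetic-only, or general, so the encoder could pick a specialized dictionary/probability table for it.

```rust
/// Hypothetical, simplified expression AST (not the binjs-ref AST).
#[derive(Debug, Clone)]
enum Expr {
    NumberLiteral(f64),
    StringLiteral(String),
    BoolLiteral(bool),
    Null,
    Array(Vec<Expr>),
    Object(Vec<(String, Expr)>),
    Binary { op: char, left: Box<Expr>, right: Box<Expr> },
    Identifier(String),
    Call { callee: Box<Expr>, args: Vec<Expr> },
}

/// Sublanguage a subtree belongs to; each one would map to its own
/// probability table in the entropy coder.
#[derive(Debug, PartialEq)]
enum Sublanguage {
    /// JSON-style literal: only literals, plus arrays/objects of literals.
    Literal,
    /// Arithmetic over numbers and identifiers.
    Arithmetic,
    /// Anything else: fall back to the general table.
    General,
}

fn classify(expr: &Expr) -> Sublanguage {
    if is_literal(expr) {
        Sublanguage::Literal
    } else if is_arithmetic(expr) {
        Sublanguage::Arithmetic
    } else {
        Sublanguage::General
    }
}

/// True if the subtree only contains JSON-style literal constructs.
fn is_literal(expr: &Expr) -> bool {
    match expr {
        Expr::NumberLiteral(_) | Expr::StringLiteral(_) | Expr::BoolLiteral(_) | Expr::Null => true,
        Expr::Array(items) => items.iter().all(is_literal),
        Expr::Object(props) => props.iter().all(|(_, v)| is_literal(v)),
        _ => false,
    }
}

/// True if the subtree is a pure arithmetic expression over numbers and names.
fn is_arithmetic(expr: &Expr) -> bool {
    match expr {
        Expr::NumberLiteral(_) | Expr::Identifier(_) => true,
        Expr::Binary { op, left, right } => {
            matches!(*op, '+' | '-' | '*' | '/') && is_arithmetic(left) && is_arithmetic(right)
        }
        _ => false,
    }
}
```

With something along these lines, the encoder could consult `classify` at each Expression position and pick the matching table; the open question is when the switch actually pays for itself.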

Yoric commented 5 years ago

One way to do this would be to:

1. somehow patch our WebIDL to expand Expression and add @@LiteralExpression and @@ArithmeticExpression;
2. have the Shift=>ast conversion analyze the values it converts and, if useful, produce instances of @@LiteralExpression or @@ArithmeticExpression instead of the more general variants (sketched below);
3. similarly, convert back from @@LiteralExpression or @@ArithmeticExpression into their respective general variants when converting ast=>Shift.
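Reusing the hypothetical `Expr`, `Sublanguage`, and `classify` definitions from the sketch above, steps 2 and 3 could look roughly like this; the variant names mirror the @@-variants proposed here, not actual binjs-ref types.

```rust
/// Hypothetical tagged form produced by the Shift=>ast direction.
enum TaggedExpr {
    /// Corresponds to the proposed @@LiteralExpression.
    Literal(Expr),
    /// Corresponds to the proposed @@ArithmeticExpression.
    Arithmetic(Expr),
    /// The unchanged, general Expression.
    General(Expr),
}

/// Step 2: Shift=>ast specializes when the analysis says it is safe.
fn specialize(expr: Expr) -> TaggedExpr {
    match classify(&expr) {
        Sublanguage::Literal => TaggedExpr::Literal(expr),
        Sublanguage::Arithmetic => TaggedExpr::Arithmetic(expr),
        Sublanguage::General => TaggedExpr::General(expr),
    }
}

/// Step 3: ast=>Shift drops the tag again, so the round-trip is lossless.
fn generalize(tagged: TaggedExpr) -> Expr {
    match tagged {
        TaggedExpr::Literal(e) | TaggedExpr::Arithmetic(e) | TaggedExpr::General(e) => e,
    }
}
```

The important property is that `generalize(specialize(e))` returns `e` unchanged, so the specialization is purely an encoding-side hint.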

Yoric commented 5 years ago

Perhaps we could somehow inject a node

```webidl
interface BinASTExpressionWithProbabilityTable {
   expression: Expression;
   table: string;
}
```

into Expression, and then rely on heuristics to decide when to replace an expression `foo` with `BinASTExpressionWithProbabilityTable { foo, table }`.
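One plausible heuristic, sketched below with the same hypothetical types as above: since the wrapper node and its table name also cost bits to encode, only wrap subtrees that are large enough for the specialized table to pay off. The threshold and the table names are arbitrary placeholders, not measured values or real binjs-ref identifiers.

```rust
/// Arbitrary illustration threshold: below this many nodes, the wrapper
/// is assumed to cost more than the specialized table saves.
const MIN_SUBTREE_SIZE: usize = 8;

/// Count the nodes in a subtree of the hypothetical `Expr` defined above.
fn subtree_size(expr: &Expr) -> usize {
    match expr {
        Expr::NumberLiteral(_) | Expr::StringLiteral(_) | Expr::BoolLiteral(_)
        | Expr::Null | Expr::Identifier(_) => 1,
        Expr::Array(items) => 1 + items.iter().map(subtree_size).sum::<usize>(),
        Expr::Object(props) => 1 + props.iter().map(|(_, v)| subtree_size(v)).sum::<usize>(),
        Expr::Binary { left, right, .. } => 1 + subtree_size(left) + subtree_size(right),
        Expr::Call { callee, args } => {
            1 + subtree_size(callee) + args.iter().map(subtree_size).sum::<usize>()
        }
    }
}

/// Decide whether to wrap `expr` in BinASTExpressionWithProbabilityTable,
/// and if so, which table name to record.
fn table_for(expr: &Expr) -> Option<&'static str> {
    if subtree_size(expr) < MIN_SUBTREE_SIZE {
        return None;
    }
    match classify(expr) {
        Sublanguage::Literal => Some("literal"),
        Sublanguage::Arithmetic => Some("arithmetic"),
        Sublanguage::General => None,
    }
}
```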

Yoric commented 5 years ago

For the time being, I'm going to assume that we can switch the probability tables, but not the content dictionaries.
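A rough sketch of what that restriction could mean for the encoder state, again with hypothetical names rather than the actual binjs-ref structures: several probability tables that can be swapped per subtree, all of them indexing into the same shared content dictionaries.

```rust
use std::collections::HashMap;

/// Hypothetical per-sublanguage probability table for the entropy coder.
struct ProbabilityTable {
    /// Symbol index => frequency.
    frequencies: HashMap<u32, u32>,
}

/// Hypothetical encoder context.
struct EncoderContext {
    /// One table per sublanguage ("general", "literal", "arithmetic", ...).
    tables: HashMap<String, ProbabilityTable>,
    /// Name of the table currently in use.
    current: String,
    /// Shared content dictionary (e.g. string literals); never switched.
    strings: Vec<String>,
}

impl EncoderContext {
    /// Switch probability tables for a subtree, returning the previous
    /// table name so the caller can restore it when leaving the subtree.
    fn switch_table(&mut self, name: &str) -> String {
        std::mem::replace(&mut self.current, name.to_string())
    }
}
```

The point is only that `strings` (and the other content dictionaries) stay shared across tables, so switching tables does not duplicate any prelude data.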

Yoric commented 5 years ago

On the Facebook test set, I see a very, very slight improvement (less than 1/1000).

Yoric commented 5 years ago

Testing with https://github.com/binast/binjs-ref/pull/372

Data

| File | raw (bytes) | brotli (bytes) | ratio vs. master |
|---|---:|---:|---:|
| js | 43134534 | 8016723 | 1 |
| binjs | 8930118 | 8860979 | 1.00058085996408 |
| floats.content | 336910 | 110130 | 1 |
| floats.prelude | 126094 | 72487 | 1 |
| identifier_names.content | 2524124 | 997583 | 1 |
| identifier_names.prelude | 86304 | 52927 | 1 |
| identifier_names_len.prelude | 26637 | 15806 | 1 |
| list_lengths.content | 1985830 | 550497 | 1 |
| list_lengths.prelude | 10284 | 12132 | 1 |
| main.entropy | 1598083 | 1590968 | 0.993700399861591 |
| property_keys.content | 1630992 | 986205 | 1 |
| property_keys.prelude | 3150015 | 1010811 | 1 |
| property_keys_len.prelude | 220936 | 144541 | 1 |
| string_literals.content | 1813042 | 889629 | 1.01520359326537 |
| string_literals.prelude | 6207076 | 1977802 | 1.00089320189855 |
| string_literals_len.prelude | 380927 | 250932 | 1.00120096875486 |
| unsigned_longs.content | 449125 | 91089 | 1 |
| unsigned_longs.prelude | 4489 | 6337 | 1 |

Note: It's not clear to me why there is any impact on `string_literals*.prelude`.

Result

~0.7% improvement on the main stream (main.entropy), ~1.5% regression on string_literals.content. Overall, a slight regression (~0.06% on the total binjs size). I suspect that we can improve on this, though.