kannanvijayan-zz opened 5 years ago
Well, as a first remark, we'd like to handle both `foo.x` (identifier name => property key) and `foo.prototype.toString` (identifier name => property key and property key => property key). Whatever we do, I'd like to use something a bit generic.
Also, in terms of AST, the former is:

```
StaticMemberExpression:
  _object:
    IdentifierExpression
      foo (IdentifierName)
  property:
    x (PropertyKey)
```

The order of strings is `[(IdentifierName) foo, (PropertyKey) x]`.
While the latter is:

```
StaticMemberExpression:
  _object:
    StaticMemberExpression:
      _object:
        IdentifierExpression
          foo (IdentifierName)
      property:
        prototype (PropertyKey)
  property:
    toString (PropertyKey)
```

The order of strings is `[(IdentifierName) foo, (PropertyKey) prototype, (PropertyKey) toString]`.
So the order is as we expect. Good. Correlations across tables, not so good.
> Random ideas/questions for the moment:
>
> * the intuition, for the moment, is that this applies only to `StaticMemberExpression`, how do we determine whether it would work for other patterns?

Note that if, as I believe, we're only interested in `foo.x` and `foo.x.y`, we could also encode this as backreferences to nodes in the AST. Possibly by using a window prediction for `StaticMemberExpression`, which would default to regular (i.e. path-based) prediction in case of a MISS.
Filed as #226.
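The MISS-fallback idea above (window prediction first, regular path-based prediction otherwise) can be sketched roughly as follows. This is a hypothetical illustration; the names `NodeWindow` and `predict` are invented here and are not binjs-ref code.

```python
# Hypothetical sketch: window prediction with a path-based fallback on MISS.
# Names (NodeWindow, predict) are illustrative, not from binjs-ref.
class NodeWindow:
    def __init__(self):
        self.recent = {}  # backreferences to recently seen nodes, by key

    def lookup(self, key):
        return self.recent.get(key)  # None means MISS

    def remember(self, key, node):
        self.recent[key] = node

def predict(window, key, path_based):
    """Try the window first; default to regular (path-based) prediction on MISS."""
    hit = window.lookup(key)
    if hit is not None:
        return hit
    return path_based(key)

window = NodeWindow()
path_based = lambda key: ("path-prediction", key)
# First sighting: MISS, falls back to the path-based predictor.
assert predict(window, "StaticMemberExpression", path_based) == ("path-prediction", "StaticMemberExpression")
window.remember("StaticMemberExpression", ("backref", 42))
# Subsequent sighting: HIT on the window backreference.
assert predict(window, "StaticMemberExpression", path_based) == ("backref", 42)
```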
> Random ideas/questions for the moment:
>
> * the intuition, for the moment, is that this applies only to `StaticMemberExpression`, how do we determine whether it would work for other patterns?
The algorithm as described above does try to model all relationships. To describe this a bit more formally, let's say that every string has a kind (identifier, propertyname, or literal), accessible with `k(S)` for some string `S`.

When it's about to encode/decode a string `S` of kind `ks = k(S)`, it looks up the most recently encoded string `P` with kind `kp = k(P)`.

We then take the follow set of kind `ks` of `P` - i.e. the `ks`-kind cache of follower strings from `P` - and add that to the string window.

After encoding/decoding the string-ref to `S`, `S` is added to the follow-set of `P` under kind `ks`.
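As a sketch, the follow-set bookkeeping described above might look like the following. This is an illustrative Python model with invented names (`FollowSets`, `candidates`, `record`), not binjs-ref code; the real design would also key on `kp`, the kind of `P`.

```python
# Illustrative model of the follow-set bookkeeping (not binjs-ref code).
from collections import defaultdict

IDENT, PROP, LITERAL = "ident", "prop", "literal"

class FollowSets:
    def __init__(self):
        # (previous string P, kind of string being encoded) -> ordered followers
        self.sets = defaultdict(list)
        self.prev = None  # most recently encoded/decoded string P

    def candidates(self, ks):
        """Follow set of kind `ks` for the most recent string P,
        to be injected into the string window before encoding."""
        if self.prev is None:
            return []
        return list(self.sets[(self.prev, ks)])

    def record(self, s, ks):
        """After encoding/decoding S, add it to P's follow-set under kind ks."""
        if self.prev is not None:
            followers = self.sets[(self.prev, ks)]
            if s not in followers:
                followers.append(s)
        self.prev = s

# Encoding `foo.x`, then encountering `foo` again later:
fs = FollowSets()
fs.record("foo", IDENT)  # P = "foo"
fs.record("x", PROP)     # "x" joins follow-set("foo", ofType=PropName)
fs.record("foo", IDENT)  # later occurrence of `foo`
assert fs.candidates(PROP) == ["x"]  # "x" is predicted for the next property
```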
For the `foo.x` case, we will have:

```
S = "x",   ks = k(S) = PropName
P = "foo", kp = k(P) = Ident
```

For the `foo.prototype.x` case, we will have:

```
S = "x",         ks = k(S) = PropName
P = "prototype", kp = k(P) = PropName
```

In the first case, we'd take the `follow-set(Identifier, "foo", ofType=PropName)` and add it to the string-cache before encoding/decoding. In the second case, we'd take the `follow-set(PropName, "prototype", ofType=PropName)` and add it to the string cache before encoding/decoding.
> * how often does it actually happen? is this often enough that it would actually affect the file size?

To confirm this intuition with measurements, you basically have to do the same thing as implementing the approach and seeing how well it works. "Estimating" the effect on a given file requires actually simulating these relationships, at which point you also have what you need to implement the encode/decode.
> * do we have any hope that this would [beat brotli](https://github.com/binast/binjs-ref/issues/208#issuecomment-439881818) for compressing our own stringrefs? if it doesn't, we should concentrate on either using brotli or replicating the techniques that make sense in our context.

This experiment is a bit skewed - we are bringing all of brotli's machinery to bear (including structural matching, etc.) on a subcomponent of our data arranged in a particular way - while brotli's performance against our subcomponent data is being compared against our much simpler approaches for that particular stream.

The more relevant focus should be the total filesize when compressing all of the relevant payload data. If we can beat brotli there, then the fact that (for now) our string-ref compression is technically worse than brotli's performance on a separate stream of pure string-ref varuints is somewhat irrelevant, IMHO.
> > * the intuition, for the moment, is that this applies only to `StaticMemberExpression`, how do we determine whether it would work for other patterns?
>
> The algorithm as described above does try to model all relationships.

I mean that my intuition is that this applies only to `StaticMemberExpression`. Do you have other examples where it would show up?
> The more relevant focus should be the total filesize when compressing all of the relevant payload data.

More precisely, I would say that the more relevant focus should be defined in #227 :)
Early results on #279 seem to suggest that it's not something we should do. To be continued.
This is write-up from a discussion on IRC. The comments in #209 suggest splitting the string-misses into components and encoding those using static predictor tables, instead of the string-table carrying global probability information.
One goal in general is to reduce the number of string misses in the first place, and increase the probability of hits in the string window.
We observe in JS source code some "clustering" of string usage - usage of particular strings will tend to "cluster". For example, if you saw `foo.x` a long time ago (and both `foo` and `x` have been evicted from the string window), and you subsequently see `foo.`, we should be able to predict that `x` is more likely to follow than otherwise - even if it does not currently occur in the string-window for property names.

The set of names that occur will also diverge based on context.
For example, the property-access in `foo.bar()` - a method call - would typically access a different namespace than a read of `foo.bar`. And some `if(foo.X)` is likely to see the `X` show up next time we see `if(foo...)`.

We already have a mechanism for "finding" predicted strings - the string window. If we can adjust that window cache with dynamic updates, we should be able to enhance the hit rates of the tables (and the distribution of probabilities as well).
This requires no changes to the move-to-front window cache prediction; we simply adjust how the cache is filled. The main question we want to be able to answer at each point is: "given the preceding string-ref (e.g. the `foo` in `foo.x`), what were the strings seen after it?"

This should be answerable globally in a simple way, agnostic to the tree itself.
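The move-to-front window itself can be modeled as a simple recency cache: a HIT is encoded as a small index, a MISS forces the full string out, and pre-loading predicted followers is just an ordinary cache update. This is an illustrative model (class and method names invented here), not the actual binjs-ref implementation.

```python
# Illustrative move-to-front string window (not the actual binjs-ref code).
class StringWindow:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []  # most recent first

    def encode(self, s):
        """Return a small index on HIT, or None on MISS; move s to front."""
        if s in self.entries:
            index = self.entries.index(s)
            self.entries.remove(s)
            self.entries.insert(0, s)
            return index  # HIT: small indices are cheap to encode
        self.entries.insert(0, s)
        del self.entries[self.capacity:]
        return None  # MISS: the full string must be emitted

    def inject(self, followers):
        """Pre-load predicted followers so the next encode is more likely to HIT."""
        for s in reversed(followers):
            if s in self.entries:
                self.entries.remove(s)
            self.entries.insert(0, s)
        del self.entries[self.capacity:]

w = StringWindow(capacity=4)
assert w.encode("x") is None                   # first sighting: MISS
for s in ("a", "b", "c", "d"):                 # enough traffic to evict "x"
    w.encode(s)
assert "x" not in w.entries
w.inject(["x"])                                # follow-set predicted "x"
assert w.encode("x") == 0                      # HIT at index 0 instead of a MISS
```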
When a given string `S` shows up, what we want to do is find the identifiers, property names, and string literals that have previously shown up after `S`, and add them to our string windows. We do this before we try to predict using the string window, and allow the string window index frequencies to pick up the gains.

Let's say we maintain the following global map (during decode/encode, not transmitted, and initialized to empty at start):

```
StringRelationshipMap: StringRef => { idents: Cache<StringRef>, literals: Cache<StringRef>, props: Cache<StringRef> }
```
The map is updated as follows AFTER a new string has been encoded: the string just encoded is added to the follower cache of the matching kind for the previously emitted StringRef.

The map is used as follows BEFORE encoding a string at a particular location:

1. We know the kind of string we are encoding (ident, literal, or prop).
2. Find the most recently emitted StringRef, and the kind of stringref it was.
3. Collect the followers of that string which are of the same kind (ident/lit/prop) as the one being emitted.
4. Add that follower set to the window.
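The `Cache<StringRef>` values in the map presumably need a size bound so follower sets don't grow without limit. A minimal LRU sketch of such a cache (the eviction policy is an assumption here, the write-up doesn't specify one):

```python
# Hypothetical bounded follower cache for Cache<StringRef>;
# the actual eviction policy isn't specified in the write-up.
from collections import OrderedDict

class FollowerCache:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = OrderedDict()  # keys in recency order, oldest first

    def add(self, string_ref):
        # Refresh recency on re-insertion, evict the oldest on overflow.
        self.entries.pop(string_ref, None)
        self.entries[string_ref] = True
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)

    def followers(self):
        # Most recently seen followers first, for injection into the window.
        return list(reversed(self.entries))

c = FollowerCache(capacity=2)
c.add("x"); c.add("y"); c.add("z")   # "x" is evicted
assert c.followers() == ["z", "y"]
```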
This approach should allow the string-window to "pull in" strings that have left the window, based on the fact that they had a prior relationship with recently emitted strings.
Pinging @Yoric for feedback.