binast / binjs-ref

Reference implementation for the JavaScript Binary AST format
https://binast.github.io/binjs-ref/binjs/index.html

Encode string-window misses using string-relationship maps #223

Open kannanvijayan-zz opened 5 years ago

kannanvijayan-zz commented 5 years ago

This is a write-up of a discussion on IRC. The comments in #209 suggest splitting the string misses into components and encoding those using static predictor tables, instead of having the string table carry global probability information.

One general goal is to reduce the number of string misses in the first place, and to increase the probability of hits in the string window.

We observe some "clustering" of string usage in JS source code: particular strings tend to appear together. For example, if you saw foo.x a long time ago (and both foo and x have since been evicted from the string window), and you subsequently see foo., we should be able to predict that x is more likely to follow than otherwise - even if it does not currently occur in the string window for property names.

The set of names that occur will also diverge based on context.

For example, the property access in a method call foo.bar() typically draws from a different namespace than a read foo.bar. And an if (foo.X) makes it likely that X shows up again the next time we see if (foo...).

We already have a mechanism for "finding" predicted strings - the string window. If we can adjust that window cache with dynamic updates, we should be able to enhance the hit rates of the tables (and the distribution of probabilities as well).

This requires no changes to the move-to-front window-cache prediction; we simply adjust how the cache is filled. The main question we want to be able to answer at each point is: "given the preceding string-ref (e.g. the foo in foo.x), what were the strings seen after it?"

This should be answerable globally in a simple way, agnostic to the tree itself.

When a given string S shows up, we want to find the identifiers, property names, and string literals that have previously appeared after S, and add them to our string windows. We do this before we try to predict using the string window, and let the string-window index frequencies pick up the gains.
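To make the window mechanism concrete, here is a minimal move-to-front window cache sketched in Python. This is not the binjs-ref implementation; the capacity, eviction policy, and method names are assumptions for the example. It shows why pre-inserting followers matters: a string that has been evicted becomes a hit again once it is re-inserted.

```python
from collections import deque

class StringWindow:
    """Minimal move-to-front window cache (illustrative sketch only)."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = deque()  # front = most recently used

    def lookup(self, s):
        """Return the window index on a hit (moving s to the front), or None on a miss."""
        try:
            idx = self.entries.index(s)
        except ValueError:
            return None  # miss: the string must be emitted some other way
        del self.entries[idx]
        self.entries.appendleft(s)
        return idx

    def insert(self, s):
        """Add s at the front, evicting from the back if the window is full."""
        if s in self.entries:
            self.entries.remove(s)
        elif len(self.entries) >= self.capacity:
            self.entries.pop()
        self.entries.appendleft(s)

w = StringWindow(capacity=4)
for name in ["foo", "x", "bar", "baz", "quux"]:
    w.insert(name)
# "foo" has been evicted by now; priming it from a relationship map
# (as proposed below) would restore a window hit.
assert w.lookup("foo") is None
w.insert("foo")
assert w.lookup("foo") == 0
```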

Let's say we maintain the following global map (during decode/encode, not transmitted, and initialized to empty at start):

StringRelationshipMap: StringRef => { idents: Cache<StringRef>, literals: Cache<StringRef>, props: Cache<StringRef> }

The map is updated as follows AFTER a new string has been encoded:

update_relationship_map(rel_map, string_ref) {
  // Find the identifier, literal, and prop-name most recently emitted before this string_ref.
  let prior_ident = prior_identifier(string_ref);
  let prior_lit = prior_literal(string_ref);
  let prior_prop = prior_property_name(string_ref);

  // Add this string to the "follows after" set of each prior string (of each kind).
  if prior_ident {
    add_follower(rel_map[prior_ident], string_ref);
  }
  if prior_lit {
    add_follower(rel_map[prior_lit], string_ref);
  }
  if prior_prop {
    add_follower(rel_map[prior_prop], string_ref);
  }
}

add_follower(rel_entry, string_ref) {
  if is_identifier(string_ref) {
    rel_entry.idents.add(string_ref);
  } else if is_literal(string_ref) {
    rel_entry.literals.add(string_ref);
  } else if is_property_name(string_ref) {
    rel_entry.props.add(string_ref);
  }
}
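The two update routines above can be sketched together as runnable Python. The kind names and the `RelationshipTracker` layout are illustrative assumptions, not binjs-ref code:

```python
from collections import defaultdict

# Kinds of string references, as in the write-up (assumed names).
IDENT, LITERAL, PROP = "ident", "literal", "prop"

class RelationshipTracker:
    """Sketch of the StringRelationshipMap update described above."""

    def __init__(self):
        # string -> {kind -> set of follower strings of that kind}
        self.rel_map = defaultdict(
            lambda: {IDENT: set(), LITERAL: set(), PROP: set()})
        # Most recently emitted string of each kind
        # (stands in for prior_identifier / prior_literal / prior_property_name).
        self.last = {IDENT: None, LITERAL: None, PROP: None}

    def update(self, string, kind):
        """Called AFTER `string` (of `kind`) has been encoded."""
        # Add this string to the "follows after" set of each prior string.
        for prior in self.last.values():
            if prior is not None:
                self.rel_map[prior][kind].add(string)
        self.last[kind] = string

t = RelationshipTracker()
t.update("foo", IDENT)   # foo.x: the identifier comes first...
t.update("x", PROP)      # ...then the property name.
assert "x" in t.rel_map["foo"][PROP]
```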

The map is used as follows BEFORE encoding a string at a particular location.

  1. We know the type of string we are encoding (ident, literal, or prop).

  2. Find the most recently emitted StringRef, and the kind of stringref it was.

  3. Collect the followers of that string, which are of the same kind (ident/lit/prop) as the one being emitted.

  4. Add that follower set to the window.

This approach should allow the string-window to "pull in" strings that have left the window, based on the fact that they had a prior relationship with recently emitted strings.
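Steps 1-4 above can be sketched as follows; `rel_map`, `last_emitted`, and the toy window stand in for the structures described earlier and are assumptions of the example, not binjs-ref code:

```python
def prime_window(window, rel_map, last_emitted, kind_being_encoded):
    """BEFORE encoding a string of `kind_being_encoded`, pull the followers
    of the most recently emitted string (of the matching kind) back into
    the window so they can score window hits."""
    if last_emitted is None:
        return
    followers = rel_map.get(last_emitted, {}).get(kind_being_encoded, set())
    for s in followers:
        window.insert(s)

class ToyWindow:
    """Toy stand-in for the MTF string window: front = index 0."""
    def __init__(self):
        self.entries = []
    def insert(self, s):
        if s in self.entries:
            self.entries.remove(s)
        self.entries.insert(0, s)

# "foo" was previously followed by property names "x" and "y".
rel_map = {"foo": {"prop": {"x", "y"}}}
w = ToyWindow()
prime_window(w, rel_map, "foo", "prop")
assert set(w.entries) == {"x", "y"}
```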

Pinging @Yoric for feedback.

Yoric commented 5 years ago

Well, as a first remark, we'd like to handle both foo.x and foo.x.y.

Whatever we do, I'd like to use something a bit generic.

Also, in terms of AST, the former is:

StaticMemberExpression:
  _object:
    IdentifierExpression
      foo (IdentifierName)
  property:
    x (PropertyKey)

The order of strings is [(IdentifierName) foo, (PropertyKey) x].

StaticMemberExpression:
  _object:
    StaticMemberExpression:
      _object:
        IdentifierExpression
          foo (IdentifierName)
      property:
          prototype (PropertyKey)
  property:
    IdentifierExpression
      toString (IdentifierName)

The order of strings is [(IdentifierName) foo, (PropertyKey) prototype, (PropertyKey) toString].

So the order is as we expect. Good. Correlations across tables, not so good.

Yoric commented 5 years ago

Random ideas/questions for the moment:

* the intuition, for the moment, is that this applies only to `StaticMemberExpression`, how do we determine whether it would work to other patterns?
* how often does it actually happen? is this often enough that it would actually affect the file size?
* do we have any hope that this would [beat brotli](https://github.com/binast/binjs-ref/issues/208#issuecomment-439881818) for compressing our own stringrefs? if it doesn't, we should concentrate on either using brotli or replicating the techniques that make sense in our context.

Yoric commented 5 years ago

Note that if, as I believe, we're only interested in foo.x and foo.x.y, we could also encode this as backreferences to nodes in the AST. Possibly by using a window prediction for StaticMemberExpression, which would default to regular (i.e. path-based) prediction in case of MISS.

Filed as #226.

kannanvijayan-zz commented 5 years ago

Random ideas/questions for the moment:

* the intuition, for the moment, is that this applies only to `StaticMemberExpression`, how do we determine whether it would work to other patterns?

The algorithm as described above does try to model all relationships. To describe this a bit more formally, let's say that every string has a kind (identifier, propertyname, or literal), accessible with k(S) for some string S.

When the encoder/decoder is about to process a string S of kind ks = k(S), it looks up the most recently encoded string P, with kind kp = k(P).

We then take the follow set of kind ks of P - i.e. the ks-kind cache of follower strings from P, and add that to the string window.

After encoding/decoding the string-ref to S, S is added to the follow-set of P under kind ks.

For the foo.x case, we will have:

S = "x", ks = k(S) = PropName
P = "foo", kp = k(P) = Ident

For the foo.prototype.x case, we will have:

S = "x", ks = k(S) = PropName
P = "prototype", kp = k(P) = PropName

In the first case, we'd take the follow-set(Identifier, "foo", ofType=PropName) and add it to the string-cache before encoding/decoding.

In the second case, we'd take the follow-set(PropName, "prototype", ofType=PropName) and add it to the string-cache before encoding/decoding.
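A toy version of these two lookups; the nested-dict layout and the sample follower sets are assumed illustrative data, not measurements:

```python
# follow_set(kind_of_P, P, of_type=ks): (kind, string) -> {kind -> followers}.
follow_sets = {
    ("Ident", "foo"):          {"PropName": {"x", "bar"}},
    ("PropName", "prototype"): {"PropName": {"x", "toString"}},
}

def follow_set(kind_p, p, of_type):
    """Followers of string p (of kind kind_p) that have kind of_type."""
    return follow_sets.get((kind_p, p), {}).get(of_type, set())

# foo.x: the prior string is the identifier "foo"; we want PropName followers.
assert "x" in follow_set("Ident", "foo", of_type="PropName")
# foo.prototype.x: the prior string is the property name "prototype".
assert "x" in follow_set("PropName", "prototype", of_type="PropName")
```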

* how often does it actually happen? is this often enough that it would actually affect the file size?

To confirm this intuition with measurements, you basically have to implement the approach and see how well it works. "Estimating" the effect on a given file requires actually simulating these relationships, at which point you also have what you need to implement the encode/decode.

* do we have any hope that this would [beat brotli](https://github.com/binast/binjs-ref/issues/208#issuecomment-439881818) for compressing our own stringrefs? if it doesn't, we should concentrate on either using brotli or replicating the techniques that make sense in our context.

This experiment is a bit skewed - we are bringing all of brotli's machinery (including structural matching, etc.) to bear on a subcomponent of our data arranged in a particular way, while brotli's performance on that subcomponent is being compared against our much simpler approaches for that particular stream.

The more relevant focus should be the total filesize when compressing all of the relevant payload data. If we can beat brotli there, then the fact that (for now) our string-ref compression is technically worse than brotli's performance on a separate stream of pure string-ref varuints is somewhat irrelevant, IMHO.

Yoric commented 5 years ago
* the intuition, for the moment, is that this applies only to StaticMemberExpression, how do we determine whether it would work to other patterns?

The algorithm as described above does try to model all relationships.

I mean that my intuition is that this applies only to StaticMemberExpression. Do you have other examples where it would show up?

Yoric commented 5 years ago

The more relevant focus should be the total filesize when compressing all of the relevant payload data.

More precisely, I would say that the more relevant focus should be defined in #227 :)

Yoric commented 5 years ago

Related: https://github.com/binast/binjs-ref/issues/279 .

Yoric commented 5 years ago

Early results on #279 seem to suggest that it's not something we should do. To be continued.