foxyproxy / firefox-extension

FoxyProxy for Firefox extension beginning with Firefox 57 (Quantum)
GNU General Public License v2.0
525 stars 115 forks

Use domain names prefix tree (trie) for templates storage #182

Closed KOLANICH closed 2 years ago

KOLANICH commented 2 years ago

Sometimes we have a list of domains to access through a proxy. So we generate a prefix tree, or even better a MAFSA (a DAG), from this list of domains in order to optimize lookup and reduce the memory footprint. Each domain is split by dot, the resulting array is reversed (so the root domain goes first), and then the result is fed into a library that generates and serializes tries/MAFSAs. Matching is then easy: just do the same preprocessing and walk the DAG.
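To make the idea concrete, here is a minimal sketch of the split-reverse-insert approach (all names are hypothetical, and plain nested objects stand in for a real trie/MAFSA library):

```javascript
// Hypothetical sketch of a reversed-domain prefix tree.
// "terminal" is used as a reserved marker key here; real code would use
// a Symbol or a Map to avoid colliding with a label actually named "terminal".
function buildTrie(domains) {
    const root = {};
    for (const domain of domains) {
        let node = root;
        // Split on dots and reverse so the root domain ("com") comes first.
        for (const label of domain.split(".").reverse()) {
            if (!(label in node)) node[label] = {};
            node = node[label];
        }
        node.terminal = true; // marks the end of a stored domain
    }
    return root;
}

function matches(root, host) {
    let node = root;
    for (const label of host.split(".").reverse()) {
        node = node[label];
        if (node === undefined) return false;
        if (node.terminal) return true; // any subdomain of a stored domain matches
    }
    return false;
}
```

With `buildTrie(["example.com"])`, both `example.com` and `a.example.com` match, while `example.org` does not.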

This should be useful when there are a lot of domains to be checked against, not just two.

The DAG can also be easily constructed by hand using YAML syntax.

If needed, a trie can easily be transformed into a regular expression doing the same matching, which can work even faster because regexps can be JITted.
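As a rough illustration of that transformation (hypothetical helper names; the trie here is assumed to be a plain nested object where the key `terminal` marks a stored domain), a reversed-domain trie can be flattened into a single regexp source:

```javascript
// Hypothetical sketch: flatten a reversed-domain trie into one regexp source,
// so a single compiled (and possibly JIT-ed) pattern replaces many lookups.
function trieToRegexSource(node) {
    const parts = [];
    for (const [label, child] of Object.entries(node)) {
        if (label === "terminal") continue;
        // Escape regexp metacharacters in the label.
        const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
        if (child.terminal) {
            // A stored domain: the domain itself or any of its subdomains matches,
            // so the subtree below a terminal node is redundant.
            parts.push(escaped + "(?:$|\\..*)");
        } else {
            parts.push(escaped + "\\." + trieToRegexSource(child));
        }
    }
    return parts.length > 1 ? "(?:" + parts.join("|") + ")" : parts[0];
}

// Matching is done against the host with its labels reversed, as in the trie.
function makeMatcher(trie) {
    const re = new RegExp("^" + trieToRegexSource(trie) + "$");
    return host => re.test(host.split(".").reverse().join("."));
}
```

For example, `makeMatcher({com: {example: {terminal: true}}})` accepts `example.com` and `a.example.com` via one compiled regexp.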

The extension should also have a feature to retrieve the list by URI and update it automatically.

ericjung commented 2 years ago

Thanks for the suggestion. Is this something you're planning on implementing?

KOLANICH commented 2 years ago
  1. For now I'm not planning to implement it.
  2. I can share the implementation of trie lookup I created for a .pac file; I guess it may make sense to use it.
  3. I guess that uBlock Origin has to do similar tasks, and its matching engine may already contain the needed code.
  4. Ideally we need to decouple the storage and updating of lists into a separate addon, and the matching engine constructing a trie into yet another addon, so that the devs of each addon don't have to implement these themselves. Again, uBO is the most likely candidate from which to take the relevant parts. See https://github.com/open-source-ideas/ideas/issues/60 for more info. I'd like to add that the subscription-manager addon should allow 2 models of storage: one in its own storage, and another on the side of the other addon, which controls the storage through message passing. This is needed because we can precompile a trie and store it in serialized form; this way we don't have to regenerate it on every startup.

ericjung commented 2 years ago

I am curious to see your implementation for a PAC file if it handles regular expressions. I don’t understand how a trie is going to work with regular expressions that can match any number of domains and/or subdomains.

KOLANICH commented 2 years ago

It doesn't handle regular expressions; it's just a PAC file:

`my.pac`

```javascript
"use strict";

const neededProxy = "HTTP some.proxy.example:3128";

// A trie node: children are stored in a Map. A null child marks a leaf,
// so everything below a leaf label matches (a trailing wildcard).
function Node(suffixes){
    let processedSuffixes = [];
    for(let s of suffixes){
        if(s instanceof Array){
            let [prefix, childSuffixes] = s;
            s = [prefix, new Node(childSuffixes)];
        } else {
            s = [s, null];
        }
        processedSuffixes.push(s);
    }
    this.suffixes = new Map(processedSuffixes);
}

Node.prototype.get1Level = function(comp){
    if(this.suffixes.has(comp)){
        let res = this.suffixes.get(comp);
        if(res === null){
            return true;
        }
        return res;
    }
    return false;
};

Node.prototype.testSplitted = function(uriSplitted){
    let checker = this;
    for(let comp of uriSplitted){
        checker = checker.get1Level(comp);
        if(!(checker instanceof Node)){
            return checker;
        }
    }
    return false;
};

Node.prototype.test = function(domain){
    return this.testSplitted(domain.split(".").reverse());
};

const domainz = new Node([
    ["a", [
        "b", "c", "d",
        ["e", ["f", "g"]],
        ["h", ["i"]],
    ]],
    ["j", [
        ["k", ["l"]],
        "m",
        ["n", ["o"]],
        "p",
        ["q", ["r"]],
        "s", "t",
    ]],
    ["u", ["v", ["w", ["x"]]]],
    ["y", ["z"]],
]);

function FindProxyForURL(url, host){
    if(url.startsWith("https:") && domainz.test(host)){
        return neededProxy;
    }
    return "DIRECT";
}
```

For regexps there is a tool, https://github.com/ermanh/trieregex , but it only generates regexes and it is in Python. If you would like to rewrite it, it may make sense to rewrite it in Haxe, not directly in JS.

ericjung commented 2 years ago

How do you propose to handle regular expressions in a trie?

KOLANICH commented 2 years ago

At first, don't handle them at all, except for wildcard ones (my example handles wildcards through a special node). In the future it may be possible to parse regexps and integrate them into the automata, but not now; it is a large piece of work. For now it is desirable to achieve just the simplest case: I have a list of domains that must be accessed through a proxy because otherwise there is no access to them. Not regexps, but a pretty long list of domains, line by line.

erosman commented 2 years ago

For reference only: tld service for webextensions

ericjung commented 2 years ago

I think you are suggesting two pattern matching paths; one for regular expressions and one for literals with no wildcards/regular expressions. Is that right?

KOLANICH commented 2 years ago

One for regular expressions and another for flat lists of domains without wildcards in the middle (since there is no backtracking), though wildcards at the end are possible. To be honest, it is possible to unify them, but I don't think that is needed. In fact, I don't think the regexp path is needed at all. Regexps are slow, and applying them one by one is inefficient if you have a lot of them. Their only advantage is that they look familiar. What people really need is structured matching against a parsed URI, not against a URI as a text string. So we need a restricted DSL that is transformed into a state machine.

The human-readable template DSL can look like a wildcard pattern for a domain name with its components inverted: github.com becomes com.github, without a scheme. A pattern is a sequence of tokens separated by dots:

  - `abcd` matches the string abcd literally
  - `/rx/` matches strings matching the regexp
  - `$` means a non-wildcard (exact) match; allowed only at the end
  - `%<number>` borrows the first n tokens from the last full pattern; allowed only at the beginning. It enables stuff like

```
io.github.a
%2.b
%2.c
```

The DSL looks somewhat similar to the tree in the PAC file example and can easily be transformed into the trie and back.

Another ingested format is just a flat list of domains. To convert it into the DSL, each domain name is parsed and the order of its components is inverted; then the leading %<digit> tokens are easily inferred.
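A sketch of that inference (the function name is hypothetical, and factoring only when at least two leading labels are shared is an arbitrary choice for readability):

```javascript
// Hypothetical sketch: convert a flat domain list into the proposed DSL,
// inverting each domain's labels and factoring shared leading labels into %<n>.
function flatListToDsl(domains) {
    const lines = [];
    let prev = []; // inverted labels of the last full (non-%) pattern
    for (const domain of domains) {
        const inverted = domain.split(".").reverse();
        // Count leading labels shared with the last full pattern,
        // always keeping at least one of this pattern's own labels.
        let shared = 0;
        while (shared < Math.min(prev.length, inverted.length - 1) &&
               prev[shared] === inverted[shared]) {
            shared++;
        }
        if (shared >= 2) {
            lines.push("%" + shared + "." + inverted.slice(shared).join("."));
        } else {
            lines.push(inverted.join("."));
            prev = inverted; // % lines keep borrowing from this full pattern
        }
    }
    return lines.join("\n");
}
```

For instance, `flatListToDsl(["a.github.io", "b.github.io", "c.github.io"])` yields the three lines `io.github.a`, `%2.b`, `%2.c` from the example above.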

ericjung commented 2 years ago

The DSL you recommend is not user-friendly. People will stop using FoxyProxy if they have to learn a DSL before using it. As far as I can see, this entire suggestion is about performance improvement, but there are no numbers showing performance degradation. I am sure that as pattern lists grow large, performance is impacted in some way, but it's anyone's guess by how much. Your suggestion might be called "premature optimization" -- optimizing without knowing the impact of the current solution.

Of course the impact depends on many things, like the complexity of the regular expressions or wildcards used. I do agree most people don't need the power of regexps, but again, regexps are an understood DSL that is ubiquitous.

KOLANICH commented 2 years ago

> But there are no numbers showing performance degradation. Your suggestion might be called "premature optimization" -- optimizing without knowing the impact of the current solution.

To measure performance degradation one has to implement the feature first.

> People will stop using FoxyProxy if they have to learn a DSL first before using it.

FoxyProxy's target audience is power users. Non-power users usually don't use addons at all.

> The DSL you recommend is not user-friendly.

Regexps are even less user-friendly. In any case, it is proposed to keep the old methods for those who use them.

> The DSL you recommend is not user-friendly. ... but again it's an understood DSL that is ubiquitous.

I guess editing a serialized structure is even less user-friendly. The DSL is a compromise between repeating oneself (which is not user-friendly and is also bloat) and editing a JSON- or YAML-serialized structure, which is error-prone.

erosman commented 2 years ago

FoxyProxy is being updated to v8.0 in preparation for Manifest V3. The next update will support full & partial URL matches, similar to the original Firefox FoxyProxy v3 (and the current Chrome FoxyProxy), before the host limit came into effect in Firefox FoxyProxy v4.

ericjung commented 2 years ago

We will be keeping wildcards and regular expressions in v8.0 and in the future.