Closed: KOLANICH closed this issue 2 years ago.
Thanks for the suggestion. Is this something you're planning on implementing?
My example is currently a `.pac` file, but I guess it may make sense to use it here.

I am curious to see your implementation for a PAC file if it handles regular expressions. I don’t understand how a trie is going to work with regular expressions that can match any number of domains and/or subdomains.
It doesn't handle regular expressions; it is just a pac file.
For regexps there is a tool, https://github.com/ermanh/trieregex , but it only generates regexes and it is in Python. If you would like to rewrite it, it may make sense to rewrite it in Haxe, not in JS directly.
How do you propose to handle regular expressions in a trie?
At first, not to handle them at all, except for wildcard ones (my example handles wildcards through a special node; see the sketch below). In the future it may be possible to parse regexps and integrate them into the automaton, but not now, it is a large amount of work. For now it is desirable to achieve just the simplest case. I have a list of domains that must be accessed through a proxy because otherwise there is no access to them. Not regexps, but a pretty long list of domains, line by line.
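For illustration only, a minimal sketch of how such a PAC-style lookup with a special wildcard node could look; the trie contents, the `"*"`/`"$"` markers, and the proxy address are all made up, not taken from the actual example:

```js
// Hand-written reversed-domain trie; "*" is the special wildcard node
// ("anything below this point"), "$" marks an exact-match terminal.
var trie = {
  "com": {
    "github": { "$": true },  // matches github.com exactly
    "example": { "*": true }  // matches example.com and all its subdomains
  }
};

function FindProxyForURL(url, host) {
  var labels = host.split(".").reverse(); // "api.example.com" -> ["com","example","api"]
  var node = trie;
  for (var i = 0; i < labels.length; i++) {
    if (node["*"]) return "PROXY proxy.example.net:3128"; // wildcard swallows the rest
    node = node[labels[i]];
    if (!node) return "DIRECT";
  }
  return (node["$"] || node["*"]) ? "PROXY proxy.example.net:3128" : "DIRECT";
}
```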
For reference only: tld service for webextensions
I think you are suggesting two pattern matching paths; one for regular expressions and one for literals with no wildcards/regular expressions. Is that right?
One for regular expressions and another one for flat lists of domains without wildcards in the middle (since there is no backtracking), but wildcards at the end are possible. To be honest it is possible to unify them, but I don't think it is needed. In fact I don't think that the regexp path is needed at all. Regexps are slow, and applying them one by one is inefficient if you have a lot of them. Their only advantage is that they look familiar. What people really need is structured matching against a parsed URI, not against a URI as a text string. So we need a restricted DSL that will be transformed into a state machine.
The human-readable template DSL can look like a wildcard pattern for a domain name with its components inverted: `github.com` will look like `com.github`, without a scheme. It is a sequence of tokens separated by dots.
- `abcd` matches the string `abcd` literally.
- `/rx/` matches strings matching the regexp.
- `$` means a non-wildcard match; allowed only at the end.
- `%<number>` borrows the first n tokens from the last full pattern; allowed only at the beginning. It enables stuff like the following (see the expansion sketch after this list):

```
io.github.a
%2.b
%2.c
```
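To make the `%<number>` shorthand concrete, here is an illustrative expander; the function name and the assumption that `%n` refers to the previous *expanded* pattern are mine:

```js
// Illustrative only: expand the proposed "%<n>" shorthand into full patterns.
function expandPatterns(lines) {
  var previous = [];
  return lines.map(function (line) {
    var tokens = line.split(".");
    var m = tokens[0].match(/^%(\d+)$/);
    if (m) {
      // borrow the first n tokens from the last fully expanded pattern
      tokens = previous.slice(0, Number(m[1])).concat(tokens.slice(1));
    }
    previous = tokens;
    return tokens.join(".");
  });
}

expandPatterns(["io.github.a", "%2.b", "%2.c"]);
// -> ["io.github.a", "io.github.b", "io.github.c"]
```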
The DSL looks somewhat similar to the tree in the pac file example and can be easily transformed into a trie and back.
Another ingested format is just a flat list of domains. To convert it into the DSL, each domain name is parsed and the order of its components is inverted; then the leading `%<digit>` prefixes are easily inferred (see the sketch below).
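A sketch of that conversion; the factoring rule used here (only shared prefixes of at least two tokens become `%<n>`) is an assumption of mine:

```js
// Illustrative sketch: turn a flat domain list into the proposed DSL,
// inverting label order and factoring shared prefixes into "%<n>".
function domainsToDsl(domains) {
  var previous = [];
  return domains.map(function (domain) {
    var tokens = domain.split(".").reverse(); // "a.github.io" -> ["io","github","a"]
    var shared = 0;
    while (shared < previous.length && shared < tokens.length - 1 &&
           previous[shared] === tokens[shared]) {
      shared++;
    }
    previous = tokens;
    return shared > 1
      ? ["%" + shared].concat(tokens.slice(shared)).join(".")
      : tokens.join(".");
  });
}

domainsToDsl(["a.github.io", "b.github.io", "c.github.io"]);
// -> ["io.github.a", "%2.b", "%2.c"]
```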
The DSL you recommend is not user-friendly. People will stop using FoxyProxy if they have to learn a DSL first before using it. As far as I can see, this entire suggestion is about a performance improvement. But there are no numbers showing performance degradation. I am sure that as pattern lists get long, performance is impacted in some way, but it's anyone's guess how much. Your suggestion might be called "premature optimization" -- optimizing without knowing the impact of the current solution.
Of course the impact depends on many things like the complexity of regular expressions or wildcards used. I do agree most people don't need the power of reg exp, but again it's an understood DSL that is ubiquitous.
> But there are no numbers showing performance degradation. Your suggestion might be called "premature optimization" -- optimizing without knowing the impact of the current solution.
To measure performance degradation one has to implement the feature first.
> People will stop using FoxyProxy if they have to learn a DSL first before using it.
FoxyProxy's target audience is power users. Non-power users usually don't use addons at all.
> The DSL you recommend is not user-friendly.
Regexps are even less user-friendly. Anyway, it is proposed to keep the old methods for those using them.
> The DSL you recommend is not user-friendly. [...] but again it's an understood DSL that is ubiquitous.
I guess editing a serialized structure is even less user-friendly. The DSL is a compromise between repeating oneself (which is not user-friendly and is also bloat) and editing a JSON- or YAML-serialized structure, which is error-prone.
FoxyProxy is being updated for v8.0 in preparation for manifest v3. The next update will support full & partial URL matches, similar to original Firefox FoxyProxy v3 (and current Chrome FoxyProxy) before the host limit came into effect in Firefox FoxyProxy v4.
We will be keeping wildcard and regular expressions in v8.0 and the future.
Sometimes we have a list of domains to access through a proxy. So we generate a prefix tree, or even better a MAFSA (a DAG), from this list of domains in order to optimize lookup and reduce the memory footprint. Domains are split by dot, the resulting array is reversed (so the root label goes first), and then the result is fed into a library generating and serializing tries/MAFSAs. Then matching is easy: just do the same preprocessing and walk the DAG (a sketch follows below).
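A minimal sketch of the proposed preprocessing and lookup, using a plain trie; a real implementation would additionally minimize it into a MAFSA/DAWG and serialize it via a library:

```js
// Sketch: build a reversed-domain trie from a flat list and walk it to match.
function buildTrie(domains) {
  var root = {};
  domains.forEach(function (domain) {
    var node = root;
    domain.split(".").reverse().forEach(function (label) {
      node = node[label] || (node[label] = {});
    });
    node["$"] = true; // end-of-domain marker
  });
  return root;
}

function matches(trie, host) {
  var node = trie;
  var labels = host.split(".").reverse();
  for (var i = 0; i < labels.length; i++) {
    node = node[labels[i]];
    if (!node) return false;
  }
  return node["$"] === true;
}

var trie = buildTrie(["github.com", "gitlab.com", "a.github.io"]);
matches(trie, "github.com"); // true
matches(trie, "github.io");  // false (only a.github.io is listed)
```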
It should be useful when there are a lot of domains to be checked against, not just two.
The DAG can also be easily constructed by hand using YAML syntax.
If needed, a trie can be easily transformed into a regular expression doing the same matching, which can work even faster because regexps can be JITted (see the sketch below).
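For illustration, a sketch of that transformation in the spirit of trieregex, collapsing the trie built above into a single alternation-based regexp; matching is still done against the reversed, dot-joined host:

```js
// Illustrative: collapse a reversed-domain trie into one regexp.
function trieToRegex(node) {
  var keys = Object.keys(node).filter(function (k) { return k !== "$"; });
  var branches = keys.map(function (key) {
    var child = node[key];
    var tail = trieToRegex(child);
    if (!tail) return key;                            // leaf label
    return child["$"] ? key + "(?:\\." + tail + ")?"  // match may also stop here
                      : key + "\\." + tail;
  });
  if (branches.length === 0) return "";
  return branches.length === 1 ? branches[0] : "(?:" + branches.join("|") + ")";
}

var re = new RegExp("^" + trieToRegex(trie) + "$");
re.test("github.com".split(".").reverse().join(".")); // true
```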
Also the extension should have a feature to retrieve the list by URI and automatically update it.
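A hedged sketch of how such auto-updating could look in a WebExtension background script (requires the "alarms" and "storage" permissions); the list URL, alarm name, and refresh interval are all hypothetical, and `buildTrie` is the sketch from above:

```js
// Hypothetical source of the line-by-line domain list.
const LIST_URL = "https://example.org/proxied-domains.txt";

async function refreshDomainList() {
  const response = await fetch(LIST_URL);
  const domains = (await response.text()).split("\n").filter(Boolean);
  // Rebuild the trie and persist it for the proxy handler to use.
  await browser.storage.local.set({ trie: buildTrie(domains) });
}

browser.alarms.create("refresh-domain-list", { periodInMinutes: 60 });
browser.alarms.onAlarm.addListener(function (alarm) {
  if (alarm.name === "refresh-domain-list") refreshDomainList();
});
```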