Open BurntSushi opened 2 years ago
I looked into this briefly, but the existing Aho-Corasick optimization is a giant hack. Ideally this would be automatically handled by the regex engine. But if I can't make that work, then we should try to make the optimization in ripgrep a bit more robust.
I took another quick peek at this and I think the right way to go about this would be to add an optimization in the meta regex engine in regex-automata that specifically looks for patterns like <look-around-assertion>(alternation|of|literals)<look-around-assertion>, and then creates a prefilter for the alternation of literals. This has the benefit of working for both the -x and -w flags.
This wasn't possible before because the old regex engine didn't know how to resolve look-around assertions after a prefilter match.
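To make the idea concrete, here is a minimal std-only sketch of "prefilter on the literals, then resolve the look-around assertions at each candidate". It is not the regex-automata implementation. The names `Boundary`, `boundaries_hold` and `find_with_prefilter` are hypothetical, the naive substring scan merely stands in for Aho-Corasick, and the word case is approximated with an ASCII-only `\b` check, which may not be exactly what -w requires.

```rust
/// Which look-around assertions surround the alternation of literals.
/// (Hypothetical type, for illustration only.)
#[derive(Clone, Copy)]
enum Boundary {
    /// `(?m)^ ... $`, the shape produced by -x/--line-regexp.
    Line,
    /// `\b ... \b`, an ASCII-only approximation of what -w needs.
    Word,
}

fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// True if an ASCII word boundary exists at offset `pos` in `hay`.
fn is_word_boundary(hay: &[u8], pos: usize) -> bool {
    let before = pos > 0 && is_word_byte(hay[pos - 1]);
    let after = pos < hay.len() && is_word_byte(hay[pos]);
    before != after
}

/// Resolve the assertions on both sides of the candidate span `start..end`.
fn boundaries_hold(hay: &[u8], start: usize, end: usize, kind: Boundary) -> bool {
    match kind {
        Boundary::Line => {
            let at_line_start = start == 0 || hay[start - 1] == b'\n';
            let at_line_end = end == hay.len() || hay[end] == b'\n';
            at_line_start && at_line_end
        }
        Boundary::Word => is_word_boundary(hay, start) && is_word_boundary(hay, end),
    }
}

/// Report every span where one of `literals` occurs and the surrounding
/// assertions hold. The inner scan is a naive stand-in for Aho-Corasick:
/// it produces candidates, and only candidates are checked for look-around.
fn find_with_prefilter(hay: &str, literals: &[&str], kind: Boundary) -> Vec<(usize, usize)> {
    let bytes = hay.as_bytes();
    let mut matches = Vec::new();
    for start in 0..bytes.len() {
        for lit in literals {
            let lit = lit.as_bytes();
            if lit.is_empty() || !bytes[start..].starts_with(lit) {
                continue;
            }
            let end = start + lit.len();
            if boundaries_hold(bytes, start, end, kind) {
                matches.push((start, end));
            }
        }
    }
    matches
}
```

The point of the sketch is that the expensive part (finding any of many literals) never needs to know about the assertions. They are only checked at the handful of offsets the literal search reports, which is why the same approach would cover both -x and -w.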
One alternative that would be worth exploring is to see whether the existing literal extraction can be augmented for this case. It somewhat intentionally does not handle this case because it tries to keep its literal sets small, since large literal sets tend to be counter-productive.
Currently, when -x/--line-regexp and -F are given to ripgrep, Aho-Corasick won't be used because the -x flag turns each pattern into a regex via (?m)^(?:pattern)$. This in turn defeats the Aho-Corasick optimization. In cases where the number of patterns is very large, using the regex engine for this is much, much slower than Aho-Corasick. See a discussion on this that motivated this ticket:
Discussed in https://github.com/BurntSushi/ripgrep/discussions/2244
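As a rough illustration of the wrapping described above (not ripgrep's actual code), the sketch below builds the (?m)^(?:pattern)$ form for a fixed string, and then shows that for -F patterns the wrapped regex asks the same question as plain whole-line equality. `escape_literal`, `line_regexp` and `matching_lines` are hypothetical helpers. The escaping routine is hand-rolled here so the example stays std-only, and `str::lines` also strips a trailing `\r`, which may differ from ripgrep's line handling.

```rust
use std::collections::HashSet;

/// Escape regex meta characters so a fixed-string pattern matches literally.
/// (Hand-rolled, hypothetical helper; the exact character set is an assumption.)
fn escape_literal(pat: &str) -> String {
    let mut out = String::with_capacity(pat.len());
    for c in pat.chars() {
        if r"\.+*?()|[]{}^$#&-~".contains(c) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Wrap one fixed-string pattern the way the -x flag is described to do it.
fn line_regexp(pat: &str) -> String {
    format!("(?m)^(?:{})$", escape_literal(pat))
}

/// For fixed strings, "line matches (?m)^(?:pattern)$" is equivalent to
/// "line equals pattern", so a set lookup per line gives the same answer.
fn matching_lines<'h>(hay: &'h str, patterns: &[&str]) -> Vec<&'h str> {
    let set: HashSet<&str> = patterns.iter().copied().collect();
    hay.lines().filter(|line| set.contains(*line)).collect()
}
```

Once a pattern has been wrapped, it no longer looks like a plain literal to whatever decides between the regex engine and Aho-Corasick, even though the underlying question is still "does this line equal one of N strings".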