BurntSushi / ripgrep

ripgrep recursively searches directories for a regex pattern while respecting your gitignore

use aho-corasick when -F and -x are set #2247

Open · BurntSushi opened 2 years ago

BurntSushi commented 2 years ago

Currently, when -x/--line-regexp and -F are given to ripgrep, Aho-Corasick won't be used because the -x flag turns each pattern into a regex via `(?m)^(?:pattern)$`. This in turn causes the Aho-Corasick optimization to be defeated. In cases where the number of patterns is very large, using the regex engine for this is much, much slower than Aho-Corasick.
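
For concreteness, here is a minimal sketch of the two strategies, assuming the `regex` and `aho-corasick` (1.x) crates. It is illustrative only and not ripgrep's code; the patterns are made up.

```rust
use aho_corasick::AhoCorasick;

fn main() {
    let patterns = ["foo", "bar baz"];

    // What happens today: `-x` wraps each fixed string in line anchors,
    // producing regexes like `(?m)^(?:bar baz)$`, which defeats the
    // Aho-Corasick fast path.
    let wrapped: Vec<String> = patterns
        .iter()
        .map(|p| format!(r"(?m)^(?:{})$", regex::escape(p)))
        .collect();
    let re = regex::RegexSet::new(&wrapped).unwrap();
    assert!(re.is_match("bar baz"));

    // The idea behind this issue: search with Aho-Corasick and accept a
    // match only if it spans the entire line. An overlapping search is used
    // so that a shorter pattern can't shadow a whole-line match.
    let ac = AhoCorasick::new(patterns).unwrap();
    let line = "bar baz";
    let whole_line = ac
        .find_overlapping_iter(line)
        .any(|m| m.start() == 0 && m.end() == line.len());
    assert!(whole_line);
}
```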

See a discussion on this that motivated this ticket:

Discussed in https://github.com/BurntSushi/ripgrep/discussions/2244

Originally posted by **T145** on June 24, 2022:

I'm aware that ripgrep has memory limits set that [can cause problems](https://github.com/BurntSushi/ripgrep/issues/362#issuecomment-355848324), and I ran into them while trying to perform the following operation:

```
rg -NFxvf source.txt target.txt
```

where the source is `~8MB` and the target is `~168MB`. What memory settings should I use to handle this workload?

BurntSushi commented 2 years ago

I looked into this briefly, but the existing Aho-Corasick optimization is a giant hack. Ideally this would be automatically handled by the regex engine. But if I can't make that work, then we should try to make the optimization in ripgrep a bit more robust.

BurntSushi commented 11 months ago

I took another quick peek at this, and I think the right way to go about it would be to add an optimization in the meta regex engine in regex-automata that specifically looks for patterns of the form `<look-around-assertion>(alternation|of|literals)<look-around-assertion>`, and then builds a prefilter from the alternation of literals. This has the benefit of working for both the -x and -w flags.

This wasn't possible before because the old regex engine didn't know how to resolve look-around assertions after a prefilter match.
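
A rough sketch of that idea, using the `regex` and `aho-corasick` crates here instead of regex-automata's internals (the pattern, literals, and haystack are made up): the literal prefilter reports candidates, and the full regex engine then verifies the surrounding assertions.

```rust
use aho_corasick::AhoCorasick;
use regex::Regex;

fn main() {
    // The `-x` form of the pattern: an alternation of literals wrapped in
    // line anchors (look-around assertions).
    let re = Regex::new(r"(?m)^(?:foo|bar)$").unwrap();
    // The prefilter built from just the alternation of literals.
    let prefilter = AhoCorasick::new(["foo", "bar"]).unwrap();

    let haystack = "xfoo\nbar\nfoobar\n";

    // Step 1: the prefilter cheaply reports candidate positions.
    // Step 2: the full regex engine verifies the anchors at each candidate.
    let mut confirmed = Vec::new();
    for cand in prefilter.find_iter(haystack) {
        let line_start = haystack[..cand.start()].rfind('\n').map_or(0, |i| i + 1);
        if let Some(m) = re.find_at(haystack, line_start) {
            if m.start() == line_start && !confirmed.contains(&m.range()) {
                confirmed.push(m.range());
            }
        }
    }
    // Only the line that is exactly "bar" satisfies both anchors.
    assert_eq!(confirmed, vec![5..8]);
}
```

Because the confirmation step is just the regex engine resolving whatever assertions surround the literals, the same machinery would cover both the -x (line anchors) and -w (word boundaries) cases.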

One alternative that would be worth exploring is to see whether the existing literal extraction can be augmented for this case. It somewhat intentionally does not handle this case because it tries to keep its literal sets small, since large literal sets tend to be counter-productive.