Closed BurntSushi closed 1 year ago
Out of curiosity, why not use something like the `wide` crate?
@itamarst a few reasons:

1. [...] `aho-corasick`.)
2. The `wide` crate does not provide the operations necessary for Teddy. The critical ops are shuffles and `palignr`.
3. The `wide` crate works via `safe_arch`, and that in turn only uses compile-time knowledge of what SIMD instructions are available. This is critically inappropriate in pretty much all cases except for when you own the compile step for all your users. It would mean, for example, that most users of ripgrep wouldn't benefit from SIMD optimizations such as Teddy. In this PR (and before), SIMD support is detected at runtime, regardless of what options you compile the crate with. (Now technically, NEON is part of `aarch64`, so `safe_arch` would be appropriate in that specific case, but this doesn't mitigate (1) and (2) above. And since (3) applies to `x86-64`, there's no real benefit to using `wide` even if this was the only concern.)
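To illustrate the compile-time versus runtime distinction in point (3), here is a minimal sketch of runtime SIMD dispatch in Rust. It is not code from this PR or from `aho-corasick`: the function names (`count_byte`, `count_byte_avx2`, `count_byte_fallback`) are made up for the example, and byte counting is used only because it is much shorter than Teddy. The parts that are real are `std::is_x86_feature_detected!`, the `#[target_feature]` attribute and the `std::arch::x86_64` intrinsics.

```rust
/// Counts occurrences of `needle`, picking an implementation at runtime.
/// (Hypothetical example function, not part of any crate.)
pub fn count_byte(haystack: &[u8], needle: u8) -> usize {
    #[cfg(target_arch = "x86_64")]
    {
        // Queried on the CPU that is actually running the program, so it
        // works no matter which `-C target-feature` flags were used to build.
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: we just verified that this CPU supports AVX2.
            return unsafe { count_byte_avx2(haystack, needle) };
        }
    }
    count_byte_fallback(haystack, needle)
}

/// AVX2 version. The attribute lets the compiler emit AVX2 instructions for
/// this one function even if the rest of the crate targets baseline x86-64.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn count_byte_avx2(haystack: &[u8], needle: u8) -> usize {
    use std::arch::x86_64::*;

    let splat = _mm256_set1_epi8(needle as i8);
    let mut count = 0usize;
    let mut chunks = haystack.chunks_exact(32);
    for chunk in &mut chunks {
        // Unaligned 32-byte load; `chunks_exact` guarantees the length.
        let v = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        // 0xFF in every lane equal to `needle`, 0x00 elsewhere.
        let eq = _mm256_cmpeq_epi8(v, splat);
        // One mask bit per lane, so the popcount is the number of matches.
        count += (_mm256_movemask_epi8(eq) as u32).count_ones() as usize;
    }
    // Handle the final partial chunk (fewer than 32 bytes) without SIMD.
    count + count_byte_fallback(chunks.remainder(), needle)
}

/// Portable scalar version, used when AVX2 is unavailable.
fn count_byte_fallback(haystack: &[u8], needle: u8) -> usize {
    haystack.iter().filter(|&&b| b == needle).count()
}
```

A binary built for plain `x86-64` still takes the AVX2 path on machines that have it, which is the property a purely compile-time approach cannot give you unless you control every user's build.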
Up until this point, Teddy was explicitly written using `x86-64` SIMD routines. Specifically, ones from SSSE3 and AVX2. This PR shuffles Teddy's main implementation into code that is generic over a new `Vector` trait, and provides implementations of that `Vector` trait for `x86-64`'s `__m128i` and `__m256i`, in addition to `aarch64`'s `uint8x16_t` vector type. In effect, this greatly speeds up searches for a small number of patterns automatically on `aarch64` (i.e., on Apple's new M1 and M2 chips).

An ad hoc ripgrep benchmark is worth a thousand words. On my M2 mac mini:
This PR also drops criterion in favor of `rebar` for benchmarking, which is specialized to the task of regex/substring searching. In that vein, we can look at top-level `AhoCorasick` benchmarks before and after:

Basically, there are 2-10x improvements across the board. These primarily apply to throughput where you expect matches to occur relatively rarely with respect to the size of the haystack.
For `x86_64`, there might be some small latency improvements. And there were a few tweaks to the various prefilter heuristics used. But one should generally expect comparable performance to what came before this PR. If you notice any meaningful regressions, please open a new issue with enough detail for me to reproduce the problem.

This PR also makes it possible for Teddy to be pretty easily ported to other vector types as well. I took a look at wasm and it's not obvious that it has the right routines to make it work, but I probably spent all of 10 minutes doing a quick skim. I'm not a wasm expert, so if anyone has a good handle on wasm32 SIMD, you might try your hand at implementing the `Vector` trait. (If you need help, please open an issue.)

(I do hope to get a new ripgrep release out soon with this improvement and an analogous improvement to `aarch64` in the `memchr` crate.)
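For anyone wondering what "generic over a `Vector` trait" looks like in practice, here is a rough sketch. It is a heavily simplified stand-in and not the trait from this PR: the method set (`splat`, `load`, `cmpeq`, `is_zero`), the `Native` alias and the `contains_byte` routine are all assumptions made for illustration, and the real trait needs more operations (shuffles and `palignr`-style shifts, as mentioned above). The scalar `u8` impl exists only so the sketch builds on targets with no SIMD impl.

```rust
/// Hypothetical, simplified stand-in for a `Vector` abstraction.
trait Vector: Copy {
    /// Number of byte lanes in the vector.
    const BYTES: usize;
    /// Broadcast one byte to every lane.
    fn splat(byte: u8) -> Self;
    /// Load the first `Self::BYTES` bytes of `chunk` (no alignment needed).
    fn load(chunk: &[u8]) -> Self;
    /// Lane-wise equality: 0xFF where equal, 0x00 elsewhere.
    fn cmpeq(self, other: Self) -> Self;
    /// True if every lane is zero.
    fn is_zero(self) -> bool;
}

#[cfg(target_arch = "x86_64")]
impl Vector for std::arch::x86_64::__m128i {
    const BYTES: usize = 16;
    // SAFETY (all blocks below): SSE2 is part of the x86-64 baseline.
    fn splat(byte: u8) -> Self {
        unsafe { std::arch::x86_64::_mm_set1_epi8(byte as i8) }
    }
    fn load(chunk: &[u8]) -> Self {
        assert!(chunk.len() >= Self::BYTES);
        // The assert above guarantees 16 readable bytes.
        unsafe { std::arch::x86_64::_mm_loadu_si128(chunk.as_ptr() as *const Self) }
    }
    fn cmpeq(self, other: Self) -> Self {
        unsafe { std::arch::x86_64::_mm_cmpeq_epi8(self, other) }
    }
    fn is_zero(self) -> bool {
        use std::arch::x86_64::*;
        // Compare against zero; all 16 mask bits set means every lane is 0.
        unsafe { _mm_movemask_epi8(_mm_cmpeq_epi8(self, _mm_setzero_si128())) == 0xFFFF }
    }
}

#[cfg(target_arch = "aarch64")]
impl Vector for std::arch::aarch64::uint8x16_t {
    const BYTES: usize = 16;
    // SAFETY (all blocks below): NEON is part of the aarch64 baseline.
    fn splat(byte: u8) -> Self {
        unsafe { std::arch::aarch64::vdupq_n_u8(byte) }
    }
    fn load(chunk: &[u8]) -> Self {
        assert!(chunk.len() >= Self::BYTES);
        unsafe { std::arch::aarch64::vld1q_u8(chunk.as_ptr()) }
    }
    fn cmpeq(self, other: Self) -> Self {
        unsafe { std::arch::aarch64::vceqq_u8(self, other) }
    }
    fn is_zero(self) -> bool {
        // The horizontal max is zero only if every lane is zero.
        unsafe { std::arch::aarch64::vmaxvq_u8(self) == 0 }
    }
}

/// One-lane scalar "vector", so the generic code runs anywhere.
impl Vector for u8 {
    const BYTES: usize = 1;
    fn splat(byte: u8) -> Self { byte }
    fn load(chunk: &[u8]) -> Self { chunk[0] }
    fn cmpeq(self, other: Self) -> Self { if self == other { 0xFF } else { 0 } }
    fn is_zero(self) -> bool { self == 0 }
}

/// The search routine is written once, against the trait only.
fn contains_byte<V: Vector>(haystack: &[u8], needle: u8) -> bool {
    let splat = V::splat(needle);
    let mut chunks = haystack.chunks_exact(V::BYTES);
    for chunk in &mut chunks {
        if !V::load(chunk).cmpeq(splat).is_zero() {
            return true;
        }
    }
    // Leftover bytes that do not fill a whole vector.
    chunks.remainder().contains(&needle)
}

#[cfg(target_arch = "x86_64")]
type Native = std::arch::x86_64::__m128i;
#[cfg(target_arch = "aarch64")]
type Native = std::arch::aarch64::uint8x16_t;
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
type Native = u8;

/// Entry point that uses whichever vector type the target provides.
pub fn contains(haystack: &[u8], needle: u8) -> bool {
    contains_byte::<Native>(haystack, needle)
}
```

In this shape, porting to a new target means writing one more `impl Vector for ...` block while `contains_byte` stays untouched. Whether wasm32 SIMD can supply every operation the real trait requires is exactly the open question above.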