BurntSushi / aho-corasick

A fast implementation of Aho-Corasick in Rust.

add aarch64 SIMD implementation of Teddy #129

Closed · BurntSushi closed this 1 year ago

BurntSushi commented 1 year ago

Up until this point, Teddy was explicitly written using x86-64 SIMD routines, specifically ones from SSSE3 and AVX2. This PR shuffles Teddy's main implementation into code that is generic over a new Vector trait, and provides implementations of that Vector trait for x86-64's __m128i and __m256i, in addition to aarch64's uint8x16_t vector type. In effect, this greatly speeds up searches for a small number of patterns automatically on aarch64 (i.e., on Apple's M1 and M2 chips).
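
To make the generic design concrete, here is a rough sketch of the shape such a Vector trait might take, along with a 128-bit x86-64 implementation. The trait name matches the PR, but the method set and signatures are illustrative, not the crate's actual internals:

```rust
// A rough sketch of a generic SIMD vector abstraction in the spirit of
// this PR. Method names and signatures are illustrative only.
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

/// A fixed-width SIMD vector of 8-bit lanes.
///
/// # Safety
///
/// An implementation may only be used when the corresponding CPU
/// feature (e.g. SSSE3, AVX2 or NEON) is known to be available.
unsafe trait Vector: Copy {
    /// The number of 8-bit lanes in this vector.
    const BYTES: usize;

    /// Create a vector with all lanes set to `byte`.
    unsafe fn splat(byte: u8) -> Self;

    /// Load a vector from `data`, which must be valid for reads of at
    /// least `Self::BYTES` bytes.
    unsafe fn load_unaligned(data: *const u8) -> Self;

    /// Lane-wise equality, yielding 0xFF or 0x00 in each lane.
    unsafe fn cmpeq(self, other: Self) -> Self;

    /// Lane-wise bitwise AND.
    unsafe fn and(self, other: Self) -> Self;

    /// Lane-wise table lookup (pshufb on x86-64, tbl on NEON).
    unsafe fn shuffle_bytes(self, indices: Self) -> Self;
}

// One possible implementation for the 128-bit x86-64 vector type.
#[cfg(target_arch = "x86_64")]
unsafe impl Vector for __m128i {
    const BYTES: usize = 16;

    #[inline(always)]
    unsafe fn splat(byte: u8) -> Self {
        _mm_set1_epi8(byte as i8)
    }

    #[inline(always)]
    unsafe fn load_unaligned(data: *const u8) -> Self {
        _mm_loadu_si128(data as *const __m128i)
    }

    #[inline(always)]
    unsafe fn cmpeq(self, other: Self) -> Self {
        _mm_cmpeq_epi8(self, other)
    }

    #[inline(always)]
    unsafe fn and(self, other: Self) -> Self {
        _mm_and_si128(self, other)
    }

    #[inline(always)]
    unsafe fn shuffle_bytes(self, indices: Self) -> Self {
        _mm_shuffle_epi8(self, indices)
    }
}
```

With that trait in hand, the same generic Teddy search code can be instantiated for __m256i (AVX2) and uint8x16_t (NEON) by writing the analogous intrinsic-backed impls.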

An ad hoc ripgrep benchmark is worth a thousand words. On my M2 mac mini:

$ time rg-before-teddy-aarch64 -i -c 'Sherlock Holmes' OpenSubtitles2018.half.en
3055

real    8.196
user    7.726
sys     0.469
maxmem  5728 MB
faults  17

$ time rg-after-teddy-aarch64 -i -c 'Sherlock Holmes' OpenSubtitles2018.half.en
3055

real    1.127
user    0.701
sys     0.425
maxmem  4880 MB
faults  13

This PR also drops criterion in favor of rebar for benchmarking; rebar is specialized to the task of benchmarking regex and substring search. In that vein, we can look at top-level AhoCorasick benchmarks before and after:

benchmark                      rust/old-aho-corasick/default/leftmost-first  rust/aho-corasick/default/leftmost-first
---------                      --------------------------------------------  ----------------------------------------
curated/sherlock-en            23.7 GB/s (1.00x)                             23.8 GB/s (1.00x)
curated/sherlock-casei-en      659.7 MB/s (12.21x)                           7.9 GB/s (1.00x)
curated/sherlock-ru            24.3 GB/s (1.00x)                             24.3 GB/s (1.00x)
curated/sherlock-casei-ru      4.5 GB/s (1.43x)                              6.5 GB/s (1.00x)
curated/sherlock-zh            28.9 GB/s (1.00x)                             28.9 GB/s (1.00x)
curated/alt-sherlock-en        659.7 MB/s (12.05x)                           7.8 GB/s (1.00x)
curated/alt-sherlock-casei-en  640.0 MB/s (4.36x)                            2.7 GB/s (1.00x)
curated/alt-sherlock-ru        659.8 MB/s (7.92x)                            5.1 GB/s (1.00x)
curated/alt-sherlock-casei-ru  604.0 MB/s (2.48x)                            1497.8 MB/s (1.00x)
curated/alt-sherlock-zh        663.1 MB/s (13.47x)                           8.7 GB/s (1.00x)
curated/dictionary-15          172.7 MB/s (1.05x)                            181.1 MB/s (1.00x)
sherlock/name-alt1             29.4 GB/s (1.01x)                             29.8 GB/s (1.00x)
sherlock/name-alt2             10.7 GB/s (1.00x)                             9.8 GB/s (1.08x)
sherlock/name-alt3             652.7 MB/s (10.47x)                           6.7 GB/s (1.00x)
sherlock/name-alt4             11.1 GB/s (1.00x)                             10.8 GB/s (1.03x)
sherlock/name-alt5             6.9 GB/s (1.01x)                              7.0 GB/s (1.00x)
sherlock/name-alt6             83.1 GB/s (1.00x)                             83.1 GB/s (1.00x)
sherlock/name-alt7             46.3 GB/s (1.00x)                             46.3 GB/s (1.00x)
sherlock/name-nocase1          637.2 MB/s (2.55x)                            1623.7 MB/s (1.00x)
sherlock/name-nocase2          643.7 MB/s (9.10x)                            5.7 GB/s (1.00x)
sherlock/name-nocase3          641.9 MB/s (3.21x)                            2.0 GB/s (1.00x)
sherlock/words5000             145.8 MB/s (1.01x)                            147.7 MB/s (1.00x)

Basically, there are 2x-13x improvements across the benchmarks where Teddy kicks in, chiefly the case-insensitive and multi-pattern alternation searches. These improvements primarily apply to throughput, i.e., cases where matches are expected to occur relatively rarely with respect to the size of the haystack.
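
For a sense of the kind of search that benefits, here is a small, self-contained example using the crate's public API; the pattern set and haystack are made up, but a case-insensitive search for a handful of literal patterns like this is exactly where Teddy applies:

```rust
use aho_corasick::{AhoCorasick, MatchKind};

fn main() {
    // A made-up pattern set. A small number of literal patterns,
    // searched case-insensitively, is the sweet spot for Teddy.
    let patterns = &["Sherlock", "Holmes", "Watson"];
    let haystack = "From the adventures of sherlock holmes.";

    let ac = AhoCorasick::builder()
        .match_kind(MatchKind::LeftmostFirst)
        .ascii_case_insensitive(true)
        .build(patterns)
        .unwrap();

    for m in ac.find_iter(haystack) {
        println!(
            "pattern {} at {}..{}",
            m.pattern().as_usize(),
            m.start(),
            m.end(),
        );
    }
}
```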

For x86-64, there might be some small latency improvements, and there were a few tweaks to the various prefilter heuristics used. But one should generally expect performance comparable to what came before this PR. If you notice any meaningful regressions, please open a new issue with enough detail for me to reproduce the problem.

This PR also makes it fairly easy to port Teddy to other vector types. I took a look at wasm, and it's not obvious that it has the right routines to make Teddy work, but I spent all of 10 minutes on a quick skim. I'm not a wasm expert, so if anyone has a good handle on wasm32 SIMD, you might try your hand at implementing the Vector trait; a speculative sketch follows below. (If you need help, please open an issue.)
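
For anyone curious what that might entail, here is a speculative and incomplete sketch against the simd128 intrinsics in core::arch::wasm32, assuming a Vector-like abstraction as above. The gap is exactly what makes it non-obvious: simd128 has a byte swizzle, but no direct analog of palignr:

```rust
// Speculative and incomplete; whether this is sufficient (and fast
// enough) for Teddy is an open question, per the discussion above.
#[cfg(target_arch = "wasm32")]
mod wasm_sketch {
    use core::arch::wasm32::*;

    pub fn splat(byte: u8) -> v128 {
        u8x16_splat(byte)
    }

    pub fn and(a: v128, b: v128) -> v128 {
        v128_and(a, b)
    }

    pub fn shuffle_bytes(a: v128, indices: v128) -> v128 {
        // Lane-wise table lookup. Out-of-range indices yield zero,
        // which is similar in spirit to pshufb's high-bit behavior.
        u8x16_swizzle(a, indices)
    }

    // A palignr-style "concatenate two vectors and extract a shifted
    // window" op has no direct simd128 equivalent; it would need to be
    // emulated, e.g. with shifts and ors, at some unknown cost.
}
```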

(I do hope to get a new ripgrep release out soon with this improvement and an analogous improvement to aarch64 in the memchr crate.)

itamarst commented 1 year ago

Out of curiosity, why not use something like the wide crate?

BurntSushi commented 1 year ago

@itamarst a few reasons:

  1. I don't take dependencies lightly. It has to be extremely compelling. (My standard here is, very intentionally, impossibly high for core crates like aho-corasick.)
  2. From a quick glance, the wide crate does not provide the operations necessary for Teddy. The critical ops are shuffles and palignr.
  3. The wide crate works via safe_arch, and safe_arch in turn only uses compile-time knowledge of which SIMD instructions are available. That is inappropriate in pretty much every case except when you own the compile step for all of your users. It would mean, for example, that most users of ripgrep wouldn't benefit from SIMD optimizations such as Teddy. In this PR (and before it), SIMD support is detected at runtime, regardless of which options you compile the crate with; see the sketch after this list. (Now technically, NEON is part of the aarch64 baseline, so safe_arch would be appropriate in that specific case, but that doesn't mitigate (1) and (2) above. And since (3) applies to x86-64, there's no real benefit to using wide even if this were the only concern.)
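
To make point (3) concrete, here is a minimal sketch of runtime feature detection in Rust. The searcher_kind function and its labels are made up for illustration; they are not aho-corasick's internals:

```rust
// Hypothetical: picks a searcher label based on what the CPU actually
// supports at runtime, not on what the binary was compiled with.
#[cfg(target_arch = "x86_64")]
fn searcher_kind() -> &'static str {
    // These macros query the CPU at runtime, so a binary compiled on
    // one machine still uses AVX2 when run on a machine that has it.
    if std::is_x86_feature_detected!("avx2") {
        "teddy (AVX2)"
    } else if std::is_x86_feature_detected!("ssse3") {
        "teddy (SSSE3)"
    } else {
        "non-SIMD fallback"
    }
}

#[cfg(target_arch = "aarch64")]
fn searcher_kind() -> &'static str {
    // NEON is part of the aarch64 baseline, so no runtime check is
    // needed here; this is the one case where compile-time knowledge
    // suffices.
    "teddy (NEON)"
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn searcher_kind() -> &'static str {
    "non-SIMD fallback"
}

fn main() {
    println!("using: {}", searcher_kind());
}
```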