Closed: BurntSushi closed this pull request 10 months ago.
Cool! Before replacing `bytecount` with `memchr_iter(..).count()`, please note that it also doesn't yet have any aarch64 optimization, so it's using usize-width code, with no NEON intrinsics for now. I'm having a hard time testing and benchmarking this stuff (I think my current draft PR fails on endianness or some such), as the only ARM CPU I have is in my phone, but I'll get an M2 on Friday, so stay tuned.
@llogiq Note that it's not just for `aarch64`. There are improvements on big haystacks for `x86_64` as well.
@llogiq If you want to run memchr's benchmarks with a specific focus on `bytecount`, this should do the trick (from the root of this repo). The `--test` command just runs a single iteration of each benchmark and is a good way to ensure everything is set up properly before collecting measurements.
```
$ cargo install --git https://github.com/BurntSushi/rebar rebar
$ rebar build -e '^rust/memchr/memchr/onlycount$' -e '^rust/bytecount/memchr/oneshot$'
rust/bytecount/memchr/oneshot: running: cd "benchmarks/./engines/rust-bytecount" && "cargo" "build" "--release"
rust/bytecount/memchr/oneshot: build complete for version 0.5.3
rust/memchr/memchr/onlycount: running: cd "benchmarks/./engines/rust-memchr" && "cargo" "build" "--release"
rust/memchr/memchr/onlycount: build complete for version 2.6.0
$ rebar measure -e '^rust/memchr/memchr/onlycount$' -e '^rust/bytecount/memchr/oneshot$' --test
[... snip ...]
$ rebar measure -e '^rust/memchr/memchr/onlycount$' -e '^rust/bytecount/memchr/oneshot$' | tee results.csv
[... snip ...]
$ rebar cmp results.csv
benchmark                          rust/bytecount/memchr/oneshot  rust/memchr/memchr/onlycount
---------                          -----------------------------  ----------------------------
memchr/sherlock/common/huge1       28.5 GB/s (2.08x)              59.3 GB/s (1.00x)
memchr/sherlock/common/small1      17.7 GB/s (1.25x)              22.1 GB/s (1.00x)
memchr/sherlock/common/tiny1       4.3 GB/s (1.00x)               3.8 GB/s (1.13x)
memchr/sherlock/never/huge1        28.5 GB/s (2.08x)              59.3 GB/s (1.00x)
memchr/sherlock/never/small1       17.7 GB/s (1.25x)              22.1 GB/s (1.00x)
memchr/sherlock/never/tiny1        4.3 GB/s (1.00x)               3.8 GB/s (1.13x)
memchr/sherlock/never/empty1       11.00ns (1.00x)                11.00ns (1.00x)
memchr/sherlock/rare/huge1         28.5 GB/s (2.08x)              59.3 GB/s (1.00x)
memchr/sherlock/rare/small1        17.7 GB/s (1.21x)              21.3 GB/s (1.00x)
memchr/sherlock/rare/tiny1         4.3 GB/s (1.00x)               3.8 GB/s (1.13x)
memchr/sherlock/uncommon/huge1     28.5 GB/s (2.08x)              59.3 GB/s (1.00x)
memchr/sherlock/uncommon/small1    17.7 GB/s (1.25x)              22.1 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1     4.3 GB/s (1.00x)               3.8 GB/s (1.13x)
memchr/sherlock/verycommon/huge1   28.5 GB/s (2.08x)              59.3 GB/s (1.00x)
memchr/sherlock/verycommon/small1  17.2 GB/s (1.29x)              22.1 GB/s (1.00x)
```
This PR doesn't just add `aarch64`-specific code; it refactors pretty much everything about how the code is organized. There are big perf wins for `aarch64` (see benchmark results below), and also latency improvements across the board. A brief summary of the changes in this PR:

* `aarch64` NEON vector implementations for `memchr`, `memrchr`, `memchr2`, `memrchr2`, `memchr3`, `memrchr3` and `memmem`. This should lead to massive speed improvements on an increasingly popular target, due in large part to Apple silicon.
* `wasm32` simd128 vector implementations for `memchr`, `memrchr`, `memchr2`, `memrchr2`, `memchr3` and `memrchr3`. (@alexcrichton previously contributed a vector implementation for `memmem` and that remains.)
* `x86_64` has no real additions other than the `memchr_iter(needle, haystack).count()` specialization. It already has SSE2 and AVX2 implementations of `memchr` (and friends) and `memmem`. It uses AVX2 automatically via runtime inspection of what the current CPU supports. There is no need to compile with the `avx2` feature enabled.
* An `arch` sub-module that exposes a lot of the internal routines (including target specific routines) used to implement `memchr` and `memmem`. This module is part of a major refactoring of how this crate is organized, and it seemed prudent to expose the internals since their APIs are pretty straightforward. That is, there isn't a huge API design space IMO. This module includes scalar substring search implementations of Shift-Or, Rabin-Karp and Two-Way.
* The `cfg` knobs that the previous version of `memchr` set up in its build script have been removed. This in turn allowed me to remove the build script entirely. Given the ubiquity of this crate, this may lead to compile time improvements downstream. (Likely small in each individual case, but perhaps large in aggregate.) I can't promise that a build script will never re-appear, but I'll try to resist adding one in the future if possible.
* `time rebar build -e '^rust/memchrold/memmem/prebuilt$'` reports 0.944 seconds on my system, while a fresh `time rebar build -e '^rust/memchr/memmem/prebuilt$'` reports 1.164 seconds. This is on `x86_64`, where no real additional code was added. This could be because of the "nicer" abstractions now present in the `arch` sub-module, or perhaps how the internals are structured. (Previously there were multiple monomorphic implementations of `memchr`, for example, and now there is a single generic implementation that is monomorphized automatically by the compiler via generics. Perhaps that is more expensive?)
* `memchr_iter(needle, haystack).count()` has been specialized to use a different vector implementation that specifically only counts matches instead of reporting the offsets of each match. This can make huge (potentially over an order of magnitude) differences when counting the number of matches of a frequently (even semi-frequently) occurring byte in a large haystack. This is effectively what the `bytecount` crate does (which is what ripgrep currently uses to compute line numbers for matches), but the marginal cost of adding it to the `memchr` crate was very low. So I did. And I plan to move ripgrep to using `memchr_iter(needle, haystack).count()`. (Also, the benchmarks below suggest that the counting implementation I wrote is faster than the one in `bytecount` in some cases which look like they'll be relevant for ripgrep. This was surprising to me.)
* An `alloc` feature which permits compiling this crate without the standard library but with the `alloc` crate. This crate is designed through-and-through to work in a core-only context, so this doesn't unlock much compared to just disabling the `std` feature. It adds a couple of APIs requiring allocation (like `memmem::Finder::into_owned`) and other things like `arch::all::shiftor`, which really wants an allocation to store its bit-parallel state machine.
* The `libc` feature is DEPRECATED and is now a no-op. I don't think there is any real benefit to it any more.
* A `logging` feature has been added. When enabled, this crate will emit a smattering of log messages. Usually these messages are used to indicate what kind of strategy is selected. For example, whether a vector or scalar algorithm is used for substring search.

### Selected benchmark differences for `x86_64`
Differences across the board from the status quo. Showing only measurements with a 1.2x (or greater) difference.
A comparison with the `sliceslice` crate for just substring search. We only include measurements with a 1.2x difference or greater.

Differences with the substring search implementation, `memmem`, as provided by GNU libc. Showing only measurements with a 2x difference or greater.

Differences with the `bytecount` crate, as `memchr_iter(needle, haystack).count()` is now specialized to its own vector implementation just for counting the number of matches (instead of reporting the offset of each match). The throughput improvements as compared to `bytecount` on large haystacks are the most interesting IMO. (I was somewhat surprised by this, as `bytecount` seems to do something clever while `memchr_iter(needle, haystack).count()` is basically just `memchr` but with the branching for reporting matches removed.) Either way, I expect this to translate directly to improvements in ripgrep, although I haven't measured that yet.

### Selected benchmark differences for `aarch64`
Differences across the board from the status quo. Note that here, I've only included measurements with a 4x difference from the old memchr crate. Otherwise, pretty much every benchmark shows a sizeable improvement from the old version. (Because previously, `aarch64` had no vector implementations at all.)

A comparison with the `sliceslice` crate, which has its own custom `aarch64` vector implementation of substring search. We only show measurements with a 1.2x or greater difference.

Differences with the substring search implementation, `memmem`, as provided by macOS's libc. Showing only measurements with a 2x difference or greater. This is what utter destruction looks like. (I'm not sure what's going on in benchmarks like `memmem/subtitles/rare/teeny-en-sherlock-holmes`. It's a tiny haystack, and macOS seems to measure either 1ns or 41ns. I wonder if there's something odd about timer precision on macOS? You can see the reverse happen in `memmem/subtitles/rare/teeny-zh-sherlock`.)

Differences with the `bytecount` crate, as `memchr_iter(needle, haystack).count()` is now specialized to its own vector implementation just for counting the number of matches (instead of reporting the offset of each match).

### Selected benchmark differences for `regex` on `aarch64`
This shows benchmark results before and after this change for the `regex` crate. We only show results with a difference of 1.2x or greater.

### Improvements to ripgrep on `aarch64`

In short, simple ripgrep searches (likely the most common kind) get about twice as fast on Apple silicon now.