BurntSushi / memchr

Optimized string search routines for Rust.
The Unlicense

Faster memchr, memchr2 and memchr3 in generic version #151

Closed: stepancheg closed this 3 months ago

stepancheg commented 3 months ago

The current generic ("all") implementation checks whether a chunk (a usize) contains a zero byte, and if it does, iterates over the bytes of that chunk to find the index of the zero byte. Instead, we can use a few more bit operations to find the index without a loop.
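
A minimal sketch of the idea, for illustration only (this is the general SWAR technique, not the exact code in this PR, and it assumes a little-endian target):

// 0x0101...01 and 0x8080...80 patterns for the native word size.
const LO: usize = usize::MAX / 0xFF;
const HI: usize = LO << 7;

/// Returns the index of the first byte in `chunk` equal to `needle`,
/// using only bit operations (no per-byte loop).
fn find_in_chunk(chunk: usize, needle: u8) -> Option<usize> {
    // XOR turns every byte that matches `needle` into 0x00.
    let x = chunk ^ (LO * needle as usize);
    // Classic "has a zero byte" trick: this sets the high bit of every
    // zero byte of `x`, and never of a byte below the first zero byte.
    let found = x.wrapping_sub(LO) & !x & HI;
    if found == 0 {
        None
    } else {
        // On little endian, the lowest set high bit belongs to the first
        // matching byte, so trailing_zeros / 8 is its index.
        Some((found.trailing_zeros() / 8) as usize)
    }
}

// Example (64-bit target): the 'l' in "Sherlock" is at index 4.
// assert_eq!(find_in_chunk(usize::from_ne_bytes(*b"Sherlock"), b'l'), Some(4));

The cost per chunk is constant no matter where the match is, which is the point of replacing the byte loop.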

Context: we use memchr, but many of our strings are short. Currently, the SIMD-optimized memchr processes bytes one by one when the string is shorter than a SIMD register. I suspect it can be made faster if we process the part that does not fit into a SIMD register a usize at a time with a utility like this, similarly to how the AVX2 implementation falls back to SSE2. So I looked for a generic implementation to reuse in the SIMD-optimized version, found none, and here it is.

Any suggestion for how to check whether this PR makes it faster? In any case, it should not be slower.

BurntSushi commented 3 months ago

Any suggestion for how to check whether this PR makes it faster? In any case, it should not be slower.

You'll want to use rebar. From the root of this repo on master, build the fallback benchmark engine and get a baseline:

$ rebar build -e '^rust/memchr/memchr/fallback$'
$ rebar measure -e '^rust/memchr/memchr/fallback$' | tee before.csv

Then check out your PR, rebuild the benchmark engine and measure your change:

$ rebar build -e '^rust/memchr/memchr/fallback$'
$ rebar measure -e '^rust/memchr/memchr/fallback$' | tee after.csv

Then you can diff the results:

$ rebar diff before.csv after.csv
benchmark                          engine                       before.csv           after.csv
---------                          ------                       ----------           ---------
memchr/sherlock/common/huge1       rust/memchr/memchr/fallback  1189.8 MB/s (1.96x)  2.3 GB/s (1.00x)
memchr/sherlock/common/small1      rust/memchr/memchr/fallback  4.4 GB/s (1.00x)     4.0 GB/s (1.09x)
memchr/sherlock/common/tiny1       rust/memchr/memchr/fallback  1880.1 MB/s (1.00x)  1462.3 MB/s (1.29x)
memchr/sherlock/never/huge1        rust/memchr/memchr/fallback  24.3 GB/s (1.06x)    25.8 GB/s (1.00x)
memchr/sherlock/never/small1       rust/memchr/memchr/fallback  18.7 GB/s (1.00x)    18.2 GB/s (1.03x)
memchr/sherlock/never/tiny1        rust/memchr/memchr/fallback  3.8 GB/s (1.00x)     3.8 GB/s (1.00x)
memchr/sherlock/never/empty1       rust/memchr/memchr/fallback  12.00ns (1.00x)      12.00ns (1.00x)
memchr/sherlock/rare/huge1         rust/memchr/memchr/fallback  24.5 GB/s (1.00x)    24.2 GB/s (1.01x)
memchr/sherlock/rare/small1        rust/memchr/memchr/fallback  17.7 GB/s (1.00x)    17.7 GB/s (1.00x)
memchr/sherlock/rare/tiny1         rust/memchr/memchr/fallback  3.6 GB/s (1.00x)     3.2 GB/s (1.11x)
memchr/sherlock/uncommon/huge1     rust/memchr/memchr/fallback  6.5 GB/s (1.94x)     12.6 GB/s (1.00x)
memchr/sherlock/uncommon/small1    rust/memchr/memchr/fallback  12.1 GB/s (1.00x)    11.7 GB/s (1.04x)
memchr/sherlock/uncommon/tiny1     rust/memchr/memchr/fallback  2.9 GB/s (1.00x)     2.3 GB/s (1.27x)
memchr/sherlock/verycommon/huge1   rust/memchr/memchr/fallback  695.0 MB/s (1.87x)   1300.7 MB/s (1.00x)
memchr/sherlock/verycommon/small1  rust/memchr/memchr/fallback  2.1 GB/s (1.00x)     1985.1 MB/s (1.09x)

Add a -t 1.1 to only show results with a 1.1x difference or bigger:

$ rebar diff before.csv after.csv -t 1.1
benchmark                         engine                       before.csv           after.csv
---------                         ------                       ----------           ---------
memchr/sherlock/common/huge1      rust/memchr/memchr/fallback  1189.8 MB/s (1.96x)  2.3 GB/s (1.00x)
memchr/sherlock/common/tiny1      rust/memchr/memchr/fallback  1880.1 MB/s (1.00x)  1462.3 MB/s (1.29x)
memchr/sherlock/rare/tiny1        rust/memchr/memchr/fallback  3.6 GB/s (1.00x)     3.2 GB/s (1.11x)
memchr/sherlock/uncommon/huge1    rust/memchr/memchr/fallback  6.5 GB/s (1.94x)     12.6 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1    rust/memchr/memchr/fallback  2.9 GB/s (1.00x)     2.3 GB/s (1.27x)
memchr/sherlock/verycommon/huge1  rust/memchr/memchr/fallback  695.0 MB/s (1.87x)   1300.7 MB/s (1.00x)

So I see a big improvement in memchr/sherlock/common/huge1, memchr/sherlock/uncommon/huge1 and memchr/sherlock/verycommon/huge1, which I think makes sense, since those benchmarks test the case of "lots of matches" (to varying degrees) on a very large haystack.

But it looks like there's a smaller regression on memchr/sherlock/common/tiny1, memchr/sherlock/uncommon/tiny1 and memchr/sherlock/rare/tiny1. In these benchmarks, the haystack is quite small, so the benchmark tends to be dominated by constant factors (like setup costs). That might be worth looking into.

stepancheg commented 3 months ago

Here are my guesses.

I ran the benchmarks several times on my Apple M1 laptop, with binaries built properly for arm64.

benchmark                          engine                       before2.csv           after2.csv
---------                          ------                       -----------           ----------
memchr/sherlock/common/huge1       rust/memchr/memchr/fallback  1032.0 MB/s (1.67x)   1719.7 MB/s (1.00x)
memchr/sherlock/common/small1      rust/memchr/memchr/fallback  3.0 GB/s (1.25x)      3.7 GB/s (1.00x)
memchr/sherlock/uncommon/huge1     rust/memchr/memchr/fallback  5.0 GB/s (1.54x)      7.7 GB/s (1.00x)
memchr/sherlock/uncommon/small1    rust/memchr/memchr/fallback  7.5 GB/s (1.98x)      14.7 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1     rust/memchr/memchr/fallback  1605.0 MB/s (41.00x)  64.3 GB/s (1.00x)
memchr/sherlock/verycommon/huge1   rust/memchr/memchr/fallback  611.3 MB/s (1.62x)    988.0 MB/s (1.00x)
memchr/sherlock/verycommon/small1  rust/memchr/memchr/fallback  1895.9 MB/s (1.00x)   1518.6 MB/s (1.25x)

The runs show a consistent regression on this benchmark:

memchr/sherlock/verycommon/small1  rust/memchr/memchr/fallback  1895.9 MB/s (1.00x)   1518.6 MB/s (1.25x)

which makes sense: it searches for a space, and the text is:

Mr. Sherlock Holmes, who was usually very late in the mornings, save
   ^        ^       ^   ^   ^       ^    ^    ^  ^             ^

so the new code usually processes a full eight-byte chunk with shifts and masks instead of finding the character on roughly the fourth iteration of the old byte-by-byte loop.
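
A quick, illustrative way to check that intuition (not part of the PR):

fn main() {
    // The sample line above: spaces occur roughly every 6 bytes on average,
    // so a byte-at-a-time scan usually stops well before a full 8-byte chunk.
    let line = "Mr. Sherlock Holmes, who was usually very late in the mornings, save";
    let spaces = line.bytes().filter(|&b| b == b' ').count();
    println!("{} bytes, {} spaces", line.len(), spaces);
}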

About your results:

Mr. Sherlock Holmes, who was usually very late in the mornings, save
        ^      ^                 ^^       ^

stepancheg commented 3 months ago

Updated the comment above. I also updated the PR to do an unaligned read at the end of find/rfind instead of iterating byte-by-byte.
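
To illustrate what that means for the forward case, here is a sketch under a couple of assumptions (it reuses the hypothetical find_in_chunk helper sketched earlier and is not the PR's exact code): the main loop is assumed to have already scanned everything before pos in word-sized steps, and the haystack is assumed to be at least one word long.

/// Finish a forward search with one final, possibly overlapping, unaligned
/// word read instead of a byte-at-a-time tail loop.
fn find_tail(haystack: &[u8], needle: u8, pos: usize) -> Option<usize> {
    const USIZE_BYTES: usize = core::mem::size_of::<usize>();
    let len = haystack.len();
    debug_assert!(len >= USIZE_BYTES && pos <= len);
    if pos == len {
        return None;
    }
    // Re-read the last word of the haystack. It may overlap bytes that were
    // already scanned, but those are known not to contain the needle, so the
    // first match reported by find_in_chunk is always at or after `pos`.
    let start = len - USIZE_BYTES;
    let chunk = usize::from_ne_bytes(haystack[start..].try_into().unwrap());
    find_in_chunk(chunk, needle).map(|i| start + i)
}

The reverse case (rfind) is symmetric: read the first word of the haystack and take the last match in the chunk instead of the first.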

BurntSushi commented 3 months ago

With the updated PR, on x86-64:

$ rebar diff before.csv after.csv -t 1.1
benchmark                         engine                       before.csv           after.csv
---------                         ------                       ----------           ---------
memchr/sherlock/common/huge1      rust/memchr/memchr/fallback  1187.8 MB/s (1.99x)  2.3 GB/s (1.00x)
memchr/sherlock/common/tiny1      rust/memchr/memchr/fallback  1880.1 MB/s (1.00x)  1495.5 MB/s (1.26x)
memchr/sherlock/never/tiny1       rust/memchr/memchr/fallback  3.8 GB/s (1.21x)     4.6 GB/s (1.00x)
memchr/sherlock/never/empty1      rust/memchr/memchr/fallback  12.00ns (1.00x)      18.00ns (1.50x)
memchr/sherlock/rare/tiny1        rust/memchr/memchr/fallback  3.6 GB/s (1.20x)     4.3 GB/s (1.00x)
memchr/sherlock/uncommon/huge1    rust/memchr/memchr/fallback  6.5 GB/s (2.19x)     14.2 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1    rust/memchr/memchr/fallback  3.1 GB/s (1.00x)     2.4 GB/s (1.29x)
memchr/sherlock/verycommon/huge1  rust/memchr/memchr/fallback  691.6 MB/s (1.88x)   1302.9 MB/s (1.00x)

And on my M2 mac mini:

$ rebar diff before.csv after.csv -t 1.1
benchmark                          engine                       before.csv            after.csv
---------                          ------                       ----------            ---------
memchr/sherlock/common/huge1       rust/memchr/memchr/fallback  1170.6 MB/s (1.61x)   1887.8 MB/s (1.00x)
memchr/sherlock/common/tiny1       rust/memchr/memchr/fallback  1605.0 MB/s (41.00x)  64.3 GB/s (1.00x)
memchr/sherlock/never/huge1        rust/memchr/memchr/fallback  25.3 GB/s (1.00x)     23.0 GB/s (1.10x)
memchr/sherlock/uncommon/huge1     rust/memchr/memchr/fallback  6.0 GB/s (1.35x)      8.1 GB/s (1.00x)
memchr/sherlock/uncommon/small1    rust/memchr/memchr/fallback  7.5 GB/s (1.98x)      14.7 GB/s (1.00x)
memchr/sherlock/verycommon/huge1   rust/memchr/memchr/fallback  692.4 MB/s (1.58x)    1096.0 MB/s (1.00x)
memchr/sherlock/verycommon/small1  rust/memchr/memchr/fallback  1901.6 MB/s (1.00x)   1688.6 MB/s (1.13x)

I think this is on balance a good change. I'm still a little worried about the regressions, but I generally side toward throughput gains.

I also updated the PR to do an unaligned read at the end of find/rfind instead of iterating byte-by-byte.

Nice, thank you!

BurntSushi commented 3 months ago

This PR is on crates.io in memchr 2.7.3.

CryZe commented 3 months ago

Looks like this broke all big endian targets. #152

BurntSushi commented 3 months ago

This PR was reverted in #153, and the revert was released as memchr 2.7.4 on crates.io.

stepancheg commented 3 months ago

Resubmitted as #154.