Closed: stepancheg closed this pull request 5 months ago.
Any suggestions on how to check whether this PR makes it faster? In any case, it should not be slower.
You'll want to use rebar. From the root of this repo on master, build the fallback benchmark engine and get a baseline:
$ rebar build -e '^rust/memchr/memchr/fallback$'
$ rebar measure -e '^rust/memchr/memchr/fallback$' | tee before.csv
Then check out your PR, rebuild the benchmark engine and measure your change:
$ rebar build -e '^rust/memchr/memchr/fallback$'
$ rebar measure -e '^rust/memchr/memchr/fallback$' | tee after.csv
Then you can diff the results:
$ rebar diff before.csv after.csv
benchmark engine before.csv after.csv
--------- ------ ---------- ---------
memchr/sherlock/common/huge1 rust/memchr/memchr/fallback 1189.8 MB/s (1.96x) 2.3 GB/s (1.00x)
memchr/sherlock/common/small1 rust/memchr/memchr/fallback 4.4 GB/s (1.00x) 4.0 GB/s (1.09x)
memchr/sherlock/common/tiny1 rust/memchr/memchr/fallback 1880.1 MB/s (1.00x) 1462.3 MB/s (1.29x)
memchr/sherlock/never/huge1 rust/memchr/memchr/fallback 24.3 GB/s (1.06x) 25.8 GB/s (1.00x)
memchr/sherlock/never/small1 rust/memchr/memchr/fallback 18.7 GB/s (1.00x) 18.2 GB/s (1.03x)
memchr/sherlock/never/tiny1 rust/memchr/memchr/fallback 3.8 GB/s (1.00x) 3.8 GB/s (1.00x)
memchr/sherlock/never/empty1 rust/memchr/memchr/fallback 12.00ns (1.00x) 12.00ns (1.00x)
memchr/sherlock/rare/huge1 rust/memchr/memchr/fallback 24.5 GB/s (1.00x) 24.2 GB/s (1.01x)
memchr/sherlock/rare/small1 rust/memchr/memchr/fallback 17.7 GB/s (1.00x) 17.7 GB/s (1.00x)
memchr/sherlock/rare/tiny1 rust/memchr/memchr/fallback 3.6 GB/s (1.00x) 3.2 GB/s (1.11x)
memchr/sherlock/uncommon/huge1 rust/memchr/memchr/fallback 6.5 GB/s (1.94x) 12.6 GB/s (1.00x)
memchr/sherlock/uncommon/small1 rust/memchr/memchr/fallback 12.1 GB/s (1.00x) 11.7 GB/s (1.04x)
memchr/sherlock/uncommon/tiny1 rust/memchr/memchr/fallback 2.9 GB/s (1.00x) 2.3 GB/s (1.27x)
memchr/sherlock/verycommon/huge1 rust/memchr/memchr/fallback 695.0 MB/s (1.87x) 1300.7 MB/s (1.00x)
memchr/sherlock/verycommon/small1 rust/memchr/memchr/fallback 2.1 GB/s (1.00x) 1985.1 MB/s (1.09x)
Add a -t 1.1 to only show results with a 1.1x difference or bigger:
$ rebar diff before.csv after.csv -t 1.1
benchmark engine before.csv after.csv
--------- ------ ---------- ---------
memchr/sherlock/common/huge1 rust/memchr/memchr/fallback 1189.8 MB/s (1.96x) 2.3 GB/s (1.00x)
memchr/sherlock/common/tiny1 rust/memchr/memchr/fallback 1880.1 MB/s (1.00x) 1462.3 MB/s (1.29x)
memchr/sherlock/rare/tiny1 rust/memchr/memchr/fallback 3.6 GB/s (1.00x) 3.2 GB/s (1.11x)
memchr/sherlock/uncommon/huge1 rust/memchr/memchr/fallback 6.5 GB/s (1.94x) 12.6 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1 rust/memchr/memchr/fallback 2.9 GB/s (1.00x) 2.3 GB/s (1.27x)
memchr/sherlock/verycommon/huge1 rust/memchr/memchr/fallback 695.0 MB/s (1.87x) 1300.7 MB/s (1.00x)
So I see a big improvement in memchr/sherlock/common/huge1, memchr/sherlock/uncommon/huge1 and memchr/sherlock/verycommon/huge1, which I think makes sense, since those benchmarks test the case of "lots of matches" (of varying degrees) on a very large haystack.
But it looks like there's a smaller regression on memchr/sherlock/common/tiny1, memchr/sherlock/uncommon/tiny1 and memchr/sherlock/rare/tiny1. In these benchmarks, the haystack is quite small, so the benchmark tends to be dominated by constant factors (like setup costs). That might be worth looking into.
Here are my guesses. I ran the benchmarks several times on my Apple M1 laptop, with binaries built properly for arm64.
benchmark engine before2.csv after2.csv
--------- ------ ----------- ----------
memchr/sherlock/common/huge1 rust/memchr/memchr/fallback 1032.0 MB/s (1.67x) 1719.7 MB/s (1.00x)
memchr/sherlock/common/small1 rust/memchr/memchr/fallback 3.0 GB/s (1.25x) 3.7 GB/s (1.00x)
memchr/sherlock/uncommon/huge1 rust/memchr/memchr/fallback 5.0 GB/s (1.54x) 7.7 GB/s (1.00x)
memchr/sherlock/uncommon/small1 rust/memchr/memchr/fallback 7.5 GB/s (1.98x) 14.7 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1 rust/memchr/memchr/fallback 1605.0 MB/s (41.00x) 64.3 GB/s (1.00x)
memchr/sherlock/verycommon/huge1 rust/memchr/memchr/fallback 611.3 MB/s (1.62x) 988.0 MB/s (1.00x)
memchr/sherlock/verycommon/small1 rust/memchr/memchr/fallback 1895.9 MB/s (1.00x) 1518.6 MB/s (1.25x)
It shows a consistent regression on this benchmark:
memchr/sherlock/verycommon/small1 rust/memchr/memchr/fallback 1895.9 MB/s (1.00x) 1518.6 MB/s (1.25x)
which makes sense: it searches for a space, and the text is:
Mr. Sherlock Holmes, who was usually very late in the mornings, save
   ^        ^       ^   ^   ^       ^    ^    ^  ^   ^         ^
so it often processes eight bytes at once with shifts and such instead of finding the character on the fourth iteration.
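That claim is easy to check concretely: in English text a space occurs roughly every five bytes, so nearly every word-sized chunk of the sample line contains a match, and the "extract the index with bit operations" path runs for almost every chunk rather than the cheap "no match here, skip ahead" path. A small illustrative snippet (not memchr code; `chunks_with_needle` is a made-up helper name):

```rust
// Count how many 8-byte chunks of a haystack contain at least one
// occurrence of the needle byte. Purely illustrative.
fn chunks_with_needle(haystack: &[u8], needle: u8, word: usize) -> (usize, usize) {
    let hits = haystack
        .chunks(word)
        .filter(|c| c.contains(&needle))
        .count();
    (hits, haystack.chunks(word).len())
}

fn main() {
    let line: &[u8] = b"Mr. Sherlock Holmes, who was usually very late in the mornings, save";
    let (hits, total) = chunks_with_needle(line, b' ', 8);
    // prints: 8 of 9 8-byte chunks contain a space
    println!("{hits} of {total} 8-byte chunks contain a space");
}
```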
About your results:

memchr/sherlock/common/tiny1 — searches for a space; same issue.

memchr/sherlock/rare/tiny1 — searches for . in a text which has only one .; if this regression is consistent, I cannot explain it.

memchr/sherlock/uncommon/tiny1 — searches for l, likely the same issue as with the space:

Mr. Sherlock Holmes, who was usually very late in the mornings, save
        ^      ^                 ^^       ^
Updated the comment above. Also updated the PR to do an unaligned read at the end of find/rfind instead of iterating byte by byte.
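That "unaligned read at the end" idea can be sketched roughly as follows (hypothetical code, not the crate's actual implementation; assumes a 64-bit little-endian target): instead of walking the last `len % word_size` bytes one at a time, re-read the final word of the haystack, accepting that this window overlaps bytes the main loop has already cleared.

```rust
// Hypothetical sketch of finishing find() with one unaligned, overlapping
// word read instead of a byte-at-a-time tail loop. `find_in_word` is a
// SWAR helper in the style this PR adds; names here are illustrative.
const WORD: usize = core::mem::size_of::<usize>();

/// Index of the first byte equal to `needle` in `chunk` (little-endian).
fn find_in_word(chunk: usize, needle: u8) -> Option<usize> {
    const LO: usize = usize::MAX / 255; // 0x0101...01
    // XOR turns bytes equal to `needle` into zero bytes, then the classic
    // zero-byte test sets the high bit of exactly those bytes.
    let x = chunk ^ (LO * needle as usize);
    let found = x.wrapping_sub(LO) & !x & (LO * 0x80);
    (found != 0).then(|| (found.trailing_zeros() / 8) as usize)
}

/// Finish a forward search whose main loop has already verified that
/// `haystack[..scanned]` contains no `needle`.
fn find_tail(haystack: &[u8], needle: u8, scanned: usize) -> Option<usize> {
    assert!(haystack.len() >= WORD && scanned <= haystack.len());
    debug_assert!(!haystack[..scanned].contains(&needle));
    // One unaligned read of the last WORD bytes. The window may overlap the
    // already-scanned region, but since that region holds no match, any hit
    // the window reports is necessarily at an index >= scanned.
    let start = haystack.len() - WORD;
    let chunk = usize::from_ne_bytes(haystack[start..].try_into().unwrap());
    find_in_word(chunk, needle).map(|i| start + i)
}

fn main() {
    // 19-byte haystack: two full words cover [0..16); the overlapping final
    // window covers [11..19) and finds the 'm' of "mes" at index 16.
    assert_eq!(find_tail(b"Mr. Sherlock Holmes", b'm', 16), Some(16));
    assert_eq!(find_tail(b"Mr. Sherlock Holmes", b'z', 16), None);
}
```

The overlap is the key design point: re-checking a few already-scanned bytes costs one word operation, which is cheaper than a byte-by-byte loop over the tail.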
With the updated PR, on x86-64:
$ rebar diff before.csv after.csv -t 1.1
benchmark engine before.csv after.csv
--------- ------ ---------- ---------
memchr/sherlock/common/huge1 rust/memchr/memchr/fallback 1187.8 MB/s (1.99x) 2.3 GB/s (1.00x)
memchr/sherlock/common/tiny1 rust/memchr/memchr/fallback 1880.1 MB/s (1.00x) 1495.5 MB/s (1.26x)
memchr/sherlock/never/tiny1 rust/memchr/memchr/fallback 3.8 GB/s (1.21x) 4.6 GB/s (1.00x)
memchr/sherlock/never/empty1 rust/memchr/memchr/fallback 12.00ns (1.00x) 18.00ns (1.50x)
memchr/sherlock/rare/tiny1 rust/memchr/memchr/fallback 3.6 GB/s (1.20x) 4.3 GB/s (1.00x)
memchr/sherlock/uncommon/huge1 rust/memchr/memchr/fallback 6.5 GB/s (2.19x) 14.2 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1 rust/memchr/memchr/fallback 3.1 GB/s (1.00x) 2.4 GB/s (1.29x)
memchr/sherlock/verycommon/huge1 rust/memchr/memchr/fallback 691.6 MB/s (1.88x) 1302.9 MB/s (1.00x)
And on my M2 mac mini:
$ rebar diff before.csv after.csv -t 1.1
benchmark engine before.csv after.csv
--------- ------ ---------- ---------
memchr/sherlock/common/huge1 rust/memchr/memchr/fallback 1170.6 MB/s (1.61x) 1887.8 MB/s (1.00x)
memchr/sherlock/common/tiny1 rust/memchr/memchr/fallback 1605.0 MB/s (41.00x) 64.3 GB/s (1.00x)
memchr/sherlock/never/huge1 rust/memchr/memchr/fallback 25.3 GB/s (1.00x) 23.0 GB/s (1.10x)
memchr/sherlock/uncommon/huge1 rust/memchr/memchr/fallback 6.0 GB/s (1.35x) 8.1 GB/s (1.00x)
memchr/sherlock/uncommon/small1 rust/memchr/memchr/fallback 7.5 GB/s (1.98x) 14.7 GB/s (1.00x)
memchr/sherlock/verycommon/huge1 rust/memchr/memchr/fallback 692.4 MB/s (1.58x) 1096.0 MB/s (1.00x)
memchr/sherlock/verycommon/small1 rust/memchr/memchr/fallback 1901.6 MB/s (1.00x) 1688.6 MB/s (1.13x)
I think this is on balance a good change. I'm still a little worried about the regressions, but I generally side toward throughput gains.
Also updated the PR to do an unaligned read at the end of find/rfind instead of iterating byte by byte.
Nice thank you!
This PR is on crates.io in memchr 2.7.3.
Looks like this broke all big endian targets. #152
This PR was reverted in #153, and the revert was released as memchr 2.7.4 on crates.io.
Resubmitted as #154.
The current generic ("all") implementation checks whether a chunk (usize) contains a zero byte, and if it does, iterates over the bytes of that chunk to find the index of the zero byte. Instead, we can use a few more bit operations to find the index without a loop.

Context: we use memchr, but many of our strings are short. Currently the SIMD-optimized memchr processes bytes one by one when the string is shorter than a SIMD register. I suspect it can be made faster if we take a usize-sized chunk which does not fit into a SIMD register and process it with such a utility, similarly to how the AVX2 implementation falls back to SSE2. So I looked for a generic implementation to reuse in the SIMD-optimized version, but there was none. So here it is.

Any suggestions on how to check whether this PR makes it faster? In any case, it should not be slower.
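The "find the index without loops" step described above is the classic SWAR zero-byte trick. A hedged sketch (not the PR's actual code) for a little-endian, 64-bit target; on big-endian the first match sits in the most significant byte, so the index extraction must differ, and mishandling exactly this kind of byte-order detail is what the big-endian breakage in #152 was about:

```rust
/// Find the index of the first occurrence of `needle` in a word-sized
/// chunk using bit operations instead of a per-byte loop. Illustrative
/// sketch only; assumes a little-endian target.
fn find_in_chunk(chunk: usize, needle: u8) -> Option<usize> {
    const LO: usize = usize::MAX / 255; // 0x0101...01: a 1 in every byte
    const HI: usize = LO * 0x80;        // 0x8080...80: high bit of every byte
    // XOR turns every byte equal to `needle` into 0x00.
    let x = chunk ^ (LO * needle as usize);
    // Classic zero-byte test: the high bit of a byte is set in `found`
    // iff that byte of `x` is zero. Borrows in the subtraction can only
    // start at a zero byte, so the lowest set bit marks the first match.
    let found = x.wrapping_sub(LO) & !x & HI;
    if found == 0 {
        None
    } else {
        // Lowest set high bit -> byte index, on little-endian.
        Some((found.trailing_zeros() / 8) as usize)
    }
}

fn main() {
    // 8-byte chunk "Mr. Sher" (64-bit target assumed for the literal):
    // the space is at byte index 3.
    let chunk = usize::from_ne_bytes(*b"Mr. Sher");
    assert_eq!(find_in_chunk(chunk, b' '), Some(3));
    assert_eq!(find_in_chunk(chunk, b'z'), None);
}
```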