Closed cgbur closed 6 months ago
For context, the 2.6 release did a complete refactoring of all the SIMD code in this crate. The algorithms themselves didn't really change, but pretty much everything around them did. This was done to facilitate aarch64 and wasm32 support, as the previous infrastructure wasn't really amenable to supporting other targets easily. The unfortunate aspect of this refactoring is that there really is no simple way to compare the two. Your best bet is to look at the Assembly of the hottest code and see if there are any relevant codegen differences at that level. Unfortunately, I've found that it is sometimes the case that branches can get flipped around for seemingly unrelated changes in the code itself. So this may be a difficult thing to fix.
Now I did to a pretty thorough benchmark comparison between the old and the new. You can even explore the results yourself with rebar in a checkout of this repo:
$ rebar cmp benchmarks/record/x86_64/2023-08-27.csv -e '^rust/memchr/memmem/prebuilt' -e '^rust/memchrold/memmem/prebuilt' -t 1.1
benchmark rust/memchr/memmem/prebuilt rust/memchrold/memmem/prebuilt
--------- --------------------------- ------------------------------
memmem/code/rust-library-never-fn-strength 37.7 GB/s (1.13x) 42.4 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux 55.7 GB/s (1.00x) 38.7 GB/s (1.44x)
memmem/code/rust-library-rare-fn-from-str 53.6 GB/s (1.00x) 40.3 GB/s (1.33x)
memmem/code/rust-library-common-fn-is-empty 52.8 GB/s (1.00x) 38.3 GB/s (1.38x)
memmem/code/rust-library-common-fn 28.3 GB/s (1.00x) 21.7 GB/s (1.30x)
memmem/code/rust-library-common-let 19.6 GB/s (1.00x) 16.1 GB/s (1.22x)
memmem/pathological/md5-huge-no-hash 50.3 GB/s (1.00x) 42.1 GB/s (1.20x)
memmem/pathological/rare-repeated-huge-tricky 63.4 GB/s (1.00x) 40.9 GB/s (1.55x)
memmem/pathological/rare-repeated-huge-match 1853.9 MB/s (1.00x) 1574.8 MB/s (1.18x)
memmem/pathological/rare-repeated-small-tricky 25.9 GB/s (1.00x) 22.7 GB/s (1.14x)
memmem/pathological/rare-repeated-small-match 1811.4 MB/s (1.00x) 1468.7 MB/s (1.23x)
memmem/pathological/defeat-simple-vector-repeated-alphabet 1238.3 MB/s (1.00x) 1101.1 MB/s (1.12x)
memmem/sliceslice/short 6.53ms (1.00x) 14.65ms (2.24x)
memmem/sliceslice/seemingly-random 11.3 MB/s (1.00x) 9.1 MB/s (1.24x)
memmem/subtitles/common/huge-en-you 15.5 GB/s (1.00x) 10.9 GB/s (1.42x)
memmem/subtitles/common/huge-en-one-space 1339.0 MB/s (1.12x) 1503.9 MB/s (1.00x)
memmem/subtitles/common/huge-ru-that 37.1 GB/s (1.00x) 29.6 GB/s (1.25x)
memmem/subtitles/common/huge-zh-that 26.0 GB/s (1.10x) 28.8 GB/s (1.00x)
memmem/subtitles/common/huge-zh-do-not 17.2 GB/s (1.00x) 14.4 GB/s (1.19x)
memmem/subtitles/common/huge-zh-one-space 4.7 GB/s (1.10x) 5.2 GB/s (1.00x)
memmem/subtitles/never/huge-en-john-watson 63.6 GB/s (1.00x) 41.3 GB/s (1.54x)
memmem/subtitles/never/huge-en-all-common-bytes 52.7 GB/s (1.00x) 41.8 GB/s (1.26x)
memmem/subtitles/never/huge-en-some-rare-bytes 63.6 GB/s (1.00x) 43.0 GB/s (1.48x)
memmem/subtitles/never/huge-en-two-space 62.2 GB/s (1.00x) 36.2 GB/s (1.72x)
memmem/subtitles/never/teeny-en-john-watson 1780.2 MB/s (1.00x) 1161.0 MB/s (1.53x)
memmem/subtitles/never/teeny-en-all-common-bytes 1780.2 MB/s (1.00x) 1161.0 MB/s (1.53x)
memmem/subtitles/never/teeny-en-some-rare-bytes 1668.9 MB/s (1.00x) 1161.0 MB/s (1.44x)
memmem/subtitles/never/teeny-en-two-space 1780.2 MB/s (1.00x) 1161.0 MB/s (1.53x)
memmem/subtitles/never/huge-ru-john-watson 63.5 GB/s (1.00x) 34.3 GB/s (1.85x)
memmem/subtitles/never/teeny-ru-john-watson 2.4 GB/s (1.00x) 1741.5 MB/s (1.44x)
memmem/subtitles/never/huge-zh-john-watson 60.1 GB/s (1.00x) 42.4 GB/s (1.42x)
memmem/subtitles/never/teeny-zh-john-watson 1970.9 MB/s (1.00x) 1285.4 MB/s (1.53x)
memmem/subtitles/rare/huge-en-sherlock 60.1 GB/s (1.00x) 40.8 GB/s (1.47x)
memmem/subtitles/rare/huge-en-medium-needle 54.1 GB/s (1.00x) 40.6 GB/s (1.33x)
memmem/subtitles/rare/huge-en-long-needle 43.7 GB/s (1.00x) 2.5 GB/s (17.22x)
memmem/subtitles/rare/huge-en-huge-needle 46.6 GB/s (1.00x) 2.3 GB/s (20.52x)
memmem/subtitles/rare/teeny-en-sherlock-holmes 1570.8 MB/s (1.00x) 1027.0 MB/s (1.53x)
memmem/subtitles/rare/teeny-en-sherlock 1271.6 MB/s (1.00x) 953.7 MB/s (1.33x)
memmem/subtitles/rare/huge-ru-sherlock-holmes 63.4 GB/s (1.00x) 41.9 GB/s (1.51x)
memmem/subtitles/rare/huge-ru-sherlock 61.9 GB/s (1.00x) 40.6 GB/s (1.52x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes 2.1 GB/s (1.00x) 1430.5 MB/s (1.47x)
memmem/subtitles/rare/teeny-ru-sherlock 1602.2 MB/s (1.00x) 1251.7 MB/s (1.28x)
memmem/subtitles/rare/huge-zh-sherlock-holmes 37.0 GB/s (1.14x) 42.3 GB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock 1137.1 MB/s (1.00x) 895.9 MB/s (1.27x)
As you can see, the new code is faster in the vast majority of cases. But there are some cases where the old code is seemingly a touch faster. But the differences are overall pretty small. The other thing to note here is that improvements in one case can lead to regressions in another.
I will note that I've done all my benchmarking for x86_64
on Intel CPUs where as you seem to have an AMD CPU. So the differences may not line up.
Eyeballing it, I'd say all of your perf differences except for g
look "close enough" to be within noise or are otherwise acceptable from my perspective.
I only really see two paths forward here. Either you can get me a repro and I can look into it, or you'll have to dive in and look into it yourself. I'm not really sure there's another way. Perf regressions like this really really need a repro of some kind. I know it can be a lot of work to produce one.
Ok I have made a tiny reproducible example:
cargo new --bin bench-memchr
cd bench-memchr
curl -o frankenstein.txt https://www.gutenberg.org/files/84/84-0.txt
Cargo.toml
[package]
name = "bench-memchr"
version = "0.1.0"
edition = "2021"
[dependencies]
memchr = "=2.6.4" # or =2.5.0 when testing that version
main.rs
use memchr::memmem::Finder;
use std::{fs::File, io::Read, time::Instant};
const NEEDLE: &[u8] = b"Burnt";
const NUM_ITERATIONS: usize = 100000;
const FILE_NAME: &str = "frankenstein.txt";
fn main() {
let mut file = File::open(FILE_NAME).unwrap();
let mut buffer = Vec::new();
file.read_to_end(&mut buffer).unwrap();
let finder = Finder::new(NEEDLE);
let mut count = 0;
let start = Instant::now();
for _ in 0..NUM_ITERATIONS {
count += search(&buffer, &finder);
}
let end = start.elapsed();
println!("Found {} matches in {:?}", count, end);
let bytes_per_sec = (buffer.len() * NUM_ITERATIONS) as f64 / end.as_secs_f64();
println!("{} GiB/s", bytes_per_sec / 1024.0 / 1024.0 / 1024.0);
}
#[inline(never)]
fn search(buf: &[u8], finder: &Finder) -> usize {
let mut count = 0;
for _ in finder.find_iter(buf) {
count += 1;
}
count
}
And running this:
❯ hyperfine './memchr-2.5.0' './memchr-2.6.4'
Benchmark 1: ./memchr-2.5.0
Time (mean ± σ): 954.0 ms ± 23.6 ms [User: 952.9 ms, System: 1.1 ms]
Range (min … max): 926.2 ms … 1005.8 ms 10 runs
Benchmark 2: ./memchr-2.6.4
Time (mean ± σ): 1.296 s ± 0.032 s [User: 1.295 s, System: 0.001 s]
Range (min … max): 1.262 s … 1.343 s 10 runs
Summary
./memchr-2.5.0 ran
1.36 ± 0.05 times faster than ./memchr-2.6.4
The reported speed difference is ~32GiB/s vs ~42GiB/s
I tried looking at the assembly outputs, but struggling a bit to know what I'm looking at or looking for.
Here are the two flamegraphs:
1.5.0:
1.6.4:
Thanks! I'll take a look later.
For Assembly, the best way in my experience is to use perf
on Linux. It should be pretty clear where the hotspot is in functions like memchr::arch::x86_64::avx2::packedpair::Finder::find_impl
. Look for the SIMD instructions starting with v
. That should be where most of the time is being spent. Then compare those with the instructions used in memchr 2.5.0
.
But I'll take a look now that I have a repro. Just not sure when. Thank you!
2.5
│280:┌─→add $0x20,%rcx ▒
7.60 │ │ cmp %rsi,%rcx ▒
3.78 │ │↑ ja 123 ▒
│28d:│ vpcmpeqb (%rcx,%rdx,1),%ymm0,%ymm2 ▒
9.68 │ │ vpcmpeqb (%rcx,%rdi,1),%ymm1,%ymm3 ▒
6.71 │ │ vpand %ymm2,%ymm3,%ymm2 ▒
3.78 │ │ vpmovmskb %ymm2,%r12d ▒
1.37 │ │ test %r12d,%r12d ▒
59.11 │ └──je 280
2.6
0.01 │420:┌─→add $0x20,%rcx ▒
2.13 │ │ cmp %rdx,%rcx ▒
0.02 │ │↑ ja 10b ▒
│42d:│ mov 0x10(%rsp),%rax ▒
4.02 │ │ vpcmpeqb (%rcx,%rax,1),%ymm0,%ymm2 ▒
9.59 │ │ vpcmpeqb (%rcx,%rbx,1),%ymm1,%ymm3 ▒
3.62 │ │ vpand %ymm2,%ymm3,%ymm2 ◆
2.68 │ │ vpmovmskb %ymm2,%ebp ▒
0.44 │ │ test %ebp,%ebp ▒
69.99 │ └──je 420
This appears to be the hot loop. Extra move instruction snuck its way in not sure what thats for. Unfortunately this ec2 machine does not let me see perf counters so I cant look at branch prediction etc. I can try later on my own hardware and recreate this then I'd have more info. This is about as deep of an analysis as I'll do without more info because I'd just be guessing!
In my case, there is a regression, but it is much smaller:
$ hyperfine './target/release/i139-2.5.0' './target/release/i139-2.6.4'
Benchmark 1: ./target/release/i139-2.5.0
Time (mean ± σ): 1.065 s ± 0.049 s [User: 1.064 s, System: 0.001 s]
Range (min … max): 1.009 s … 1.196 s 10 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: ./target/release/i139-2.6.4
Time (mean ± σ): 1.128 s ± 0.014 s [User: 1.127 s, System: 0.000 s]
Range (min … max): 1.109 s … 1.142 s 10 runs
Summary
'./target/release/i139-2.5.0' ran
1.06 ± 0.05 times faster than './target/release/i139-2.6.4'
If I look at the codegen, I can indeed see an extra mov
instruction. Actually, I see two of them. In 2.5.0, the hot loop contains 9 instructions where as in 2.6.4 it has 11 instructions (2 extra mov
). For 2.5.0:
And 2.6.4:
The extent to which these extra instructions actually impacts perf can be highly variable. As can be seen here, we are seeing two very different scales of regression here. When I re-arranged the code, I do recall noticing the extra instructions (because I compared the codegen), but concluded that it was okay because the benchmarks didn't seem to suggest it mattered much. But perhaps that is not always the case. I don't know if the extra mov
instructions are the culprit here or not, but it's the only obvious difference I can see between 2.5.0 and 2.6.4 on this particular benchmark.
The unfortunate bit is that getting rid of those extra mov
instructions is a bit of a black art... Probably because I don't really understand why they show up there in the first place. We probably need an llvm expert to unravel that mystery. Anyway, I'll take a crack at trying to adjust the code to get rid of the mov
instructions.
I just tested on my machine:
AMD Ryzen 9 5900X 12-Core Processor
confirm the performance regression remains:
❯ hyperfine "./memchr-2.5.0" "./memchr-2.6.4"
Benchmark 1: ./memchr-2.5.0
Time (mean ± σ): 758.6 ms ± 14.0 ms [User: 756.4 ms, System: 1.8 ms]
Range (min … max): 730.5 ms … 779.8 ms 10 runs
Benchmark 2: ./memchr-2.6.4
Time (mean ± σ): 915.3 ms ± 9.3 ms [User: 914.0 ms, System: 0.9 ms]
Range (min … max): 905.8 ms … 932.6 ms 10 runs
Summary
./memchr-2.5.0 ran
1.21 ± 0.03 times faster than ./memchr-2.6.4
Using https://github.com/andrewrk/poop you can see it appears to mostly be executing more instructions, so nothing to do with cache/branch misses it seems.
❯ poop "./memchr-2.5.0" "./memchr-2.6.4" -d 20000
Benchmark 1 (26 runs): ./memchr-2.5.0
measurement mean ± σ min … max outliers delta
wall_time 774ms ± 14.9ms 738ms … 797ms 0 ( 0%) 0%
peak_rss 2.36MB ± 0 2.36MB … 2.36MB 0 ( 0%) 0%
cpu_cycles 3.54G ± 53.1M 3.41G … 3.61G 0 ( 0%) 0%
instructions 12.9G ± 324 12.9G … 12.9G 0 ( 0%) 0%
cache_references 1.03G ± 101M 763M … 1.16G 1 ( 4%) 0%
cache_misses 20.1M ± 4.45M 11.3M … 30.3M 0 ( 0%) 0%
branch_misses 14.0M ± 84.2K 13.9M … 14.1M 0 ( 0%) 0%
Benchmark 2 (22 runs): ./memchr-2.6.4
measurement mean ± σ min … max outliers delta
wall_time 924ms ± 25.2ms 867ms … 973ms 1 ( 5%) 💩+ 19.3% ± 1.5%
peak_rss 2.36MB ± 0 2.36MB … 2.36MB 0 ( 0%) - 0.0% ± 0.0%
cpu_cycles 4.22G ± 88.4M 4.10G … 4.48G 0 ( 0%) 💩+ 19.5% ± 1.2%
instructions 15.7G ± 400 15.7G … 15.7G 1 ( 5%) 💩+ 21.7% ± 0.0%
cache_references 951M ± 98.2M 727M … 1.13G 0 ( 0%) ⚡- 7.4% ± 5.7%
cache_misses 19.4M ± 4.97M 11.5M … 31.0M 0 ( 0%) - 3.4% ± 13.7%
branch_misses 13.8M ± 78.8K 13.7M … 14.0M 1 ( 5%) ⚡- 1.5% ± 0.3%
Ug. Right. I had this in the old code:
But it's not present in the new code:
I didn't get rid of that on accident. It turns out that replicating that in the new code is extremely difficult because the new code is generic. There is no way to write a generic #[target_feature(...)]
attribute or forward it. So I can't write that inner unlineable function. But... I can at least experiment with it in the x86-64 case and at least see if adding it back makes things better.
...
OK, so I added this:
#[cold]
#[inline(never)]
#[target_feature(enable = "sse2", enable = "avx2")]
unsafe fn matched<V: Vector>(
cur: *const u8,
eqa: V,
eqb: V,
eqc: V,
eqd: V,
) -> *const u8 {
let topos = V::Mask::first_offset;
let mask = eqa.movemask();
if mask.has_non_zero() {
return cur.add(topos(mask));
}
let mask = eqb.movemask();
if mask.has_non_zero() {
return cur.add(1 * V::BYTES).add(topos(mask));
}
let mask = eqc.movemask();
if mask.has_non_zero() {
return cur.add(2 * V::BYTES).add(topos(mask));
}
let mask = eqd.movemask();
debug_assert!(mask.has_non_zero());
return cur.add(3 * V::BYTES).add(topos(mask));
}
And changed the inner check to:
if or3.movemask_will_have_non_zero() {
return Some(matched(cur, eqa, eqb, eqc, eqd));
}
But that doesn't seem to help. Indeed, it looks like matched
is still getting inlined, probably because it's getting monomorphized? Not sure.
I also tried std::intrinsics::unlikely
(unstable) and that didn't help.
I just unfortunately don't really know how to get rid of those extra instructions.
For anyone who might be willing to help, I've created a bit more of a refined reproduction of the issue that shows the codegen problem here: https://github.com/BurntSushi/memchr-2.6-mov-regression
Any help would be appreciated!
This seems to be a register pressure issue. For me, LLVM has hoisted this load of index2
, but spills it to the stack as 0x10(%rsp)
instead of keeping it in a register: https://github.com/BurntSushi/memchr/blob/2.6.4/src/arch/generic/packedpair.rs#L237
The following disgusting patch gets rid of the extra mov
for me. Perhaps it can inspire a non-disgusting patch. The idea is rather than keeping {cur, index1, index2}
all live at once, pre-compute (cur1, cur2) = (cur.add(index1), cur.add(index2))
and then you can forget about the indices.
diff --git a/src/arch/generic/packedpair.rs b/src/arch/generic/packedpair.rs
index 8d97cf2..2667cc8 100644
--- a/src/arch/generic/packedpair.rs
+++ b/src/arch/generic/packedpair.rs
@@ -93,7 +93,6 @@ impl<V: Vector> Finder<V> {
let start = haystack.as_ptr();
let end = start.add(haystack.len());
let max = end.sub(self.min_haystack_len);
- let mut cur = start;
// N.B. I did experiment with unrolling the loop to deal with size(V)
// bytes at a time and 2*size(V) bytes at a time. The double unroll
@@ -101,14 +100,21 @@ impl<V: Vector> Finder<V> {
// slower. In the end, I decided the complexity from unrolling wasn't
// worth it. I used the memmem/krate/prebuilt/huge-en/ benchmarks to
// compare.
- while cur <= max {
- if let Some(chunki) = self.find_in_chunk(needle, cur, end, all) {
- return Some(matched(start, cur, chunki));
+ let index1 = usize::from(self.pair.index1());
+ let index2 = usize::from(self.pair.index2());
+ let mut cur1 = start.add(index1);
+ let mut cur2 = start.add(index2);
+ let end1 = end.add(index1);
+ let max1 = max.add(index1);
+ while cur1 <= max1 {
+ if let Some(chunki) = self.find_in_chunk(needle, cur1, cur2, end1, all) {
+ return Some(matched(start, cur1.sub(index1), chunki));
}
- cur = cur.add(V::BYTES);
+ cur1 = cur1.add(V::BYTES);
+ cur2 = cur2.add(V::BYTES);
}
- if cur < end {
- let remaining = end.distance(cur);
+ if cur1 < end1 {
+ let remaining = end1.distance(cur1);
debug_assert!(
remaining < self.min_haystack_len,
"remaining bytes should be smaller than the minimum haystack \
@@ -120,10 +126,10 @@ impl<V: Vector> Finder<V> {
return None;
}
debug_assert!(
- max < cur,
+ max1 < cur1,
"after main loop, cur should have exceeded max",
);
- let overlap = cur.distance(max);
+ let overlap = cur1.distance(max1);
debug_assert!(
overlap > 0,
"overlap ({}) must always be non-zero",
@@ -140,10 +146,11 @@ impl<V: Vector> Finder<V> {
// occur in find_in_chunk within the overlap are automatically
// ignored.
let mask = V::Mask::all_zeros_except_least_significant(overlap);
- cur = max;
- let m = self.find_in_chunk(needle, cur, end, mask);
+ cur1 = max1;
+ cur2 = max1.sub(index1).add(index2);
+ let m = self.find_in_chunk(needle, cur1, cur2, end1, mask);
if let Some(chunki) = m {
- return Some(matched(start, cur, chunki));
+ return Some(matched(start, cur1.sub(index1), chunki));
}
}
None
@@ -229,25 +236,24 @@ impl<V: Vector> Finder<V> {
unsafe fn find_in_chunk(
&self,
needle: &[u8],
- cur: *const u8,
- end: *const u8,
+ cur1: *const u8,
+ cur2: *const u8,
+ end1: *const u8,
mask: V::Mask,
) -> Option<usize> {
- let index1 = usize::from(self.pair.index1());
- let index2 = usize::from(self.pair.index2());
- let chunk1 = V::load_unaligned(cur.add(index1));
- let chunk2 = V::load_unaligned(cur.add(index2));
+ let chunk1 = V::load_unaligned(cur1);
+ let chunk2 = V::load_unaligned(cur2);
let eq1 = chunk1.cmpeq(self.v1);
let eq2 = chunk2.cmpeq(self.v2);
let mut offsets = eq1.and(eq2).movemask().and(mask);
while offsets.has_non_zero() {
let offset = offsets.first_offset();
- let cur = cur.add(offset);
- if end.sub(needle.len()) < cur {
+ let cur1 = cur1.add(offset);
+ if end1.sub(needle.len()) < cur1 {
return None;
}
- if is_equal_raw(needle.as_ptr(), cur, needle.len()) {
+ if is_equal_raw(needle.as_ptr(), cur1.sub(self.pair.index1() as usize), needle.len()) {
return Some(offset);
}
offsets = offsets.clear_least_significant_bit();
@tavianator Lovely!!! I'm totally fine with that level of disgusting for stuff like this. I'll take a closer look later. Thank you!
Nice, I was looking into this too and came to the same conclusion about register pressure. I was able to get one of the mov
removed by making find_in_chunk
a free function, taking index1
/index2
/v1
/v2
as parameters instead but that seemed hacky.
Maybe interesting is that simpler code in is_equal_raw
also lead to one less mov
, but this certainly needs more benchmarking. On my tigerlake machine, the following version made the throughput on the example go from 30Gib/s to 46Gib/s:
pub unsafe fn is_equal_raw(
mut x: *const u8,
mut y: *const u8,
mut n: usize,
) -> bool {
while n >= 4 {
let vx = x.cast::<u32>().read_unaligned();
let vy = y.cast::<u32>().read_unaligned();
if vx != vy {
return false;
}
x = x.add(4);
y = y.add(4);
n -= 4;
}
if n >= 2 {
if x.cast::<u16>().read_unaligned() != y.cast::<u16>().read_unaligned() {
return false;
}
x = x.add(2);
y = y.add(2);
n -= 2;
}
if n > 0 {
if x.read() != y.read() {
return false;
}
}
true
}
Nice, I was looking into this too and came to the same conclusion about register pressure. I was able to get one of the
mov
removed by makingfind_in_chunk
a free function, takingindex1
/index2
/v1
/v2
as parameters instead but that seemed hacky.
Huh, I thought I tried that and it didn't help. Maybe compiler-dependent, I hadn't updated rustc in a bit
All righty, I've finally had time to swing back around to this. Thank you so much @tavianator and @jhorstmann for the investigation and patches. :-)
The first thing I tried was the simpler is_equal_raw
:
$ rebar diff -e memchr/memmem tmp/base.csv tmp/simpler-is-equal-raw.csv -t 1.1
benchmark engine tmp/base.csv tmp/simpler-is-equal-raw.csv
--------- ------ ------------ ----------------------------
memmem/code/rust-library-never-fn-strength rust/memchr/memmem/prebuilt 40.6 GB/s (1.24x) 50.2 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren rust/memchr/memmem/prebuilt 39.4 GB/s (1.30x) 51.3 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux rust/memchr/memmem/prebuilt 41.0 GB/s (1.25x) 51.4 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str rust/memchr/memmem/prebuilt 38.8 GB/s (1.35x) 52.4 GB/s (1.00x)
memmem/code/rust-library-common-fn-is-empty rust/memchr/memmem/prebuilt 40.2 GB/s (1.25x) 50.3 GB/s (1.00x)
memmem/code/rust-library-common-fn rust/memchr/memmem/prebuilt 24.7 GB/s (1.18x) 29.2 GB/s (1.00x)
memmem/code/rust-library-common-let rust/memchr/memmem/prebuilt 18.1 GB/s (1.10x) 20.0 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash rust/memchr/memmem/prebuilt 38.5 GB/s (1.30x) 50.0 GB/s (1.00x)
memmem/pathological/md5-huge-last-hash rust/memchr/memmem/prebuilt 38.0 GB/s (1.32x) 50.1 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky rust/memchr/memmem/prebuilt 36.6 GB/s (1.72x) 62.8 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-alphabet rust/memchr/memmem/prebuilt 4.0 GB/s (1.33x) 5.4 GB/s (1.00x)
memmem/sliceslice/seemingly-random rust/memchr/memmem/prebuilt 8.1 MB/s (1.17x) 9.5 MB/s (1.00x)
memmem/sliceslice/i386 rust/memchr/memmem/prebuilt 42.7 MB/s (1.25x) 53.5 MB/s (1.00x)
memmem/subtitles/common/huge-en-that rust/memchr/memmem/prebuilt 30.7 GB/s (1.22x) 37.4 GB/s (1.00x)
memmem/subtitles/common/huge-ru-that rust/memchr/memmem/prebuilt 28.8 GB/s (1.26x) 36.2 GB/s (1.00x)
memmem/subtitles/common/huge-ru-one-space rust/memchr/memmem/prebuilt 2.6 GB/s (1.00x) 2.3 GB/s (1.12x)
memmem/subtitles/common/huge-zh-that rust/memchr/memmem/prebuilt 27.2 GB/s (1.40x) 38.2 GB/s (1.00x)
memmem/subtitles/common/huge-zh-do-not rust/memchr/memmem/prebuilt 18.0 GB/s (1.14x) 20.4 GB/s (1.00x)
memmem/subtitles/never/huge-en-john-watson rust/memchr/memmem/prebuilt 39.8 GB/s (1.55x) 61.6 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes rust/memchr/memmem/prebuilt 41.1 GB/s (1.25x) 51.4 GB/s (1.00x)
memmem/subtitles/never/huge-en-some-rare-bytes rust/memchr/memmem/prebuilt 40.2 GB/s (1.57x) 63.0 GB/s (1.00x)
memmem/subtitles/never/huge-en-two-space rust/memchr/memmem/prebuilt 33.2 GB/s (1.89x) 62.8 GB/s (1.00x)
memmem/subtitles/never/huge-ru-john-watson rust/memchr/memmem/prebuilt 39.2 GB/s (1.61x) 63.1 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson rust/memchr/memmem/prebuilt 39.6 GB/s (1.43x) 56.6 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes rust/memchr/memmem/prebuilt 39.3 GB/s (1.55x) 61.0 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock rust/memchr/memmem/prebuilt 43.5 GB/s (1.25x) 54.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-medium-needle rust/memchr/memmem/prebuilt 32.9 GB/s (1.60x) 52.7 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes rust/memchr/memmem/prebuilt 38.7 GB/s (1.58x) 61.0 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock rust/memchr/memmem/prebuilt 39.5 GB/s (1.51x) 59.4 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes rust/memchr/memmem/prebuilt 35.9 GB/s (1.29x) 46.3 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock rust/memchr/memmem/prebuilt 35.5 GB/s (1.56x) 55.1 GB/s (1.00x)
And for the OP's issue using the reproduction repository I made:
$ hyperfine -w5 ./target/release/memchr-2.5.0 ./target/release/memchr-2.6.4 ./target/release/simpler-is-equal-raw
Benchmark 1: ./target/release/memchr-2.5.0
Time (mean ± σ): 1.103 s ± 0.098 s [User: 1.102 s, System: 0.001 s]
Range (min … max): 1.006 s … 1.260 s 10 runs
Benchmark 2: ./target/release/memchr-2.6.4
Time (mean ± σ): 1.096 s ± 0.092 s [User: 1.095 s, System: 0.001 s]
Range (min … max): 1.031 s … 1.286 s 10 runs
Benchmark 3: ./target/release/simpler-is-equal-raw
Time (mean ± σ): 1.035 s ± 0.134 s [User: 1.035 s, System: 0.000 s]
Range (min … max): 0.798 s … 1.178 s 10 runs
Summary
./target/release/simpler-is-equal-raw ran
1.06 ± 0.16 times faster than ./target/release/memchr-2.6.4
1.07 ± 0.17 times faster than ./target/release/memchr-2.5.0
And indeed, looking at the codegen, I can see that one of the mov
instructions was now gone. Sweet. Although, I find it interesting that memchr-2.5.0
and memchr-2.6.4
are virtually the same speed, despite the fact that simpler-is-equal-raw
has one fewer mov
than memchr-2.6.4
and one more mov
than memchr-2.5.0
. Perhaps the other changes to is_equal_raw
are having an impact here. OK, so what happens if I run the hyperfine
command again?
$ hyperfine -w5 ./target/release/memchr-2.5.0 ./target/release/memchr-2.6.4 ./target/release/simpler-is-equal-raw
Benchmark 1: ./target/release/memchr-2.5.0
Time (mean ± σ): 1.071 s ± 0.068 s [User: 1.070 s, System: 0.001 s]
Range (min … max): 1.011 s … 1.258 s 10 runs
Benchmark 2: ./target/release/memchr-2.6.4
Time (mean ± σ): 1.058 s ± 0.019 s [User: 1.057 s, System: 0.001 s]
Range (min … max): 1.030 s … 1.093 s 10 runs
Benchmark 3: ./target/release/simpler-is-equal-raw
Time (mean ± σ): 1.140 s ± 0.118 s [User: 1.139 s, System: 0.001 s]
Range (min … max): 0.918 s … 1.382 s 10 runs
Summary
./target/release/memchr-2.6.4 ran
1.01 ± 0.07 times faster than ./target/release/memchr-2.5.0
1.08 ± 0.11 times faster than ./target/release/simpler-is-equal-raw
OK, so it's just noise. Sigh. What about the rebar benchmarks?
$ rebar diff -e memchr/memmem tmp/base.csv tmp/simpler-is-equal-raw.csv tmp/simpler-is-equal-raw-2.csv -t 1.1
benchmark engine tmp/base.csv tmp/simpler-is-equal-raw.csv tmp/simpler-is-equal-raw-2.csv
--------- ------ ------------ ---------------------------- ------------------------------
memmem/code/rust-library-never-fn-strength rust/memchr/memmem/prebuilt 40.6 GB/s (1.27x) 50.2 GB/s (1.02x) 51.4 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren rust/memchr/memmem/prebuilt 39.4 GB/s (1.33x) 51.3 GB/s (1.02x) 52.3 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux rust/memchr/memmem/prebuilt 41.0 GB/s (1.25x) 51.4 GB/s (1.00x) 37.0 GB/s (1.39x)
memmem/code/rust-library-rare-fn-from-str rust/memchr/memmem/prebuilt 38.8 GB/s (1.35x) 52.4 GB/s (1.00x) 50.8 GB/s (1.03x)
memmem/code/rust-library-common-fn-is-empty rust/memchr/memmem/prebuilt 40.2 GB/s (1.26x) 50.3 GB/s (1.00x) 50.4 GB/s (1.00x)
memmem/code/rust-library-common-fn rust/memchr/memmem/prebuilt 24.7 GB/s (1.19x) 29.2 GB/s (1.01x) 29.5 GB/s (1.00x)
memmem/code/rust-library-common-let rust/memchr/memmem/prebuilt 18.1 GB/s (1.15x) 20.0 GB/s (1.04x) 20.8 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash rust/memchr/memmem/prebuilt 38.5 GB/s (1.30x) 50.0 GB/s (1.00x) 49.8 GB/s (1.00x)
memmem/pathological/md5-huge-last-hash rust/memchr/memmem/prebuilt 38.0 GB/s (1.33x) 50.1 GB/s (1.00x) 50.3 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky rust/memchr/memmem/prebuilt 36.6 GB/s (1.72x) 62.8 GB/s (1.00x) 62.6 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-alphabet rust/memchr/memmem/prebuilt 4.0 GB/s (1.33x) 5.4 GB/s (1.00x) 5.4 GB/s (1.00x)
memmem/sliceslice/seemingly-random rust/memchr/memmem/prebuilt 8.1 MB/s (1.19x) 9.5 MB/s (1.02x) 9.7 MB/s (1.00x)
memmem/sliceslice/i386 rust/memchr/memmem/prebuilt 42.7 MB/s (1.27x) 53.5 MB/s (1.01x) 54.1 MB/s (1.00x)
memmem/subtitles/common/huge-en-that rust/memchr/memmem/prebuilt 30.7 GB/s (1.22x) 37.4 GB/s (1.00x) 37.6 GB/s (1.00x)
memmem/subtitles/common/huge-en-you rust/memchr/memmem/prebuilt 14.7 GB/s (1.06x) 15.6 GB/s (1.00x) 12.4 GB/s (1.25x)
memmem/subtitles/common/huge-ru-that rust/memchr/memmem/prebuilt 28.8 GB/s (1.27x) 36.2 GB/s (1.01x) 36.5 GB/s (1.00x)
memmem/subtitles/common/huge-ru-one-space rust/memchr/memmem/prebuilt 2.6 GB/s (1.00x) 2.3 GB/s (1.12x) 2.6 GB/s (1.00x)
memmem/subtitles/common/huge-zh-that rust/memchr/memmem/prebuilt 27.2 GB/s (1.40x) 38.2 GB/s (1.00x) 38.2 GB/s (1.00x)
memmem/subtitles/common/huge-zh-do-not rust/memchr/memmem/prebuilt 18.0 GB/s (1.14x) 20.4 GB/s (1.00x) 17.5 GB/s (1.17x)
memmem/subtitles/never/huge-en-john-watson rust/memchr/memmem/prebuilt 39.8 GB/s (1.58x) 61.6 GB/s (1.02x) 62.8 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes rust/memchr/memmem/prebuilt 41.1 GB/s (1.25x) 51.4 GB/s (1.00x) 46.3 GB/s (1.11x)
memmem/subtitles/never/huge-en-some-rare-bytes rust/memchr/memmem/prebuilt 40.2 GB/s (1.57x) 63.0 GB/s (1.00x) 62.8 GB/s (1.00x)
memmem/subtitles/never/huge-en-two-space rust/memchr/memmem/prebuilt 33.2 GB/s (1.89x) 62.8 GB/s (1.00x) 51.0 GB/s (1.23x)
memmem/subtitles/never/huge-ru-john-watson rust/memchr/memmem/prebuilt 39.2 GB/s (1.61x) 63.1 GB/s (1.00x) 60.1 GB/s (1.05x)
memmem/subtitles/never/huge-zh-john-watson rust/memchr/memmem/prebuilt 39.6 GB/s (1.47x) 56.6 GB/s (1.03x) 58.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes rust/memchr/memmem/prebuilt 39.3 GB/s (1.59x) 61.0 GB/s (1.03x) 62.7 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock rust/memchr/memmem/prebuilt 43.5 GB/s (1.25x) 54.2 GB/s (1.00x) 53.2 GB/s (1.02x)
memmem/subtitles/rare/huge-en-medium-needle rust/memchr/memmem/prebuilt 32.9 GB/s (1.60x) 52.7 GB/s (1.00x) 52.7 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes rust/memchr/memmem/prebuilt 38.7 GB/s (1.60x) 61.0 GB/s (1.02x) 62.1 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock rust/memchr/memmem/prebuilt 39.5 GB/s (1.51x) 59.4 GB/s (1.00x) 59.0 GB/s (1.01x)
memmem/subtitles/rare/huge-zh-sherlock-holmes rust/memchr/memmem/prebuilt 35.9 GB/s (1.53x) 46.3 GB/s (1.19x) 54.8 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock rust/memchr/memmem/prebuilt 35.5 GB/s (1.56x) 55.1 GB/s (1.00x) 54.7 GB/s (1.01x)
OK, so at least according to the rebar benchmarks, the win seems pretty solid. There's still a bit of noise, but the simpler-is-equal-raw
measurements are solidly better than the status quo.
Now let's move on to @tavianator's patch (keeping the simpler and apparently faster is_equal_raw
routine). I applied the patch and re-ran the rebar benchmarks:
$ rebar diff -e memchr/memmem tmp/base.csv tmp/simpler-is-equal-raw.csv tmp/precompute-indices.csv -t 1.1
benchmark engine tmp/base.csv tmp/simpler-is-equal-raw.csv tmp/precompute-indices.csv
--------- ------ ------------ ---------------------------- --------------------------
memmem/code/rust-library-never-fn-strength rust/memchr/memmem/prebuilt 40.6 GB/s (1.31x) 50.2 GB/s (1.06x) 53.4 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren rust/memchr/memmem/prebuilt 39.4 GB/s (1.33x) 51.3 GB/s (1.02x) 52.5 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux rust/memchr/memmem/prebuilt 41.0 GB/s (1.26x) 51.4 GB/s (1.00x) 51.5 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str rust/memchr/memmem/prebuilt 38.8 GB/s (1.35x) 52.4 GB/s (1.00x) 52.5 GB/s (1.00x)
memmem/code/rust-library-common-fn-is-empty rust/memchr/memmem/prebuilt 40.2 GB/s (1.25x) 50.3 GB/s (1.00x) 49.2 GB/s (1.02x)
memmem/code/rust-library-common-fn rust/memchr/memmem/prebuilt 24.7 GB/s (1.18x) 29.2 GB/s (1.00x) 25.3 GB/s (1.16x)
memmem/code/rust-library-common-let rust/memchr/memmem/prebuilt 18.1 GB/s (1.11x) 20.0 GB/s (1.01x) 20.1 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash rust/memchr/memmem/prebuilt 38.5 GB/s (1.30x) 50.0 GB/s (1.00x) 39.6 GB/s (1.26x)
memmem/pathological/md5-huge-last-hash rust/memchr/memmem/prebuilt 38.0 GB/s (1.32x) 50.1 GB/s (1.00x) 49.3 GB/s (1.02x)
memmem/pathological/rare-repeated-huge-tricky rust/memchr/memmem/prebuilt 36.6 GB/s (1.72x) 62.8 GB/s (1.00x) 39.5 GB/s (1.59x)
memmem/pathological/defeat-simple-vector-alphabet rust/memchr/memmem/prebuilt 4.0 GB/s (1.33x) 5.4 GB/s (1.00x) 5.4 GB/s (1.00x)
memmem/sliceslice/seemingly-random rust/memchr/memmem/prebuilt 8.1 MB/s (1.17x) 9.5 MB/s (1.01x) 9.6 MB/s (1.00x)
memmem/sliceslice/i386 rust/memchr/memmem/prebuilt 42.7 MB/s (1.25x) 53.5 MB/s (1.00x) 36.5 MB/s (1.46x)
memmem/subtitles/common/huge-en-that rust/memchr/memmem/prebuilt 30.7 GB/s (1.22x) 37.4 GB/s (1.00x) 29.4 GB/s (1.27x)
memmem/subtitles/common/huge-ru-that rust/memchr/memmem/prebuilt 28.8 GB/s (1.26x) 36.2 GB/s (1.00x) 34.7 GB/s (1.04x)
memmem/subtitles/common/huge-ru-one-space rust/memchr/memmem/prebuilt 2.6 GB/s (1.00x) 2.3 GB/s (1.12x) 2.6 GB/s (1.00x)
memmem/subtitles/common/huge-zh-that rust/memchr/memmem/prebuilt 27.2 GB/s (1.40x) 38.2 GB/s (1.00x) 31.7 GB/s (1.21x)
memmem/subtitles/common/huge-zh-do-not rust/memchr/memmem/prebuilt 18.0 GB/s (1.14x) 20.4 GB/s (1.00x) 19.6 GB/s (1.04x)
memmem/subtitles/never/huge-en-john-watson rust/memchr/memmem/prebuilt 39.8 GB/s (1.55x) 61.6 GB/s (1.00x) 57.8 GB/s (1.07x)
memmem/subtitles/never/huge-en-all-common-bytes rust/memchr/memmem/prebuilt 41.1 GB/s (1.25x) 51.4 GB/s (1.00x) 46.0 GB/s (1.12x)
memmem/subtitles/never/huge-en-some-rare-bytes rust/memchr/memmem/prebuilt 40.2 GB/s (1.57x) 63.0 GB/s (1.00x) 51.7 GB/s (1.22x)
memmem/subtitles/never/huge-en-two-space rust/memchr/memmem/prebuilt 33.2 GB/s (1.89x) 62.8 GB/s (1.00x) 38.2 GB/s (1.65x)
memmem/subtitles/never/huge-ru-john-watson rust/memchr/memmem/prebuilt 39.2 GB/s (1.61x) 63.1 GB/s (1.00x) 56.6 GB/s (1.11x)
memmem/subtitles/never/huge-zh-john-watson rust/memchr/memmem/prebuilt 39.6 GB/s (1.43x) 56.6 GB/s (1.00x) 30.3 GB/s (1.87x)
memmem/subtitles/rare/huge-en-sherlock-holmes rust/memchr/memmem/prebuilt 39.3 GB/s (1.55x) 61.0 GB/s (1.00x) 54.7 GB/s (1.12x)
memmem/subtitles/rare/huge-en-sherlock rust/memchr/memmem/prebuilt 43.5 GB/s (1.31x) 54.2 GB/s (1.06x) 57.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-medium-needle rust/memchr/memmem/prebuilt 32.9 GB/s (1.68x) 52.7 GB/s (1.04x) 55.1 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes rust/memchr/memmem/prebuilt 38.7 GB/s (1.58x) 61.0 GB/s (1.00x) 56.3 GB/s (1.08x)
memmem/subtitles/rare/huge-ru-sherlock rust/memchr/memmem/prebuilt 39.5 GB/s (1.51x) 59.4 GB/s (1.00x) 56.9 GB/s (1.04x)
memmem/subtitles/rare/huge-zh-sherlock-holmes rust/memchr/memmem/prebuilt 35.9 GB/s (1.29x) 46.3 GB/s (1.00x) 29.6 GB/s (1.56x)
memmem/subtitles/rare/huge-zh-sherlock rust/memchr/memmem/prebuilt 35.5 GB/s (1.56x) 55.1 GB/s (1.00x) 47.2 GB/s (1.17x)
My interpretation there is that for some benchmarks, it's about on par with simpler-is-equal-raw
, but there are several benchmarks where it appears to regress back to the status quo. The hyperfine
benchmarks from the reproduction repository are already established to be very noisy, but if I look at the codegen there via perf
, I see this:
Interestingly, while the mov
instructions are gone, there are extra add
instructions in its place. I guess this makes sense to me since this does seem to require a few more adds/subs in the critical section.
So I think for now, I'm just going to go with the simpler is_equal_raw
and see how far that gets us. Thank you both again!
So to sum up, there is still one more mov
instruction in the critical path that wasn't there in memchr 2.5.0
. Since I can't seem to completely reproduce the problem in the OP, I'm not sure if removing one of the mov
instructions is enough to fix the regression. But I'm happy to start there for now.
It does look to me like the code itself could be improved. If there's a lot of register pressure forcing stack spilling, then that's not good. It would be nice to brainstorm ways of fixing that. (Ideally not by using inline assembly.)
This should be largely fixed in memchr 2.7.0
on crates.io. I welcome further improvements here.
Unfortunately, I'm not able to provide many details, but when upgrading from 2.5 to 2.6, my library, which does large file parsing via many
find
calls, saw a small performance regression. The flamegraph isn't entirely helpful because there appears to be different amounts of inlining and some of the functions have changed. If the core x86Finder::find
has not changed then this could just be compilation differences.Speed measurements here are relative to the 2.5 version, so that is why you see an error bar centered at zero. The general processing speed is between 300 and 500 megabytes per second for scale. Measurements were taken on a system using one core at a time, so no multi-threading. Three measurements were taken for each run.
Here is one example of a perf diff:
and another for the large gap labeled G above:
The scenario that G is in is that it's searching a huge number of bytes and never finding the needle in the haystack. So that is why the performance progression is so much worse for this test scenario. Note that this is also using a 5 byte
Finder
/needleAny advice in trying to track down or recreate this with a benchmark? I know that this report is somewhat unhelpful in diagnosing the actual issue, but I felt it better to at least post what information I can so that it's on the radar.
System info
Ec2 c6a.12xlarge