BurntSushi / memchr

Optimized string search routines for Rust.

`x86` performance regression `2.5.0` -> `2.6.0` #139

Closed: cgbur closed this 6 months ago

cgbur commented 8 months ago

Unfortunately, I'm not able to provide many details, but when upgrading from 2.5 to 2.6, my library, which does large file parsing via many find calls, saw a small performance regression. The flamegraph isn't entirely helpful because there appears to be different amounts of inlining and some of the functions have changed. If the core x86 Finder::find has not changed then this could just be compilation differences.

Speed measurements here are relative to the 2.5 version, so that is why you see an error bar centered at zero. The general processing speed is between 300 and 500 megabytes per second for scale. Measurements were taken on a system using one core at a time, so no multi-threading. Three measurements were taken for each run.

[two charts omitted: per-test speed relative to 2.5, with the regressions labeled, including the large gap G discussed below]

Here is one example of a perf diff:

❯ perf diff 25.data 26.data

               +7.11%  parser                [.] memchr::arch::x86_64::avx2::memchr::One::find_raw_avx2
               +1.27%  parser                [.] memchr::arch::x86_64::memchr::memchr_raw::find_avx2
               +1.18%  parser                [.] memchr::memmem::searcher::searcher_kind_one_byte
               +1.06%  parser                [.] memchr::arch::x86_64::avx2::memchr::One::find_raw
     8.36%     +0.82%  libc-2.26.so        [.] __memcmp_sse4_1
               +0.52%  parser                [.] memchr::arch::x86_64::avx2::packedpair::Finder::find_impl

and another for the large gap labeled G above:

❯ perf diff 25.data 26.data
              +78.62%  parser                [.] memchr::arch::x86_64::avx2::packedpair::Finder::find_impl
               +2.65%  parser                [.] memchr::arch::x86_64::avx2::memchr::One::find_raw_avx2
    13.27%     -0.29%  libc-2.26.so        [.] __memcpy_ssse3
               +0.11%  parser                [.] memchr::memmem::searcher::searcher_kind_avx2
    78.07%             parser                [.] memchr::memmem::x86::avx::std::Forward::find_impl
     2.83%             parser                [.] memchr::memchr::x86::avx::memchr
     0.24%             parser                [.] memchr::memmem::Finder::find

The scenario that G is in is that it's searching a huge number of bytes and never finding the needle in the haystack, which is why the performance regression is so much worse for this test scenario. Note that this is also using a 5-byte Finder/needle.
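
For reference, here is roughly the shape of that scenario as a hedged sketch; the needle bytes, the chunking, and the function name are placeholders, not the actual parser:

use memchr::memmem::Finder;

// Placeholder workload: a 5-byte needle that never occurs in any haystack,
// so every find call scans the entire input before returning None.
fn count_hits(haystacks: &[Vec<u8>]) -> usize {
    let finder = Finder::new(b"@@@@@"); // 5-byte needle, never present
    let mut hits = 0;
    for h in haystacks {
        if finder.find(h).is_some() {
            hits += 1;
        }
    }
    hits
}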

Any advice on how to track down or recreate this with a benchmark? I know that this report is somewhat unhelpful in diagnosing the actual issue, but I felt it better to at least post what information I can so that it's on the radar.

System info

Ec2 c6a.12xlarge

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7R13 Processor
BurntSushi commented 8 months ago

For context, the 2.6 release did a complete refactoring of all the SIMD code in this crate. The algorithms themselves didn't really change, but pretty much everything around them did. This was done to facilitate aarch64 and wasm32 support, as the previous infrastructure wasn't really amenable to supporting other targets easily. The unfortunate aspect of this refactoring is that there really is no simple way to compare the two. Your best bet is to look at the Assembly of the hottest code and see if there are any relevant codegen differences at that level. Unfortunately, I've found that it is sometimes the case that branches can get flipped around for seemingly unrelated changes in the code itself. So this may be a difficult thing to fix.

Now, I did do a pretty thorough benchmark comparison between the old and the new. You can even explore the results yourself with rebar in a checkout of this repo:

$ rebar cmp benchmarks/record/x86_64/2023-08-27.csv -e '^rust/memchr/memmem/prebuilt' -e '^rust/memchrold/memmem/prebuilt' -t 1.1
benchmark                                                   rust/memchr/memmem/prebuilt  rust/memchrold/memmem/prebuilt
---------                                                   ---------------------------  ------------------------------
memmem/code/rust-library-never-fn-strength                  37.7 GB/s (1.13x)            42.4 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux                      55.7 GB/s (1.00x)            38.7 GB/s (1.44x)
memmem/code/rust-library-rare-fn-from-str                   53.6 GB/s (1.00x)            40.3 GB/s (1.33x)
memmem/code/rust-library-common-fn-is-empty                 52.8 GB/s (1.00x)            38.3 GB/s (1.38x)
memmem/code/rust-library-common-fn                          28.3 GB/s (1.00x)            21.7 GB/s (1.30x)
memmem/code/rust-library-common-let                         19.6 GB/s (1.00x)            16.1 GB/s (1.22x)
memmem/pathological/md5-huge-no-hash                        50.3 GB/s (1.00x)            42.1 GB/s (1.20x)
memmem/pathological/rare-repeated-huge-tricky               63.4 GB/s (1.00x)            40.9 GB/s (1.55x)
memmem/pathological/rare-repeated-huge-match                1853.9 MB/s (1.00x)          1574.8 MB/s (1.18x)
memmem/pathological/rare-repeated-small-tricky              25.9 GB/s (1.00x)            22.7 GB/s (1.14x)
memmem/pathological/rare-repeated-small-match               1811.4 MB/s (1.00x)          1468.7 MB/s (1.23x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  1238.3 MB/s (1.00x)          1101.1 MB/s (1.12x)
memmem/sliceslice/short                                     6.53ms (1.00x)               14.65ms (2.24x)
memmem/sliceslice/seemingly-random                          11.3 MB/s (1.00x)            9.1 MB/s (1.24x)
memmem/subtitles/common/huge-en-you                         15.5 GB/s (1.00x)            10.9 GB/s (1.42x)
memmem/subtitles/common/huge-en-one-space                   1339.0 MB/s (1.12x)          1503.9 MB/s (1.00x)
memmem/subtitles/common/huge-ru-that                        37.1 GB/s (1.00x)            29.6 GB/s (1.25x)
memmem/subtitles/common/huge-zh-that                        26.0 GB/s (1.10x)            28.8 GB/s (1.00x)
memmem/subtitles/common/huge-zh-do-not                      17.2 GB/s (1.00x)            14.4 GB/s (1.19x)
memmem/subtitles/common/huge-zh-one-space                   4.7 GB/s (1.10x)             5.2 GB/s (1.00x)
memmem/subtitles/never/huge-en-john-watson                  63.6 GB/s (1.00x)            41.3 GB/s (1.54x)
memmem/subtitles/never/huge-en-all-common-bytes             52.7 GB/s (1.00x)            41.8 GB/s (1.26x)
memmem/subtitles/never/huge-en-some-rare-bytes              63.6 GB/s (1.00x)            43.0 GB/s (1.48x)
memmem/subtitles/never/huge-en-two-space                    62.2 GB/s (1.00x)            36.2 GB/s (1.72x)
memmem/subtitles/never/teeny-en-john-watson                 1780.2 MB/s (1.00x)          1161.0 MB/s (1.53x)
memmem/subtitles/never/teeny-en-all-common-bytes            1780.2 MB/s (1.00x)          1161.0 MB/s (1.53x)
memmem/subtitles/never/teeny-en-some-rare-bytes             1668.9 MB/s (1.00x)          1161.0 MB/s (1.44x)
memmem/subtitles/never/teeny-en-two-space                   1780.2 MB/s (1.00x)          1161.0 MB/s (1.53x)
memmem/subtitles/never/huge-ru-john-watson                  63.5 GB/s (1.00x)            34.3 GB/s (1.85x)
memmem/subtitles/never/teeny-ru-john-watson                 2.4 GB/s (1.00x)             1741.5 MB/s (1.44x)
memmem/subtitles/never/huge-zh-john-watson                  60.1 GB/s (1.00x)            42.4 GB/s (1.42x)
memmem/subtitles/never/teeny-zh-john-watson                 1970.9 MB/s (1.00x)          1285.4 MB/s (1.53x)
memmem/subtitles/rare/huge-en-sherlock                      60.1 GB/s (1.00x)            40.8 GB/s (1.47x)
memmem/subtitles/rare/huge-en-medium-needle                 54.1 GB/s (1.00x)            40.6 GB/s (1.33x)
memmem/subtitles/rare/huge-en-long-needle                   43.7 GB/s (1.00x)            2.5 GB/s (17.22x)
memmem/subtitles/rare/huge-en-huge-needle                   46.6 GB/s (1.00x)            2.3 GB/s (20.52x)
memmem/subtitles/rare/teeny-en-sherlock-holmes              1570.8 MB/s (1.00x)          1027.0 MB/s (1.53x)
memmem/subtitles/rare/teeny-en-sherlock                     1271.6 MB/s (1.00x)          953.7 MB/s (1.33x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               63.4 GB/s (1.00x)            41.9 GB/s (1.51x)
memmem/subtitles/rare/huge-ru-sherlock                      61.9 GB/s (1.00x)            40.6 GB/s (1.52x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes              2.1 GB/s (1.00x)             1430.5 MB/s (1.47x)
memmem/subtitles/rare/teeny-ru-sherlock                     1602.2 MB/s (1.00x)          1251.7 MB/s (1.28x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               37.0 GB/s (1.14x)            42.3 GB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock                     1137.1 MB/s (1.00x)          895.9 MB/s (1.27x)

As you can see, the new code is faster in the vast majority of cases, but there are some cases where the old code is seemingly a touch faster. The differences are overall pretty small, though. The other thing to note here is that improvements in one case can lead to regressions in another.

I will note that I've done all my benchmarking for x86_64 on Intel CPUs, whereas you seem to have an AMD CPU, so the differences may not line up.

Eyeballing it, I'd say all of your perf differences except for G look "close enough" to be within noise or are otherwise acceptable from my perspective.

I only really see two paths forward here. Either you can get me a repro and I can look into it, or you'll have to dive in and look into it yourself. I'm not really sure there's another way. Perf regressions like this really really need a repro of some kind. I know it can be a lot of work to produce one.

cgbur commented 8 months ago

Ok I have made a tiny reproducible example:

cargo new --bin bench-memchr
cd bench-memchr
curl -o frankenstein.txt https://www.gutenberg.org/files/84/84-0.txt

Cargo.toml

[package]
name = "bench-memchr"
version = "0.1.0"
edition = "2021"

[dependencies]
memchr = "=2.6.4" # or =2.5.0 when testing that version

main.rs

use memchr::memmem::Finder;
use std::{fs::File, io::Read, time::Instant};

const NEEDLE: &[u8] = b"Burnt";
const NUM_ITERATIONS: usize = 100000;
const FILE_NAME: &str = "frankenstein.txt";

fn main() {
    let mut file = File::open(FILE_NAME).unwrap();
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer).unwrap();

    let finder = Finder::new(NEEDLE);

    let mut count = 0;
    let start = Instant::now();
    for _ in 0..NUM_ITERATIONS {
        count += search(&buffer, &finder);
    }
    let end = start.elapsed();

    println!("Found {} matches in {:?}", count, end);
    let bytes_per_sec = (buffer.len() * NUM_ITERATIONS) as f64 / end.as_secs_f64();
    println!("{} GiB/s", bytes_per_sec / 1024.0 / 1024.0 / 1024.0);
}

#[inline(never)]
fn search(buf: &[u8], finder: &Finder) -> usize {
    let mut count = 0;
    for _ in finder.find_iter(buf) {
        count += 1;
    }
    count
}

And running this:

❯ hyperfine './memchr-2.5.0' './memchr-2.6.4'
Benchmark 1: ./memchr-2.5.0
  Time (mean ± σ):     954.0 ms ±  23.6 ms    [User: 952.9 ms, System: 1.1 ms]
  Range (min … max):   926.2 ms … 1005.8 ms    10 runs

Benchmark 2: ./memchr-2.6.4
  Time (mean ± σ):      1.296 s ±  0.032 s    [User: 1.295 s, System: 0.001 s]
  Range (min … max):    1.262 s …  1.343 s    10 runs

Summary
  ./memchr-2.5.0 ran
    1.36 ± 0.05 times faster than ./memchr-2.6.4

The reported speed difference is ~32 GiB/s (2.6.4) vs ~42 GiB/s (2.5.0).

I tried looking at the assembly outputs, but I'm struggling a bit to know what I'm looking at or looking for.

Here are the two flamegraphs for 2.5.0 and 2.6.4 (attached as images, not reproduced here).

BurntSushi commented 8 months ago

Thanks! I'll take a look later.

For Assembly, the best way in my experience is to use perf on Linux. It should be pretty clear where the hotspot is in functions like memchr::arch::x86_64::avx2::packedpair::Finder::find_impl. Look for the SIMD instructions starting with v. That should be where most of the time is being spent. Then compare those with the instructions used in memchr 2.5.0.

But I'll take a look now that I have a repro. Just not sure when. Thank you!

cgbur commented 8 months ago

2.5

       │280:┌─→add          $0x20,%rcx                                                                                                  ▒
  7.60 │    │  cmp          %rsi,%rcx                                                                                                   ▒
  3.78 │    │↑ ja           123                                                                                                         ▒
       │28d:│  vpcmpeqb     (%rcx,%rdx,1),%ymm0,%ymm2                                                                                   ▒
  9.68 │    │  vpcmpeqb     (%rcx,%rdi,1),%ymm1,%ymm3                                                                                   ▒
  6.71 │    │  vpand        %ymm2,%ymm3,%ymm2                                                                                           ▒
  3.78 │    │  vpmovmskb    %ymm2,%r12d                                                                                                 ▒
  1.37 │    │  test         %r12d,%r12d                                                                                                 ▒
 59.11 │    └──je           280                    

2.6

  0.01 │420:┌─→add        $0x20,%rcx                                                                                                    ▒
  2.13 │    │  cmp        %rdx,%rcx                                                                                                     ▒
  0.02 │    │↑ ja         10b                                                                                                           ▒
       │42d:│  mov        0x10(%rsp),%rax                                                                                               ▒
  4.02 │    │  vpcmpeqb   (%rcx,%rax,1),%ymm0,%ymm2                                                                                     ▒
  9.59 │    │  vpcmpeqb   (%rcx,%rbx,1),%ymm1,%ymm3                                                                                     ▒
  3.62 │    │  vpand      %ymm2,%ymm3,%ymm2                                                                                             ◆
  2.68 │    │  vpmovmskb  %ymm2,%ebp                                                                                                    ▒
  0.44 │    │  test       %ebp,%ebp                                                                                                     ▒
 69.99 │    └──je         420                           

This appears to be the hot loop. An extra mov instruction snuck its way in; I'm not sure what that's for. Unfortunately this EC2 machine does not let me see perf counters, so I can't look at branch prediction, etc. I can try to recreate this later on my own hardware, and then I'd have more info. This is about as deep an analysis as I'll do without more info because I'd just be guessing!

BurntSushi commented 8 months ago

In my case, there is a regression, but it is much smaller:

$ hyperfine './target/release/i139-2.5.0' './target/release/i139-2.6.4'
Benchmark 1: ./target/release/i139-2.5.0
  Time (mean ± σ):      1.065 s ±  0.049 s    [User: 1.064 s, System: 0.001 s]
  Range (min … max):    1.009 s …  1.196 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./target/release/i139-2.6.4
  Time (mean ± σ):      1.128 s ±  0.014 s    [User: 1.127 s, System: 0.000 s]
  Range (min … max):    1.109 s …  1.142 s    10 runs

Summary
  './target/release/i139-2.5.0' ran
    1.06 ± 0.05 times faster than './target/release/i139-2.6.4'

If I look at the codegen, I can indeed see an extra mov instruction. Actually, I see two of them. In 2.5.0, the hot loop contains 9 instructions, whereas in 2.6.4 it has 11 instructions (2 extra movs). For 2.5.0:

[codegen screenshot for 2.5.0 omitted]

And 2.6.4:

[codegen screenshot for 2.6.4 omitted]

The extent to which these extra instructions actually impact perf can be highly variable. As can be seen here, we are seeing two very different scales of regression. When I re-arranged the code, I do recall noticing the extra instructions (because I compared the codegen), but concluded that it was okay because the benchmarks didn't seem to suggest it mattered much. But perhaps that is not always the case. I don't know if the extra mov instructions are the culprit here or not, but they're the only obvious difference I can see between 2.5.0 and 2.6.4 on this particular benchmark.

The unfortunate bit is that getting rid of those extra mov instructions is a bit of a black art... Probably because I don't really understand why they show up there in the first place. We probably need an llvm expert to unravel that mystery. Anyway, I'll take a crack at trying to adjust the code to get rid of the mov instructions.

cgbur commented 8 months ago

I just tested on my machine:

AMD Ryzen 9 5900X 12-Core Processor

and can confirm the performance regression remains:

❯ hyperfine "./memchr-2.5.0" "./memchr-2.6.4" 
Benchmark 1: ./memchr-2.5.0
  Time (mean ± σ):     758.6 ms ±  14.0 ms    [User: 756.4 ms, System: 1.8 ms]
  Range (min … max):   730.5 ms … 779.8 ms    10 runs

Benchmark 2: ./memchr-2.6.4
  Time (mean ± σ):     915.3 ms ±   9.3 ms    [User: 914.0 ms, System: 0.9 ms]
  Range (min … max):   905.8 ms … 932.6 ms    10 runs

Summary
  ./memchr-2.5.0 ran
    1.21 ± 0.03 times faster than ./memchr-2.6.4

Using https://github.com/andrewrk/poop, you can see that it appears to mostly be executing more instructions, so it seems this has nothing to do with cache/branch misses.

❯ poop "./memchr-2.5.0" "./memchr-2.6.4" -d 20000
Benchmark 1 (26 runs): ./memchr-2.5.0
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           774ms ± 14.9ms     738ms …  797ms          0 ( 0%)        0%
  peak_rss           2.36MB ±    0      2.36MB … 2.36MB          0 ( 0%)        0%
  cpu_cycles         3.54G  ± 53.1M     3.41G  … 3.61G           0 ( 0%)        0%
  instructions       12.9G  ±  324      12.9G  … 12.9G           0 ( 0%)        0%
  cache_references   1.03G  ±  101M      763M  … 1.16G           1 ( 4%)        0%
  cache_misses       20.1M  ± 4.45M     11.3M  … 30.3M           0 ( 0%)        0%
  branch_misses      14.0M  ± 84.2K     13.9M  … 14.1M           0 ( 0%)        0%
Benchmark 2 (22 runs): ./memchr-2.6.4
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           924ms ± 25.2ms     867ms …  973ms          1 ( 5%)        💩+ 19.3% ±  1.5%
  peak_rss           2.36MB ±    0      2.36MB … 2.36MB          0 ( 0%)          -  0.0% ±  0.0%
  cpu_cycles         4.22G  ± 88.4M     4.10G  … 4.48G           0 ( 0%)        💩+ 19.5% ±  1.2%
  instructions       15.7G  ±  400      15.7G  … 15.7G           1 ( 5%)        💩+ 21.7% ±  0.0%
  cache_references    951M  ± 98.2M      727M  … 1.13G           0 ( 0%)        ⚡-  7.4% ±  5.7%
  cache_misses       19.4M  ± 4.97M     11.5M  … 31.0M           0 ( 0%)          -  3.4% ± 13.7%
  branch_misses      13.8M  ± 78.8K     13.7M  … 14.0M           1 ( 5%)        ⚡-  1.5% ±  0.3%
BurntSushi commented 8 months ago

Ug. Right. I had this in the old code:

https://github.com/BurntSushi/memchr/blob/c55597192ead1896c5c7cad2632f01d57ab41da0/src/memchr/x86/avx.rs#L23-L63

But it's not present in the new code:

https://github.com/BurntSushi/memchr/blob/655017102a13bcfa2488cfaed8a22520974908c7/src/arch/generic/memchr.rs#L175-L207

I didn't get rid of that by accident. It turns out that replicating it in the new code is extremely difficult because the new code is generic. There is no way to write a generic #[target_feature(...)] attribute or forward it, so I can't write that inner never-inlined function. But I can at least experiment with it in the x86-64 case and see if adding it back makes things better.
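
For illustration, this is the non-generic shape that is possible (and that old, per-target code could use): a cold, out-of-line helper compiled with the same target features as the hot loop. This is a hedged sketch, not the crate's actual code; the function name and the offset computation are illustrative only.

#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m256i, _mm256_movemask_epi8};

// Cold path: only entered once the hot loop has seen a candidate match, so
// keeping it out of line keeps the hot loop's code (and register usage)
// small. Because this function is concrete rather than generic, the
// target_feature attribute can be written directly; with a generic Vector
// type parameter there is no way to choose this attribute per instantiation.
#[cfg(target_arch = "x86_64")]
#[cold]
#[inline(never)]
#[target_feature(enable = "avx2")]
unsafe fn matched_offset(start: *const u8, cur: *const u8, eq: __m256i) -> usize {
    let mask = _mm256_movemask_epi8(eq) as u32;
    cur as usize - start as usize + mask.trailing_zeros() as usize
}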

...

OK, so I added this:

        #[cold]
        #[inline(never)]
        #[target_feature(enable = "sse2", enable = "avx2")]
        unsafe fn matched<V: Vector>(
            cur: *const u8,
            eqa: V,
            eqb: V,
            eqc: V,
            eqd: V,
        ) -> *const u8 {
            let topos = V::Mask::first_offset;

            let mask = eqa.movemask();
            if mask.has_non_zero() {
                return cur.add(topos(mask));
            }

            let mask = eqb.movemask();
            if mask.has_non_zero() {
                return cur.add(1 * V::BYTES).add(topos(mask));
            }

            let mask = eqc.movemask();
            if mask.has_non_zero() {
                return cur.add(2 * V::BYTES).add(topos(mask));
            }

            let mask = eqd.movemask();
            debug_assert!(mask.has_non_zero());
            return cur.add(3 * V::BYTES).add(topos(mask));
        }

And changed the inner check to:

                if or3.movemask_will_have_non_zero() {
                    return Some(matched(cur, eqa, eqb, eqc, eqd));
                }

But that doesn't seem to help. Indeed, it looks like matched is still getting inlined, probably because it's getting monomorphized? Not sure.

I also tried std::intrinsics::unlikely (unstable) and that didn't help.

I just unfortunately don't really know how to get rid of those extra instructions.

BurntSushi commented 8 months ago

For anyone who might be willing to help, I've created a bit more of a refined reproduction of the issue that shows the codegen problem here: https://github.com/BurntSushi/memchr-2.6-mov-regression

Any help would be appreciated!

tavianator commented 8 months ago

This seems to be a register pressure issue. For me, LLVM has hoisted this load of index2, but spills it to the stack as 0x10(%rsp) instead of keeping it in a register: https://github.com/BurntSushi/memchr/blob/2.6.4/src/arch/generic/packedpair.rs#L237

tavianator commented 8 months ago

The following disgusting patch gets rid of the extra mov for me. Perhaps it can inspire a non-disgusting patch. The idea is that rather than keeping {cur, index1, index2} all live at once, you pre-compute (cur1, cur2) = (cur.add(index1), cur.add(index2)) and can then forget about the indices.

diff --git a/src/arch/generic/packedpair.rs b/src/arch/generic/packedpair.rs
index 8d97cf2..2667cc8 100644
--- a/src/arch/generic/packedpair.rs
+++ b/src/arch/generic/packedpair.rs
@@ -93,7 +93,6 @@ impl<V: Vector> Finder<V> {
         let start = haystack.as_ptr();
         let end = start.add(haystack.len());
         let max = end.sub(self.min_haystack_len);
-        let mut cur = start;

         // N.B. I did experiment with unrolling the loop to deal with size(V)
         // bytes at a time and 2*size(V) bytes at a time. The double unroll
@@ -101,14 +100,21 @@ impl<V: Vector> Finder<V> {
         // slower. In the end, I decided the complexity from unrolling wasn't
         // worth it. I used the memmem/krate/prebuilt/huge-en/ benchmarks to
         // compare.
-        while cur <= max {
-            if let Some(chunki) = self.find_in_chunk(needle, cur, end, all) {
-                return Some(matched(start, cur, chunki));
+        let index1 = usize::from(self.pair.index1());
+        let index2 = usize::from(self.pair.index2());
+        let mut cur1 = start.add(index1);
+        let mut cur2 = start.add(index2);
+        let end1 = end.add(index1);
+        let max1 = max.add(index1);
+        while cur1 <= max1 {
+            if let Some(chunki) = self.find_in_chunk(needle, cur1, cur2, end1, all) {
+                return Some(matched(start, cur1.sub(index1), chunki));
             }
-            cur = cur.add(V::BYTES);
+            cur1 = cur1.add(V::BYTES);
+            cur2 = cur2.add(V::BYTES);
         }
-        if cur < end {
-            let remaining = end.distance(cur);
+        if cur1 < end1 {
+            let remaining = end1.distance(cur1);
             debug_assert!(
                 remaining < self.min_haystack_len,
                 "remaining bytes should be smaller than the minimum haystack \
@@ -120,10 +126,10 @@ impl<V: Vector> Finder<V> {
                 return None;
             }
             debug_assert!(
-                max < cur,
+                max1 < cur1,
                 "after main loop, cur should have exceeded max",
             );
-            let overlap = cur.distance(max);
+            let overlap = cur1.distance(max1);
             debug_assert!(
                 overlap > 0,
                 "overlap ({}) must always be non-zero",
@@ -140,10 +146,11 @@ impl<V: Vector> Finder<V> {
             // occur in find_in_chunk within the overlap are automatically
             // ignored.
             let mask = V::Mask::all_zeros_except_least_significant(overlap);
-            cur = max;
-            let m = self.find_in_chunk(needle, cur, end, mask);
+            cur1 = max1;
+            cur2 = max1.sub(index1).add(index2);
+            let m = self.find_in_chunk(needle, cur1, cur2, end1, mask);
             if let Some(chunki) = m {
-                return Some(matched(start, cur, chunki));
+                return Some(matched(start, cur1.sub(index1), chunki));
             }
         }
         None
@@ -229,25 +236,24 @@ impl<V: Vector> Finder<V> {
     unsafe fn find_in_chunk(
         &self,
         needle: &[u8],
-        cur: *const u8,
-        end: *const u8,
+        cur1: *const u8,
+        cur2: *const u8,
+        end1: *const u8,
         mask: V::Mask,
     ) -> Option<usize> {
-        let index1 = usize::from(self.pair.index1());
-        let index2 = usize::from(self.pair.index2());
-        let chunk1 = V::load_unaligned(cur.add(index1));
-        let chunk2 = V::load_unaligned(cur.add(index2));
+        let chunk1 = V::load_unaligned(cur1);
+        let chunk2 = V::load_unaligned(cur2);
         let eq1 = chunk1.cmpeq(self.v1);
         let eq2 = chunk2.cmpeq(self.v2);

         let mut offsets = eq1.and(eq2).movemask().and(mask);
         while offsets.has_non_zero() {
             let offset = offsets.first_offset();
-            let cur = cur.add(offset);
-            if end.sub(needle.len()) < cur {
+            let cur1 = cur1.add(offset);
+            if end1.sub(needle.len()) < cur1 {
                 return None;
             }
-            if is_equal_raw(needle.as_ptr(), cur, needle.len()) {
+            if is_equal_raw(needle.as_ptr(), cur1.sub(self.pair.index1() as usize), needle.len()) {
                 return Some(offset);
             }
             offsets = offsets.clear_least_significant_bit();
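
To make the transformation concrete outside the crate's internals, here is a minimal byte-at-a-time sketch of the same idea. The names and the scalar loop are illustrative only; the real code works on SIMD-sized chunks, but the register-pressure point is identical: two pre-offset pointers advanced in lockstep instead of one cursor plus two indices kept live across the loop.

// Find the first position i where haystack[i + off1] == b1 and
// haystack[i + off2] == b2, assuming off1 <= off2. After the setup code the
// indices are dead; only the two pointers survive into the loop.
fn find_pair(haystack: &[u8], off1: usize, off2: usize, b1: u8, b2: u8) -> Option<usize> {
    assert!(off1 <= off2 && off2 <= haystack.len());
    unsafe {
        let start = haystack.as_ptr();
        let end = start.add(haystack.len());
        let mut cur1 = start.add(off1); // pre-offset pointers
        let mut cur2 = start.add(off2);
        while cur2 < end {
            if *cur1 == b1 && *cur2 == b2 {
                return Some(cur1 as usize - start as usize - off1);
            }
            cur1 = cur1.add(1);
            cur2 = cur2.add(1);
        }
        None
    }
}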
BurntSushi commented 8 months ago

@tavianator Lovely!!! I'm totally fine with that level of disgusting for stuff like this. I'll take a closer look later. Thank you!

jhorstmann commented 8 months ago

Nice, I was looking into this too and came to the same conclusion about register pressure. I was able to get one of the movs removed by making find_in_chunk a free function, taking index1/index2/v1/v2 as parameters instead, but that seemed hacky.

Maybe also of interest: simpler code in is_equal_raw led to one fewer mov as well, but this certainly needs more benchmarking. On my Tiger Lake machine, the following version made the throughput on the example go from 30 GiB/s to 46 GiB/s:

pub unsafe fn is_equal_raw(
    mut x: *const u8,
    mut y: *const u8,
    mut n: usize,
) -> bool {
    while n >= 4 {
        let vx = x.cast::<u32>().read_unaligned();
        let vy = y.cast::<u32>().read_unaligned();
        if vx != vy {
            return false;
        }
        x = x.add(4);
        y = y.add(4);
        n -= 4;
    }
    if n >= 2 {
        if x.cast::<u16>().read_unaligned() != y.cast::<u16>().read_unaligned() {
            return false;
        }
        x = x.add(2);
        y = y.add(2);
        n -= 2;
    }
    if n > 0 {
        if x.read() != y.read() {
            return false;
        }
    }
    true
}

There's a Compiler Explorer link comparing the assembly for both versions.
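
As a quick sanity check on a routine like this (not from the thread; it assumes the is_equal_raw sketch above is in scope), one can compare it against ordinary slice equality on prefixes of equal and nearly-equal buffers:

fn main() {
    let a = b"Hello, memchr! 0123456789";
    let b = b"Hello, memchr! 0123456789";
    let c = b"Hello, memchr! 012345678X"; // differs from `a` only in the last byte
    for n in 0..=a.len() {
        unsafe {
            assert_eq!(is_equal_raw(a.as_ptr(), b.as_ptr(), n), &a[..n] == &b[..n]);
            assert_eq!(is_equal_raw(a.as_ptr(), c.as_ptr(), n), &a[..n] == &c[..n]);
        }
    }
    println!("is_equal_raw agrees with slice equality");
}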

tavianator commented 8 months ago

> Nice, I was looking into this too and came to the same conclusion about register pressure. I was able to get one of the movs removed by making find_in_chunk a free function, taking index1/index2/v1/v2 as parameters instead, but that seemed hacky.

Huh, I thought I tried that and it didn't help. Maybe it's compiler-dependent; I hadn't updated rustc in a bit.

BurntSushi commented 6 months ago

All righty, I've finally had time to swing back around to this. Thank you so much @tavianator and @jhorstmann for the investigation and patches. :-)

The first thing I tried was the simpler is_equal_raw:

$ rebar diff -e memchr/memmem tmp/base.csv tmp/simpler-is-equal-raw.csv -t 1.1
benchmark                                          engine                       tmp/base.csv       tmp/simpler-is-equal-raw.csv
---------                                          ------                       ------------       ----------------------------
memmem/code/rust-library-never-fn-strength         rust/memchr/memmem/prebuilt  40.6 GB/s (1.24x)  50.2 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren   rust/memchr/memmem/prebuilt  39.4 GB/s (1.30x)  51.3 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux             rust/memchr/memmem/prebuilt  41.0 GB/s (1.25x)  51.4 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str          rust/memchr/memmem/prebuilt  38.8 GB/s (1.35x)  52.4 GB/s (1.00x)
memmem/code/rust-library-common-fn-is-empty        rust/memchr/memmem/prebuilt  40.2 GB/s (1.25x)  50.3 GB/s (1.00x)
memmem/code/rust-library-common-fn                 rust/memchr/memmem/prebuilt  24.7 GB/s (1.18x)  29.2 GB/s (1.00x)
memmem/code/rust-library-common-let                rust/memchr/memmem/prebuilt  18.1 GB/s (1.10x)  20.0 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash               rust/memchr/memmem/prebuilt  38.5 GB/s (1.30x)  50.0 GB/s (1.00x)
memmem/pathological/md5-huge-last-hash             rust/memchr/memmem/prebuilt  38.0 GB/s (1.32x)  50.1 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky      rust/memchr/memmem/prebuilt  36.6 GB/s (1.72x)  62.8 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-alphabet  rust/memchr/memmem/prebuilt  4.0 GB/s (1.33x)   5.4 GB/s (1.00x)
memmem/sliceslice/seemingly-random                 rust/memchr/memmem/prebuilt  8.1 MB/s (1.17x)   9.5 MB/s (1.00x)
memmem/sliceslice/i386                             rust/memchr/memmem/prebuilt  42.7 MB/s (1.25x)  53.5 MB/s (1.00x)
memmem/subtitles/common/huge-en-that               rust/memchr/memmem/prebuilt  30.7 GB/s (1.22x)  37.4 GB/s (1.00x)
memmem/subtitles/common/huge-ru-that               rust/memchr/memmem/prebuilt  28.8 GB/s (1.26x)  36.2 GB/s (1.00x)
memmem/subtitles/common/huge-ru-one-space          rust/memchr/memmem/prebuilt  2.6 GB/s (1.00x)   2.3 GB/s (1.12x)
memmem/subtitles/common/huge-zh-that               rust/memchr/memmem/prebuilt  27.2 GB/s (1.40x)  38.2 GB/s (1.00x)
memmem/subtitles/common/huge-zh-do-not             rust/memchr/memmem/prebuilt  18.0 GB/s (1.14x)  20.4 GB/s (1.00x)
memmem/subtitles/never/huge-en-john-watson         rust/memchr/memmem/prebuilt  39.8 GB/s (1.55x)  61.6 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes    rust/memchr/memmem/prebuilt  41.1 GB/s (1.25x)  51.4 GB/s (1.00x)
memmem/subtitles/never/huge-en-some-rare-bytes     rust/memchr/memmem/prebuilt  40.2 GB/s (1.57x)  63.0 GB/s (1.00x)
memmem/subtitles/never/huge-en-two-space           rust/memchr/memmem/prebuilt  33.2 GB/s (1.89x)  62.8 GB/s (1.00x)
memmem/subtitles/never/huge-ru-john-watson         rust/memchr/memmem/prebuilt  39.2 GB/s (1.61x)  63.1 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson         rust/memchr/memmem/prebuilt  39.6 GB/s (1.43x)  56.6 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes      rust/memchr/memmem/prebuilt  39.3 GB/s (1.55x)  61.0 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock             rust/memchr/memmem/prebuilt  43.5 GB/s (1.25x)  54.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-medium-needle        rust/memchr/memmem/prebuilt  32.9 GB/s (1.60x)  52.7 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes      rust/memchr/memmem/prebuilt  38.7 GB/s (1.58x)  61.0 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock             rust/memchr/memmem/prebuilt  39.5 GB/s (1.51x)  59.4 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes      rust/memchr/memmem/prebuilt  35.9 GB/s (1.29x)  46.3 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock             rust/memchr/memmem/prebuilt  35.5 GB/s (1.56x)  55.1 GB/s (1.00x)

And for the OP's issue using the reproduction repository I made:

$ hyperfine -w5 ./target/release/memchr-2.5.0 ./target/release/memchr-2.6.4 ./target/release/simpler-is-equal-raw
Benchmark 1: ./target/release/memchr-2.5.0
  Time (mean ± σ):      1.103 s ±  0.098 s    [User: 1.102 s, System: 0.001 s]
  Range (min … max):    1.006 s …  1.260 s    10 runs

Benchmark 2: ./target/release/memchr-2.6.4
  Time (mean ± σ):      1.096 s ±  0.092 s    [User: 1.095 s, System: 0.001 s]
  Range (min … max):    1.031 s …  1.286 s    10 runs

Benchmark 3: ./target/release/simpler-is-equal-raw
  Time (mean ± σ):      1.035 s ±  0.134 s    [User: 1.035 s, System: 0.000 s]
  Range (min … max):    0.798 s …  1.178 s    10 runs

Summary
  ./target/release/simpler-is-equal-raw ran
    1.06 ± 0.16 times faster than ./target/release/memchr-2.6.4
    1.07 ± 0.17 times faster than ./target/release/memchr-2.5.0

And indeed, looking at the codegen, I can see that one of the mov instructions is now gone. Sweet. Although I find it interesting that memchr-2.5.0 and memchr-2.6.4 are virtually the same speed here, despite the fact that simpler-is-equal-raw has one fewer mov than memchr-2.6.4 and one more mov than memchr-2.5.0. Perhaps the other changes to is_equal_raw are having an impact here. OK, so what happens if I run the hyperfine command again?

$ hyperfine -w5 ./target/release/memchr-2.5.0 ./target/release/memchr-2.6.4 ./target/release/simpler-is-equal-raw
Benchmark 1: ./target/release/memchr-2.5.0
  Time (mean ± σ):      1.071 s ±  0.068 s    [User: 1.070 s, System: 0.001 s]
  Range (min … max):    1.011 s …  1.258 s    10 runs

Benchmark 2: ./target/release/memchr-2.6.4
  Time (mean ± σ):      1.058 s ±  0.019 s    [User: 1.057 s, System: 0.001 s]
  Range (min … max):    1.030 s …  1.093 s    10 runs

Benchmark 3: ./target/release/simpler-is-equal-raw
  Time (mean ± σ):      1.140 s ±  0.118 s    [User: 1.139 s, System: 0.001 s]
  Range (min … max):    0.918 s …  1.382 s    10 runs

Summary
  ./target/release/memchr-2.6.4 ran
    1.01 ± 0.07 times faster than ./target/release/memchr-2.5.0
    1.08 ± 0.11 times faster than ./target/release/simpler-is-equal-raw

OK, so it's just noise. Sigh. What about the rebar benchmarks?

$ rebar diff -e memchr/memmem tmp/base.csv tmp/simpler-is-equal-raw.csv tmp/simpler-is-equal-raw-2.csv -t 1.1
benchmark                                          engine                       tmp/base.csv       tmp/simpler-is-equal-raw.csv  tmp/simpler-is-equal-raw-2.csv
---------                                          ------                       ------------       ----------------------------  ------------------------------
memmem/code/rust-library-never-fn-strength         rust/memchr/memmem/prebuilt  40.6 GB/s (1.27x)  50.2 GB/s (1.02x)             51.4 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren   rust/memchr/memmem/prebuilt  39.4 GB/s (1.33x)  51.3 GB/s (1.02x)             52.3 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux             rust/memchr/memmem/prebuilt  41.0 GB/s (1.25x)  51.4 GB/s (1.00x)             37.0 GB/s (1.39x)
memmem/code/rust-library-rare-fn-from-str          rust/memchr/memmem/prebuilt  38.8 GB/s (1.35x)  52.4 GB/s (1.00x)             50.8 GB/s (1.03x)
memmem/code/rust-library-common-fn-is-empty        rust/memchr/memmem/prebuilt  40.2 GB/s (1.26x)  50.3 GB/s (1.00x)             50.4 GB/s (1.00x)
memmem/code/rust-library-common-fn                 rust/memchr/memmem/prebuilt  24.7 GB/s (1.19x)  29.2 GB/s (1.01x)             29.5 GB/s (1.00x)
memmem/code/rust-library-common-let                rust/memchr/memmem/prebuilt  18.1 GB/s (1.15x)  20.0 GB/s (1.04x)             20.8 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash               rust/memchr/memmem/prebuilt  38.5 GB/s (1.30x)  50.0 GB/s (1.00x)             49.8 GB/s (1.00x)
memmem/pathological/md5-huge-last-hash             rust/memchr/memmem/prebuilt  38.0 GB/s (1.33x)  50.1 GB/s (1.00x)             50.3 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky      rust/memchr/memmem/prebuilt  36.6 GB/s (1.72x)  62.8 GB/s (1.00x)             62.6 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-alphabet  rust/memchr/memmem/prebuilt  4.0 GB/s (1.33x)   5.4 GB/s (1.00x)              5.4 GB/s (1.00x)
memmem/sliceslice/seemingly-random                 rust/memchr/memmem/prebuilt  8.1 MB/s (1.19x)   9.5 MB/s (1.02x)              9.7 MB/s (1.00x)
memmem/sliceslice/i386                             rust/memchr/memmem/prebuilt  42.7 MB/s (1.27x)  53.5 MB/s (1.01x)             54.1 MB/s (1.00x)
memmem/subtitles/common/huge-en-that               rust/memchr/memmem/prebuilt  30.7 GB/s (1.22x)  37.4 GB/s (1.00x)             37.6 GB/s (1.00x)
memmem/subtitles/common/huge-en-you                rust/memchr/memmem/prebuilt  14.7 GB/s (1.06x)  15.6 GB/s (1.00x)             12.4 GB/s (1.25x)
memmem/subtitles/common/huge-ru-that               rust/memchr/memmem/prebuilt  28.8 GB/s (1.27x)  36.2 GB/s (1.01x)             36.5 GB/s (1.00x)
memmem/subtitles/common/huge-ru-one-space          rust/memchr/memmem/prebuilt  2.6 GB/s (1.00x)   2.3 GB/s (1.12x)              2.6 GB/s (1.00x)
memmem/subtitles/common/huge-zh-that               rust/memchr/memmem/prebuilt  27.2 GB/s (1.40x)  38.2 GB/s (1.00x)             38.2 GB/s (1.00x)
memmem/subtitles/common/huge-zh-do-not             rust/memchr/memmem/prebuilt  18.0 GB/s (1.14x)  20.4 GB/s (1.00x)             17.5 GB/s (1.17x)
memmem/subtitles/never/huge-en-john-watson         rust/memchr/memmem/prebuilt  39.8 GB/s (1.58x)  61.6 GB/s (1.02x)             62.8 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes    rust/memchr/memmem/prebuilt  41.1 GB/s (1.25x)  51.4 GB/s (1.00x)             46.3 GB/s (1.11x)
memmem/subtitles/never/huge-en-some-rare-bytes     rust/memchr/memmem/prebuilt  40.2 GB/s (1.57x)  63.0 GB/s (1.00x)             62.8 GB/s (1.00x)
memmem/subtitles/never/huge-en-two-space           rust/memchr/memmem/prebuilt  33.2 GB/s (1.89x)  62.8 GB/s (1.00x)             51.0 GB/s (1.23x)
memmem/subtitles/never/huge-ru-john-watson         rust/memchr/memmem/prebuilt  39.2 GB/s (1.61x)  63.1 GB/s (1.00x)             60.1 GB/s (1.05x)
memmem/subtitles/never/huge-zh-john-watson         rust/memchr/memmem/prebuilt  39.6 GB/s (1.47x)  56.6 GB/s (1.03x)             58.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes      rust/memchr/memmem/prebuilt  39.3 GB/s (1.59x)  61.0 GB/s (1.03x)             62.7 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock             rust/memchr/memmem/prebuilt  43.5 GB/s (1.25x)  54.2 GB/s (1.00x)             53.2 GB/s (1.02x)
memmem/subtitles/rare/huge-en-medium-needle        rust/memchr/memmem/prebuilt  32.9 GB/s (1.60x)  52.7 GB/s (1.00x)             52.7 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes      rust/memchr/memmem/prebuilt  38.7 GB/s (1.60x)  61.0 GB/s (1.02x)             62.1 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock             rust/memchr/memmem/prebuilt  39.5 GB/s (1.51x)  59.4 GB/s (1.00x)             59.0 GB/s (1.01x)
memmem/subtitles/rare/huge-zh-sherlock-holmes      rust/memchr/memmem/prebuilt  35.9 GB/s (1.53x)  46.3 GB/s (1.19x)             54.8 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock             rust/memchr/memmem/prebuilt  35.5 GB/s (1.56x)  55.1 GB/s (1.00x)             54.7 GB/s (1.01x)

OK, so at least according to the rebar benchmarks, the win seems pretty solid. There's still a bit of noise, but the simpler-is-equal-raw measurements are solidly better than the status quo.

Now let's move on to @tavianator's patch (keeping the simpler and apparently faster is_equal_raw routine). I applied the patch and re-ran the rebar benchmarks:

$ rebar diff -e memchr/memmem tmp/base.csv tmp/simpler-is-equal-raw.csv tmp/precompute-indices.csv -t 1.1
benchmark                                          engine                       tmp/base.csv       tmp/simpler-is-equal-raw.csv  tmp/precompute-indices.csv
---------                                          ------                       ------------       ----------------------------  --------------------------
memmem/code/rust-library-never-fn-strength         rust/memchr/memmem/prebuilt  40.6 GB/s (1.31x)  50.2 GB/s (1.06x)             53.4 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren   rust/memchr/memmem/prebuilt  39.4 GB/s (1.33x)  51.3 GB/s (1.02x)             52.5 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux             rust/memchr/memmem/prebuilt  41.0 GB/s (1.26x)  51.4 GB/s (1.00x)             51.5 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str          rust/memchr/memmem/prebuilt  38.8 GB/s (1.35x)  52.4 GB/s (1.00x)             52.5 GB/s (1.00x)
memmem/code/rust-library-common-fn-is-empty        rust/memchr/memmem/prebuilt  40.2 GB/s (1.25x)  50.3 GB/s (1.00x)             49.2 GB/s (1.02x)
memmem/code/rust-library-common-fn                 rust/memchr/memmem/prebuilt  24.7 GB/s (1.18x)  29.2 GB/s (1.00x)             25.3 GB/s (1.16x)
memmem/code/rust-library-common-let                rust/memchr/memmem/prebuilt  18.1 GB/s (1.11x)  20.0 GB/s (1.01x)             20.1 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash               rust/memchr/memmem/prebuilt  38.5 GB/s (1.30x)  50.0 GB/s (1.00x)             39.6 GB/s (1.26x)
memmem/pathological/md5-huge-last-hash             rust/memchr/memmem/prebuilt  38.0 GB/s (1.32x)  50.1 GB/s (1.00x)             49.3 GB/s (1.02x)
memmem/pathological/rare-repeated-huge-tricky      rust/memchr/memmem/prebuilt  36.6 GB/s (1.72x)  62.8 GB/s (1.00x)             39.5 GB/s (1.59x)
memmem/pathological/defeat-simple-vector-alphabet  rust/memchr/memmem/prebuilt  4.0 GB/s (1.33x)   5.4 GB/s (1.00x)              5.4 GB/s (1.00x)
memmem/sliceslice/seemingly-random                 rust/memchr/memmem/prebuilt  8.1 MB/s (1.17x)   9.5 MB/s (1.01x)              9.6 MB/s (1.00x)
memmem/sliceslice/i386                             rust/memchr/memmem/prebuilt  42.7 MB/s (1.25x)  53.5 MB/s (1.00x)             36.5 MB/s (1.46x)
memmem/subtitles/common/huge-en-that               rust/memchr/memmem/prebuilt  30.7 GB/s (1.22x)  37.4 GB/s (1.00x)             29.4 GB/s (1.27x)
memmem/subtitles/common/huge-ru-that               rust/memchr/memmem/prebuilt  28.8 GB/s (1.26x)  36.2 GB/s (1.00x)             34.7 GB/s (1.04x)
memmem/subtitles/common/huge-ru-one-space          rust/memchr/memmem/prebuilt  2.6 GB/s (1.00x)   2.3 GB/s (1.12x)              2.6 GB/s (1.00x)
memmem/subtitles/common/huge-zh-that               rust/memchr/memmem/prebuilt  27.2 GB/s (1.40x)  38.2 GB/s (1.00x)             31.7 GB/s (1.21x)
memmem/subtitles/common/huge-zh-do-not             rust/memchr/memmem/prebuilt  18.0 GB/s (1.14x)  20.4 GB/s (1.00x)             19.6 GB/s (1.04x)
memmem/subtitles/never/huge-en-john-watson         rust/memchr/memmem/prebuilt  39.8 GB/s (1.55x)  61.6 GB/s (1.00x)             57.8 GB/s (1.07x)
memmem/subtitles/never/huge-en-all-common-bytes    rust/memchr/memmem/prebuilt  41.1 GB/s (1.25x)  51.4 GB/s (1.00x)             46.0 GB/s (1.12x)
memmem/subtitles/never/huge-en-some-rare-bytes     rust/memchr/memmem/prebuilt  40.2 GB/s (1.57x)  63.0 GB/s (1.00x)             51.7 GB/s (1.22x)
memmem/subtitles/never/huge-en-two-space           rust/memchr/memmem/prebuilt  33.2 GB/s (1.89x)  62.8 GB/s (1.00x)             38.2 GB/s (1.65x)
memmem/subtitles/never/huge-ru-john-watson         rust/memchr/memmem/prebuilt  39.2 GB/s (1.61x)  63.1 GB/s (1.00x)             56.6 GB/s (1.11x)
memmem/subtitles/never/huge-zh-john-watson         rust/memchr/memmem/prebuilt  39.6 GB/s (1.43x)  56.6 GB/s (1.00x)             30.3 GB/s (1.87x)
memmem/subtitles/rare/huge-en-sherlock-holmes      rust/memchr/memmem/prebuilt  39.3 GB/s (1.55x)  61.0 GB/s (1.00x)             54.7 GB/s (1.12x)
memmem/subtitles/rare/huge-en-sherlock             rust/memchr/memmem/prebuilt  43.5 GB/s (1.31x)  54.2 GB/s (1.06x)             57.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-medium-needle        rust/memchr/memmem/prebuilt  32.9 GB/s (1.68x)  52.7 GB/s (1.04x)             55.1 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes      rust/memchr/memmem/prebuilt  38.7 GB/s (1.58x)  61.0 GB/s (1.00x)             56.3 GB/s (1.08x)
memmem/subtitles/rare/huge-ru-sherlock             rust/memchr/memmem/prebuilt  39.5 GB/s (1.51x)  59.4 GB/s (1.00x)             56.9 GB/s (1.04x)
memmem/subtitles/rare/huge-zh-sherlock-holmes      rust/memchr/memmem/prebuilt  35.9 GB/s (1.29x)  46.3 GB/s (1.00x)             29.6 GB/s (1.56x)
memmem/subtitles/rare/huge-zh-sherlock             rust/memchr/memmem/prebuilt  35.5 GB/s (1.56x)  55.1 GB/s (1.00x)             47.2 GB/s (1.17x)

My interpretation there is that for some benchmarks, it's about on par with simpler-is-equal-raw, but there are several benchmarks where it appears to regress back to the status quo. The hyperfine benchmarks from the reproduction repository are already established to be very noisy, but if I look at the codegen there via perf, I see this:

[codegen screenshot for the precompute-indices patch omitted]

Interestingly, while the mov instructions are gone, there are extra add instructions in their place. I guess this makes sense to me, since this approach does seem to require a few more adds/subs in the critical section.

So I think for now, I'm just going to go with the simpler is_equal_raw and see how far that gets us. Thank you both again!

BurntSushi commented 6 months ago

So to sum up, there is still one more mov instruction in the critical path that wasn't there in memchr 2.5.0. Since I can't seem to completely reproduce the problem in the OP, I'm not sure if removing one of the mov instructions is enough to fix the regression. But I'm happy to start there for now.

It does look to me like the code itself could be improved. If there's a lot of register pressure forcing stack spilling, then that's not good. It would be nice to brainstorm ways of fixing that. (Ideally not by using inline assembly.)

BurntSushi commented 6 months ago

This should be largely fixed in memchr 2.7.0 on crates.io. I welcome further improvements here.