BurntSushi / memchr

Optimized string search routines for Rust.
The Unlicense
799 stars 97 forks source link

arch: simplify and improve is_equal_raw #142

Closed BurntSushi closed 6 months ago

BurntSushi commented 6 months ago

This came from @jhorstmann in #139. It simplifies is_equal_raw and also in turn simplifies its codegen. Since this routine gets inlined into others, this winds up being a fair improvement:

$ rebar diff -e memchr/memmem tmp/base.csv tmp/simpler-is-equal-raw.csv -t 1.1
benchmark                                          engine                       tmp/base.csv       tmp/simpler-is-equal-raw.csv
---------                                          ------                       ------------       ----------------------------
memmem/code/rust-library-never-fn-strength         rust/memchr/memmem/prebuilt  40.6 GB/s (1.24x)  50.2 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren   rust/memchr/memmem/prebuilt  39.4 GB/s (1.30x)  51.3 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux             rust/memchr/memmem/prebuilt  41.0 GB/s (1.25x)  51.4 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str          rust/memchr/memmem/prebuilt  38.8 GB/s (1.35x)  52.4 GB/s (1.00x)
memmem/code/rust-library-common-fn-is-empty        rust/memchr/memmem/prebuilt  40.2 GB/s (1.25x)  50.3 GB/s (1.00x)
memmem/code/rust-library-common-fn                 rust/memchr/memmem/prebuilt  24.7 GB/s (1.18x)  29.2 GB/s (1.00x)
memmem/code/rust-library-common-let                rust/memchr/memmem/prebuilt  18.1 GB/s (1.10x)  20.0 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash               rust/memchr/memmem/prebuilt  38.5 GB/s (1.30x)  50.0 GB/s (1.00x)
memmem/pathological/md5-huge-last-hash             rust/memchr/memmem/prebuilt  38.0 GB/s (1.32x)  50.1 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky      rust/memchr/memmem/prebuilt  36.6 GB/s (1.72x)  62.8 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-alphabet  rust/memchr/memmem/prebuilt  4.0 GB/s (1.33x)   5.4 GB/s (1.00x)
memmem/sliceslice/seemingly-random                 rust/memchr/memmem/prebuilt  8.1 MB/s (1.17x)   9.5 MB/s (1.00x)
memmem/sliceslice/i386                             rust/memchr/memmem/prebuilt  42.7 MB/s (1.25x)  53.5 MB/s (1.00x)
memmem/subtitles/common/huge-en-that               rust/memchr/memmem/prebuilt  30.7 GB/s (1.22x)  37.4 GB/s (1.00x)
memmem/subtitles/common/huge-ru-that               rust/memchr/memmem/prebuilt  28.8 GB/s (1.26x)  36.2 GB/s (1.00x)
memmem/subtitles/common/huge-ru-one-space          rust/memchr/memmem/prebuilt  2.6 GB/s (1.00x)   2.3 GB/s (1.12x)
memmem/subtitles/common/huge-zh-that               rust/memchr/memmem/prebuilt  27.2 GB/s (1.40x)  38.2 GB/s (1.00x)
memmem/subtitles/common/huge-zh-do-not             rust/memchr/memmem/prebuilt  18.0 GB/s (1.14x)  20.4 GB/s (1.00x)
memmem/subtitles/never/huge-en-john-watson         rust/memchr/memmem/prebuilt  39.8 GB/s (1.55x)  61.6 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes    rust/memchr/memmem/prebuilt  41.1 GB/s (1.25x)  51.4 GB/s (1.00x)
memmem/subtitles/never/huge-en-some-rare-bytes     rust/memchr/memmem/prebuilt  40.2 GB/s (1.57x)  63.0 GB/s (1.00x)
memmem/subtitles/never/huge-en-two-space           rust/memchr/memmem/prebuilt  33.2 GB/s (1.89x)  62.8 GB/s (1.00x)
memmem/subtitles/never/huge-ru-john-watson         rust/memchr/memmem/prebuilt  39.2 GB/s (1.61x)  63.1 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson         rust/memchr/memmem/prebuilt  39.6 GB/s (1.43x)  56.6 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes      rust/memchr/memmem/prebuilt  39.3 GB/s (1.55x)  61.0 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock             rust/memchr/memmem/prebuilt  43.5 GB/s (1.25x)  54.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-medium-needle        rust/memchr/memmem/prebuilt  32.9 GB/s (1.60x)  52.7 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes      rust/memchr/memmem/prebuilt  38.7 GB/s (1.58x)  61.0 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock             rust/memchr/memmem/prebuilt  39.5 GB/s (1.51x)  59.4 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes      rust/memchr/memmem/prebuilt  35.9 GB/s (1.29x)  46.3 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock             rust/memchr/memmem/prebuilt  35.5 GB/s (1.56x)  55.1 GB/s (1.00x)

See my commentary in #139 for more discussion.

Fixes #139

cgbur commented 6 months ago

Confirming that this PR improves the performance on my AMD system (5900x) which seems to be more impacted by the 2.6 refactor. Benchmarking the scenario created in #139:

❯ poop "./memchr-2.5.0" "./memchr-2.6.4" "./memchr-pr" -d 20000
Benchmark 1 (27 runs): ./memchr-2.5.0
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           766ms ± 11.9ms     746ms …  792ms          0 ( 0%)        0%
  peak_rss           2.25MB ±  114KB    1.97MB … 2.36MB          0 ( 0%)        0%
  cpu_cycles         3.53G  ± 49.6M     3.43G  … 3.66G           0 ( 0%)        0%
  instructions       12.9G  ±  315      12.9G  … 12.9G           0 ( 0%)        0%
  cache_references   1.06G  ± 43.5M      985M  … 1.14G           0 ( 0%)        0%
  cache_misses       20.7M  ± 3.06M     14.8M  … 27.2M           0 ( 0%)        0%
  branch_misses      14.0M  ± 88.3K     13.8M  … 14.1M           0 ( 0%)        0%
Benchmark 2 (23 runs): ./memchr-2.6.4
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           906ms ± 14.6ms     873ms …  931ms          0 ( 0%)        💩+ 18.3% ±  1.0%
  peak_rss           2.28MB ± 98.1KB    2.16MB … 2.36MB          0 ( 0%)          +  1.4% ±  2.7%
  cpu_cycles         4.22G  ± 53.3M     4.14G  … 4.35G           0 ( 0%)        💩+ 19.5% ±  0.8%
  instructions       15.7G  ±  247      15.7G  … 15.7G           1 ( 4%)        💩+ 21.7% ±  0.0%
  cache_references   1.00G  ± 67.0M      871M  … 1.16G           0 ( 0%)        ⚡-  5.6% ±  3.0%
  cache_misses       19.8M  ± 3.94M     12.7M  … 26.6M           0 ( 0%)          -  4.3% ±  9.7%
  branch_misses      13.8M  ± 76.3K     13.7M  … 13.9M           2 ( 9%)          -  1.3% ±  0.3%
Benchmark 3 (25 runs): ./memchr-pr
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           807ms ± 15.2ms     779ms …  834ms          0 ( 0%)        💩+  5.4% ±  1.0%
  peak_rss           2.29MB ± 92.7KB    2.16MB … 2.36MB          0 ( 0%)          +  1.9% ±  2.6%
  cpu_cycles         3.72G  ± 67.1M     3.61G  … 3.82G           0 ( 0%)        💩+  5.2% ±  0.9%
  instructions       14.3G  ±  299      14.3G  … 14.3G           1 ( 4%)        💩+ 10.6% ±  0.0%
  cache_references   1.05G  ± 47.6M      929M  … 1.16G           1 ( 4%)          -  1.1% ±  2.4%
  cache_misses       20.7M  ± 2.71M     16.5M  … 25.9M           0 ( 0%)          +  0.1% ±  7.8%
  branch_misses      14.2M  ± 93.9K     14.0M  … 14.4M           0 ( 0%)          +  1.1% ±  0.4%
BurntSushi commented 6 months ago

This PR is on crates.io in memchr 2.7.0.