PSeitz / lz4_flex

Fastest pure Rust implementation of LZ4 compression/decompression.

Remove bounds checks from `extend_from_within_overlapping` #141

Closed · Shnatsel closed this 1 year ago

Shnatsel commented 1 year ago

Use an iterator to avoid bounds checks.

You can't use `slice.windows()` on mutable slices, because that would hand out overlapping mutable slices if `.collect()`ed; but you can turn a `&mut [u8]` into `&[Cell<u8>]` and copy the u8s one by one.

Godbolt showing no bounds checks in the hot loop (under .LBB0_5:): https://godbolt.org/z/aEcj7WYjf
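For reference, here is a minimal sketch of the `Cell`-based copy described above (the function name and signature are hypothetical, not the exact PR diff):

```rust
use std::cell::Cell;

/// Copy `len` bytes from `start` to `pos` within `buf`, where the
/// source and destination ranges may overlap (the LZ4 match-copy case).
fn copy_within_overlapping(buf: &mut [u8], start: usize, pos: usize, len: usize) {
    assert!(start <= pos && pos + len <= buf.len());

    // Reinterpret `&mut [u8]` as `&[Cell<u8>]`: shared, aliasable views
    // that still permit writes, so overlapping ranges are sound.
    let cells: &[Cell<u8>] = Cell::from_mut(buf).as_slice_of_cells();

    // One region covering both the read range and the write range.
    let offset = pos - start;
    let region = &cells[start..pos + len];

    // Zip the read head with the write head. The zipped iterator carries
    // the length, so the loop body needs no per-iteration bounds checks.
    for (src, dst) in region.iter().zip(region[offset..].iter()) {
        dst.set(src.get());
    }
}
```

Because the copy runs forward one byte at a time, a destination byte written on iteration `i` is read again on iteration `i + offset`, which reproduces LZ4's repeating-pattern semantics for overlapping matches.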

codecov[bot] commented 1 year ago

Codecov Report

Merging #141 (17c049d) into main (c836289) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #141   +/-   ##
=======================================
  Coverage   89.29%   89.30%           
=======================================
  Files          13       13           
  Lines        2485     2487    +2     
=======================================
+ Hits         2219     2221    +2     
  Misses        266      266           
Impacted Files    Coverage            Δ
src/sink.rs       97.56% <100.00%>    (+0.04%) ↑
PSeitz commented 1 year ago

Thanks for the PR, but this version seems slightly slower (I ran it twice). I'm not sure why: the loop is bounds-check-free, but overall the function has more instructions. That suggests the data triggers many small overlapping copies, where the upfront check outweighs the check-free loop.

It also seems that two checks are unnecessary (the unwrap and the check that the window size is non-zero). They can be removed with `.windows(pos.saturating_sub(start).saturating_add(1))`; but then the compiler unrolls the loop, which is even slower. https://godbolt.org/z/8neTcosoq
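For context, a sketch of the windows-based loop shape under discussion (names hypothetical, not the exact diff): with windows of `offset + 1` cells, the first element of each window is the read head and the last is the write head.

```rust
use std::cell::Cell;

// `region` spans both the source and destination ranges; `offset` is
// the distance between the read and write heads.
fn copy_with_windows(region: &[Cell<u8>], offset: usize) {
    // `first()`/`last()` return Option; these unwraps, plus the internal
    // non-zero-size check in `windows()`, are the two checks mentioned
    // above. Computing the size as
    // `pos.saturating_sub(start).saturating_add(1)` lets the compiler
    // prove it is at least 1, which removes both.
    for w in region.windows(offset + 1) {
        w.last().unwrap().set(w.first().unwrap().get());
    }
}
```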

Benchmark of this PR:

BlockDecompress/lz4_flex_rust/725
                        time:   [274.14 ns 276.68 ns 280.44 ns]
                        thrpt:  [2.4077 GiB/s 2.4404 GiB/s 2.4630 GiB/s]
                 change:
                        time:   [+0.6372% +1.4770% +2.3369%] (p = 0.00 < 0.05)
                        thrpt:  [-2.2835% -1.4555% -0.6331%]
                        Change within noise threshold.
BlockDecompress/lz4_flex_rust/34308
                        time:   [18.727 µs 18.864 µs 18.996 µs]
                        thrpt:  [1.6821 GiB/s 1.6938 GiB/s 1.7062 GiB/s]
                 change:
                        time:   [-1.3487% -0.8570% -0.3486%] (p = 0.00 < 0.05)
                        thrpt:  [+0.3498% +0.8644% +1.3671%]
                        Change within noise threshold.
BlockDecompress/lz4_flex_rust/64723
                        time:   [34.075 µs 34.219 µs 34.374 µs]
                        thrpt:  [1.7536 GiB/s 1.7616 GiB/s 1.7690 GiB/s]
                 change:
                        time:   [-0.4305% +0.3407% +1.0611%] (p = 0.38 > 0.05)
                        thrpt:  [-1.0499% -0.3395% +0.4324%]
                        No change in performance detected.
BlockDecompress/lz4_flex_rust/66675
                        time:   [16.245 µs 16.345 µs 16.449 µs]
                        thrpt:  [3.7750 GiB/s 3.7992 GiB/s 3.8225 GiB/s]
                 change:
                        time:   [+4.4317% +5.3461% +6.2392%] (p = 0.00 < 0.05)
                        thrpt:  [-5.8728% -5.0748% -4.2437%]
                        Performance has regressed.
BlockDecompress/lz4_flex_rust/9991663
                        time:   [4.9801 ms 5.0305 ms 5.0872 ms]
                        thrpt:  [1.8292 GiB/s 1.8498 GiB/s 1.8685 GiB/s]
                 change:
                        time:   [+1.8810% +3.4240% +5.0104%] (p = 0.00 < 0.05)
                        thrpt:  [-4.7714% -3.3106% -1.8462%]
                        Performance has regressed.
Shnatsel commented 1 year ago

It might be faster on nightly with the no-unrolling hint, but since most people are on stable, I think it's best to leave it as is.

Shnatsel commented 1 year ago

I've recently learned that intrinsics such as saturating_add() do not in fact map to a single native CPU instruction; on x86 they are implemented with a cmov.

In my experience, a cmov is more expensive than a correctly predicted branch, and branches leading to panics are always correctly predicted, since the panic case is hinted to be unlikely. So checked_add(x).unwrap() should be faster than saturating_add(x).
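As a concrete illustration of the two forms being compared (helper names hypothetical):

```rust
// Typically lowers to sub/add plus cmov on x86: branch-free, but the
// cmov sits on the critical path even when no clamping ever happens.
fn window_size_saturating(pos: usize, start: usize) -> usize {
    pos.saturating_sub(start).saturating_add(1)
}

// Typically lowers to sub/add plus a jump to a cold panic path; the
// branch is hinted unlikely and predicts perfectly when it never fires.
fn window_size_checked(pos: usize, start: usize) -> usize {
    pos.checked_sub(start)
        .and_then(|d| d.checked_add(1))
        .expect("window size overflowed")
}
```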

I'm writing this more for posterity; it probably won't save this PR, because unrolling gets in the way regardless.