Closed Shnatsel closed 1 year ago
Merging #141 (17c049d) into main (c836289) will increase coverage by
0.00%
. The diff coverage is100.00%
.
@@ Coverage Diff @@
## main #141 +/- ##
=======================================
Coverage 89.29% 89.30%
=======================================
Files 13 13
Lines 2485 2487 +2
=======================================
+ Hits 2219 2221 +2
Misses 266 266
Impacted Files | Coverage Δ | |
---|---|---|
src/sink.rs | 97.56% <100.00%> (+0.04%) |
:arrow_up: |
Thanks for the PR, but it seems this version is slightly slower (I ran it two times). I'm not sure why. The loop is bounds check free, but overall the function has more instructions. This would suggest that the data has many small overlapping calls, where the check upfront outweighs the check free loop.
It also seems that two checks are unnecessary (unwrap and window is non-zero)
They can be removed with .windows(pos.saturating_sub(start).saturating_add(1) );
But then the compilers unrolls the loop, which is even slower. https://godbolt.org/z/8neTcosoq
Benchmark of this PR:
BlockDecompress/lz4_flex_rust/725
time: [274.14 ns 276.68 ns 280.44 ns]
thrpt: [2.4077 GiB/s 2.4404 GiB/s 2.4630 GiB/s]
change:
time: [+0.6372% +1.4770% +2.3369%] (p = 0.00 < 0.05)
thrpt: [-2.2835% -1.4555% -0.6331%]
Change within noise threshold.
BlockDecompress/lz4_flex_rust/34308
time: [18.727 µs 18.864 µs 18.996 µs]
thrpt: [1.6821 GiB/s 1.6938 GiB/s 1.7062 GiB/s]
change:
time: [-1.3487% -0.8570% -0.3486%] (p = 0.00 < 0.05)
thrpt: [+0.3498% +0.8644% +1.3671%]
Change within noise threshold.
BlockDecompress/lz4_flex_rust/64723
time: [34.075 µs 34.219 µs 34.374 µs]
thrpt: [1.7536 GiB/s 1.7616 GiB/s 1.7690 GiB/s]
change:
time: [-0.4305% +0.3407% +1.0611%] (p = 0.38 > 0.05)
thrpt: [-1.0499% -0.3395% +0.4324%]
No change in performance detected.
BlockDecompress/lz4_flex_rust/66675
time: [16.245 µs 16.345 µs 16.449 µs]
thrpt: [3.7750 GiB/s 3.7992 GiB/s 3.8225 GiB/s]
change:
time: [+4.4317% +5.3461% +6.2392%] (p = 0.00 < 0.05)
thrpt: [-5.8728% -5.0748% -4.2437%]
Performance has regressed.
BlockDecompress/lz4_flex_rust/9991663
time: [4.9801 ms 5.0305 ms 5.0872 ms]
thrpt: [1.8292 GiB/s 1.8498 GiB/s 1.8685 GiB/s]
change:
time: [+1.8810% +3.4240% +5.0104%] (p = 0.00 < 0.05)
thrpt: [-4.7714% -3.3106% -1.8462%]
Performance has regressed.
It might be faster on nightly with the no-unrolling hint, but since most people are on stable, I think it's best to leave it as is.
I've recently learned that intrinsics such as saturating_add()
do not in fact map to a native CPU implementation, and are implemented as a cmov
on x86.
In my experience, a cmov
is more expensive than a correctly predicted branch, and branches leading to panics are always correctly predicted - the panic case is hinted to be unlikely. So checked_add(x).unwrap()
should be faster than saturating_add(x)
.
I'm writing this more for posterity - it probably won't save this PR because unrolling is getting in the way regardless.
Use an iterator to avoid bounds checks.
You can't use
slice.windows()
on mutable slices because that would result in overlapping mutable slices if.collect()
ed, but you can turn a&mut [u8]
into&[Cell<u8>]
and copy theu8
s one by one.Godbolt showing no bounds checks in the hot loop (under
.LBB0_5:
): https://godbolt.org/z/aEcj7WYjf