RustCrypto / stream-ciphers

Collection of stream cipher algorithms

chacha20: Remove mutable borrows from AVX2 backend #268

Closed: str4d closed this 3 years ago

str4d commented 3 years ago

Passing &mut StateWord everywhere caused the compiler to insert a vmovdqa after almost every operation, and also caused the diagonalization to be emitted as vpermilps instead of being optimised to vpshufd.

The new State struct helps to manage the passing-around of owned StateWords.
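As a rough illustration of the two calling styles, here is a minimal, portable sketch. It is not the code from this PR: StateWord is modelled as a plain [u32; 4] rather than an AVX2 register, and every name (the State fields a/b/c/d, quarter_round, quarter_round_mut) is hypothetical. It only shows how a quarter round can thread owned values through a State struct instead of mutating through borrows.

```rust
/// Hypothetical stand-in for one row of the ChaCha state. The real AVX2
/// backend wraps a vector register; a [u32; 4] keeps this sketch portable.
#[derive(Clone, Copy, Debug, PartialEq)]
struct StateWord([u32; 4]);

impl StateWord {
    /// Apply `f` lane by lane to two rows.
    fn map2(self, rhs: Self, f: impl Fn(u32, u32) -> u32) -> Self {
        let mut out = [0u32; 4];
        for i in 0..4 {
            out[i] = f(self.0[i], rhs.0[i]);
        }
        StateWord(out)
    }
    fn wrapping_add(self, rhs: Self) -> Self {
        self.map2(rhs, u32::wrapping_add)
    }
    fn xor(self, rhs: Self) -> Self {
        self.map2(rhs, |x, y| x ^ y)
    }
    fn rotl(self, n: u32) -> Self {
        StateWord(self.0.map(|x| x.rotate_left(n)))
    }
}

/// Hypothetical owned bundle of the four rows (field names are assumptions).
#[derive(Clone, Copy, Debug, PartialEq)]
struct State {
    a: StateWord,
    b: StateWord,
    c: StateWord,
    d: StateWord,
}

/// Borrow-based style: every step writes back through a `&mut`.
fn quarter_round_mut(
    a: &mut StateWord,
    b: &mut StateWord,
    c: &mut StateWord,
    d: &mut StateWord,
) {
    *a = (*a).wrapping_add(*b);
    *d = (*d).xor(*a).rotl(16);
    *c = (*c).wrapping_add(*d);
    *b = (*b).xor(*c).rotl(12);
    *a = (*a).wrapping_add(*b);
    *d = (*d).xor(*a).rotl(8);
    *c = (*c).wrapping_add(*d);
    *b = (*b).xor(*c).rotl(7);
}

/// Owned style: take the state by value, shadow locals, return a new state.
fn quarter_round(s: State) -> State {
    let a = s.a.wrapping_add(s.b);
    let d = s.d.xor(a).rotl(16);
    let c = s.c.wrapping_add(d);
    let b = s.b.xor(c).rotl(12);
    let a = a.wrapping_add(b);
    let d = d.xor(a).rotl(8);
    let c = c.wrapping_add(d);
    let b = b.xor(c).rotl(7);
    State { a, b, c, d }
}
```

Both functions compute the same quarter round. The owned form gives the compiler plain values to keep in registers, whereas the borrowed form asks it to prove that the stores through each reference can be elided, which is the kind of difference the description above attributes the extra vmovdqa instructions to.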

str4d commented 3 years ago

Despite this PR causing 24 fewer vmovdqa instructions to be generated in the assembly, I see no effect on the benchmarks (on my machine).

Current master:

     Running unittests (target/release/deps/chacha12-b35bb60eb267f274)
test bench1_10     ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100    ... bench:          37 ns/iter (+/- 2) = 2702 MB/s
test bench3_1000   ... bench:         278 ns/iter (+/- 8) = 3597 MB/s
test bench4_10000  ... bench:       2,607 ns/iter (+/- 116) = 3835 MB/s
test bench5_100000 ... bench:      26,223 ns/iter (+/- 1,733) = 3813 MB/s

     Running unittests (target/release/deps/chacha20-c3750fdb2e6a6143)
test bench1_10     ... bench:           9 ns/iter (+/- 0) = 1111 MB/s
test bench2_100    ... bench:          51 ns/iter (+/- 3) = 1960 MB/s
test bench3_1000   ... bench:         420 ns/iter (+/- 8) = 2380 MB/s
test bench4_10000  ... bench:       4,025 ns/iter (+/- 93) = 2484 MB/s
test bench5_100000 ... bench:      40,054 ns/iter (+/- 1,197) = 2496 MB/s

     Running unittests (target/release/deps/chacha8-f8e1d7fb0cf442ec)
test bench1_10     ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100    ... bench:          30 ns/iter (+/- 2) = 3333 MB/s
test bench3_1000   ... bench:         205 ns/iter (+/- 6) = 4878 MB/s
test bench4_10000  ... bench:       1,910 ns/iter (+/- 87) = 5235 MB/s
test bench5_100000 ... bench:      19,236 ns/iter (+/- 487) = 5198 MB/s

This PR:

     Running unittests (target/release/deps/chacha12-b35bb60eb267f274)
test bench1_10     ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100    ... bench:          35 ns/iter (+/- 1) = 2857 MB/s
test bench3_1000   ... bench:         282 ns/iter (+/- 11) = 3546 MB/s
test bench4_10000  ... bench:       2,665 ns/iter (+/- 161) = 3752 MB/s
test bench5_100000 ... bench:      26,397 ns/iter (+/- 1,435) = 3788 MB/s

     Running unittests (target/release/deps/chacha20-c3750fdb2e6a6143)
test bench1_10     ... bench:           9 ns/iter (+/- 0) = 1111 MB/s
test bench2_100    ... bench:          51 ns/iter (+/- 2) = 1960 MB/s
test bench3_1000   ... bench:         419 ns/iter (+/- 20) = 2386 MB/s
test bench4_10000  ... bench:       3,998 ns/iter (+/- 129) = 2501 MB/s
test bench5_100000 ... bench:      40,263 ns/iter (+/- 1,884) = 2483 MB/s

     Running unittests (target/release/deps/chacha8-f8e1d7fb0cf442ec)
test bench1_10     ... bench:           7 ns/iter (+/- 0) = 1428 MB/s
test bench2_100    ... bench:          30 ns/iter (+/- 2) = 3333 MB/s
test bench3_1000   ... bench:         216 ns/iter (+/- 10) = 4629 MB/s
test bench4_10000  ... bench:       1,983 ns/iter (+/- 80) = 5042 MB/s
test bench5_100000 ... bench:      19,965 ns/iter (+/- 694) = 5008 MB/s
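To make the two runs easier to compare, the following sketch recomputes the throughput column and the relative change between runs. It assumes the bench harness reports decimal megabytes per second using truncating integer division, which reproduces the figures in the logs above; the function names are hypothetical.

```rust
/// Throughput in MB/s for `bytes` processed in `ns` nanoseconds per iteration.
/// Assumes decimal megabytes and truncating integer division.
fn mb_per_s(bytes: u64, ns: u64) -> u64 {
    bytes * 1000 / ns
}

/// Relative change in time per iteration, in percent (positive means slower).
fn percent_change(before_ns: f64, after_ns: f64) -> f64 {
    (after_ns - before_ns) / before_ns * 100.0
}
```

For example, the chacha8 bench5_100000 figures (19,236 ns on master, 19,965 ns with this PR) differ by a little under 4%, while the chacha20 figures differ by well under 1% in either direction.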

str4d commented 3 years ago

When compiled with RUSTFLAGS="-Ctarget-feature=+avx2", this PR generates almost exactly the same assembly as current master (a handful of operations are reordered), so the changes only affect autodetect mode (and apparently not materially on my machine, but maybe it's useful on others).
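The distinction between the two modes can be sketched as follows. This is an illustration only, not the crate's actual dispatch code: the Backend enum and select_backend are hypothetical, and the sketch uses the standard library's is_x86_feature_detected! macro for the runtime check, which may differ from the mechanism the crate uses.

```rust
/// Hypothetical labels for how the AVX2 path would be reached.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Backend {
    /// AVX2 enabled for the whole build, e.g. via -Ctarget-feature=+avx2.
    Avx2Static,
    /// AVX2 found by a runtime CPU check (autodetect mode).
    Avx2Detected,
    /// No AVX2: fall back to portable code.
    Portable,
}

fn select_backend() -> Backend {
    // Known at compile time, so the check itself costs nothing at runtime.
    if cfg!(target_feature = "avx2") {
        return Backend::Avx2Static;
    }
    // The runtime check only exists on x86 targets.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            return Backend::Avx2Detected;
        }
    }
    Backend::Portable
}
```

In a statically enabled build the whole crate is compiled with AVX2 available, whereas in autodetect mode only the functions explicitly marked for AVX2 are, so the optimiser sees the code in a different context. That is consistent with the observation above that only autodetect mode changes.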