Closed by str4d 3 years ago
Despite causing 24 fewer `vmovdqa` operations to be generated in the assembly, I see no effect on the benchmarks (on my machine).
Current master:
Running unittests (target/release/deps/chacha12-b35bb60eb267f274)
test bench1_10 ... bench: 8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100 ... bench: 37 ns/iter (+/- 2) = 2702 MB/s
test bench3_1000 ... bench: 278 ns/iter (+/- 8) = 3597 MB/s
test bench4_10000 ... bench: 2,607 ns/iter (+/- 116) = 3835 MB/s
test bench5_100000 ... bench: 26,223 ns/iter (+/- 1,733) = 3813 MB/s
Running unittests (target/release/deps/chacha20-c3750fdb2e6a6143)
test bench1_10 ... bench: 9 ns/iter (+/- 0) = 1111 MB/s
test bench2_100 ... bench: 51 ns/iter (+/- 3) = 1960 MB/s
test bench3_1000 ... bench: 420 ns/iter (+/- 8) = 2380 MB/s
test bench4_10000 ... bench: 4,025 ns/iter (+/- 93) = 2484 MB/s
test bench5_100000 ... bench: 40,054 ns/iter (+/- 1,197) = 2496 MB/s
Running unittests (target/release/deps/chacha8-f8e1d7fb0cf442ec)
test bench1_10 ... bench: 8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100 ... bench: 30 ns/iter (+/- 2) = 3333 MB/s
test bench3_1000 ... bench: 205 ns/iter (+/- 6) = 4878 MB/s
test bench4_10000 ... bench: 1,910 ns/iter (+/- 87) = 5235 MB/s
test bench5_100000 ... bench: 19,236 ns/iter (+/- 487) = 5198 MB/s
This PR:
Running unittests (target/release/deps/chacha12-b35bb60eb267f274)
test bench1_10 ... bench: 8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100 ... bench: 35 ns/iter (+/- 1) = 2857 MB/s
test bench3_1000 ... bench: 282 ns/iter (+/- 11) = 3546 MB/s
test bench4_10000 ... bench: 2,665 ns/iter (+/- 161) = 3752 MB/s
test bench5_100000 ... bench: 26,397 ns/iter (+/- 1,435) = 3788 MB/s
Running unittests (target/release/deps/chacha20-c3750fdb2e6a6143)
test bench1_10 ... bench: 9 ns/iter (+/- 0) = 1111 MB/s
test bench2_100 ... bench: 51 ns/iter (+/- 2) = 1960 MB/s
test bench3_1000 ... bench: 419 ns/iter (+/- 20) = 2386 MB/s
test bench4_10000 ... bench: 3,998 ns/iter (+/- 129) = 2501 MB/s
test bench5_100000 ... bench: 40,263 ns/iter (+/- 1,884) = 2483 MB/s
Running unittests (target/release/deps/chacha8-f8e1d7fb0cf442ec)
test bench1_10 ... bench: 7 ns/iter (+/- 0) = 1428 MB/s
test bench2_100 ... bench: 30 ns/iter (+/- 2) = 3333 MB/s
test bench3_1000 ... bench: 216 ns/iter (+/- 10) = 4629 MB/s
test bench4_10000 ... bench: 1,983 ns/iter (+/- 80) = 5042 MB/s
test bench5_100000 ... bench: 19,965 ns/iter (+/- 694) = 5008 MB/s
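For reading the numbers above: the MB/s column appears consistent with `bytes * 1000 / ns` using truncating integer division. The helper below is a small sketch of that arithmetic; `throughput_mb_s` is a hypothetical name, not part of the benchmark harness.

```rust
/// Throughput in MB/s for `bytes` processed in `ns_per_iter` nanoseconds,
/// truncated toward zero. Hypothetical helper for checking the figures above.
pub fn throughput_mb_s(bytes: u64, ns_per_iter: u64) -> u64 {
    // bytes per nanosecond equals GB/s; multiplying by 1000 gives MB/s.
    bytes * 1000 / ns_per_iter
}
```

For example, the chacha8 `bench3_1000` result moves from 205 ns to 216 ns per iteration, which this formula maps to 4878 MB/s and 4629 MB/s respectively.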
When compiled with `RUSTFLAGS="-Ctarget-feature=+avx2"`, this PR generates almost exactly the same assembly as current master (a handful of operations are reordered), so the changes only affect autodetect mode (and apparently not materially on my machine, but maybe it's useful on others).
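To illustrate the difference between the two build modes, here is a minimal sketch of how a backend might be chosen. It uses only the standard library's `cfg!` and `is_x86_feature_detected!` macros; the function name `avx2_selection` and the returned labels are hypothetical, and the crate's actual autodetection mechanism may differ.

```rust
/// Reports which path a hypothetical backend selector would take.
pub fn avx2_selection() -> &'static str {
    // True only when the build enabled AVX2 statically, e.g. via
    // RUSTFLAGS="-Ctarget-feature=+avx2".
    if cfg!(target_feature = "avx2") {
        return "compile-time";
    }

    // Otherwise, on x86 targets, ask the CPU at runtime (autodetect mode).
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            return "runtime";
        }
    }

    // No AVX2 available: use a portable implementation.
    "fallback"
}
```

In the compile-time case the compiler knows AVX2 is available throughout the whole crate, which is one plausible reason the generated code differs from the autodetect build.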
The use of `&mut StateWord` everywhere caused a `vmovdqa` to be inserted after almost every operation, and also caused the diagonalization to use `vpermilps` instead of seeing the optimisation to `vpshufd`. The new `State` struct helps to manage the passing-around of owned `StateWord`s.
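The two calling styles can be sketched as follows. This is an illustration, not the PR's code: `StateWord` here is a hypothetical stand-in holding four plain `u32` lanes instead of a SIMD register, and `rows_by_ref` / `rows_owned` are assumed names. Both apply the standard ChaCha quarter-round to each lane.

```rust
/// Hypothetical stand-in for a SIMD row: four 32-bit lanes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct StateWord([u32; 4]);

impl StateWord {
    fn wrapping_add(self, rhs: Self) -> Self {
        let mut out = self.0;
        for i in 0..4 {
            out[i] = out[i].wrapping_add(rhs.0[i]);
        }
        StateWord(out)
    }

    fn xor(self, rhs: Self) -> Self {
        let mut out = self.0;
        for i in 0..4 {
            out[i] ^= rhs.0[i];
        }
        StateWord(out)
    }

    fn rotate_left(self, n: u32) -> Self {
        let mut out = self.0;
        for lane in out.iter_mut() {
            *lane = lane.rotate_left(n);
        }
        StateWord(out)
    }
}

/// Owned rows grouped in one struct (illustrative layout).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct State {
    pub a: StateWord,
    pub b: StateWord,
    pub c: StateWord,
    pub d: StateWord,
}

/// Reference style: every step writes its result back through a pointer.
pub fn rows_by_ref(a: &mut StateWord, b: &mut StateWord, c: &mut StateWord, d: &mut StateWord) {
    *a = a.wrapping_add(*b);
    *d = d.xor(*a).rotate_left(16);
    *c = c.wrapping_add(*d);
    *b = b.xor(*c).rotate_left(12);
    *a = a.wrapping_add(*b);
    *d = d.xor(*a).rotate_left(8);
    *c = c.wrapping_add(*d);
    *b = b.xor(*c).rotate_left(7);
}

/// Owned style: values are taken by value, rebound, and returned.
pub fn rows_owned(s: State) -> State {
    let State { a, b, c, d } = s;
    let a = a.wrapping_add(b);
    let d = d.xor(a).rotate_left(16);
    let c = c.wrapping_add(d);
    let b = b.xor(c).rotate_left(12);
    let a = a.wrapping_add(b);
    let d = d.xor(a).rotate_left(8);
    let c = c.wrapping_add(d);
    let b = b.xor(c).rotate_left(7);
    State { a, b, c, d }
}
```

Both functions compute the same result. The owned form simply avoids requiring that each intermediate value live at a memory location, which gives the optimiser more freedom to keep rows in registers.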