Closed — str4d closed this pull request 3 years ago.
I originally tried interleaving the AVX2 operations for pairs of blocks with a macro, but the macro was getting rather complex, so I switched to the approach in this PR (adding methods to `StateWord`). It turns out that approach requires an MSRV bump (I didn't realise `chacha20` wasn't on 1.51 yet). I can switch back to the macro-based approach if desired (in which case I'd probably move to one of the array-macro crate dependencies, unless there's something else in the RustCrypto crate ecosystem I can use for this).
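For illustration, here is a minimal scalar sketch of the method-based approach. The real `StateWord` in this PR wraps SIMD vectors (`__m128i`/`__m256i`), so the lane types and method bodies below are stand-ins, not the PR's actual code:

```rust
/// Hypothetical scalar stand-in for the SIMD-backed `StateWord`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct StateWord([u32; 4]);

impl StateWord {
    /// Lane-wise wrapping addition, as used in the ChaCha quarter-round.
    fn add(self, rhs: Self) -> Self {
        StateWord([
            self.0[0].wrapping_add(rhs.0[0]),
            self.0[1].wrapping_add(rhs.0[1]),
            self.0[2].wrapping_add(rhs.0[2]),
            self.0[3].wrapping_add(rhs.0[3]),
        ])
    }

    /// Lane-wise XOR.
    fn xor(self, rhs: Self) -> Self {
        StateWord([
            self.0[0] ^ rhs.0[0],
            self.0[1] ^ rhs.0[1],
            self.0[2] ^ rhs.0[2],
            self.0[3] ^ rhs.0[3],
        ])
    }

    /// Lane-wise left rotation by a constant amount.
    fn rotate_left(self, n: u32) -> Self {
        StateWord([
            self.0[0].rotate_left(n),
            self.0[1].rotate_left(n),
            self.0[2].rotate_left(n),
            self.0[3].rotate_left(n),
        ])
    }
}

fn main() {
    let a = StateWord([1, 2, 3, 4]);
    let b = StateWord([10, 20, 30, 40]);
    assert_eq!(a.add(b), StateWord([11, 22, 33, 44]));
    assert_eq!(a.xor(a), StateWord([0, 0, 0, 0]));
    assert_eq!(StateWord([1, 0, 0, 0]).rotate_left(16), StateWord([0x10000, 0, 0, 0]));
}
```

The advantage over a macro is that each round operation reads as ordinary method calls, at the cost of requiring const-generics-era language features (hence the MSRV question).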
I'd say it's fine to bump MSRV. 1.51 is now 6 months old, which is plenty of time IMO.
This PR eliminates the performance difference between `chacha20` and `c2-chacha` for a 2 GB test when compiled with `+avx2`, and significantly reduces the gap in autodetect mode: https://github.com/str4d/rage/issues/57#issuecomment-907709912
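For context on autodetect mode, a minimal sketch of runtime backend selection using the standard library's `is_x86_feature_detected!` macro (the RustCrypto crates use the `cpufeatures` crate for this; `backend_name` is a hypothetical helper for illustration):

```rust
// Pick the best available backend at runtime, falling back to the
// portable ("soft") implementation on CPUs or targets without SIMD.
fn backend_name() -> &'static str {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return "avx2";
        } else if is_x86_feature_detected!("sse2") {
            return "sse2";
        }
    }
    "soft"
}

fn main() {
    // Which backend the dispatcher would pick on this machine.
    println!("selected backend: {}", backend_name());
}
```

Compiling with `-C target-feature=+avx2` skips this dispatch entirely, which is why the autodetect path carries some residual overhead.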
The remaining issue is that for the `rng` feature, the doubled buffer size means the `Results` type is now `[u32; 64]`, which doesn't impl `Default` 😢
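The underlying limitation is that core's `Default` impls for arrays only cover lengths up to 32, so `[u32; 64]` can't derive it. One workaround is a wrapper with a manual impl (`Results64` is a hypothetical name for illustration, not the crate's actual type):

```rust
/// Wrapper around the doubled results buffer, solely so we can
/// implement `Default` ourselves (std doesn't provide it for
/// arrays longer than 32 elements).
struct Results64([u32; 64]);

impl Default for Results64 {
    fn default() -> Self {
        Results64([0u32; 64])
    }
}

fn main() {
    let r = Results64::default();
    assert!(r.0.iter().all(|&w| w == 0));
    assert_eq!(r.0.len(), 64);
}
```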
> This PR eliminates the performance difference...

Awesome! 🎉
> The remaining issue is that for the `rng` feature...

Wow, that's really annoying. I was curious what `rand_chacha` did here, and it looks like they completely abandoned the AVX2 backend?
Relevant PR: https://github.com/rust-random/rand/pull/931
cc @dhardy
Edit: never mind, it still uses `ppv-lite86` and its AVX2 backend.
> I was curious what `rand_chacha` did here

It uses a `#[repr(transparent)]` wrapper type: https://github.com/rust-random/rand/blob/ee1aacd257d0e0bdbf27342c07e04270465e09c5/rand_chacha/src/chacha.rs#L26-L44
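A sketch of that trick, with `Buffer` as a hypothetical name: `#[repr(transparent)]` guarantees the wrapper has the same layout as the bare array, so it can carry trait impls (like `Default`) without changing how the bytes are stored or passed around:

```rust
/// Newtype with the same memory layout as the wrapped array,
/// guaranteed by #[repr(transparent)], but free to carry impls.
#[repr(transparent)]
struct Buffer([u32; 64]);

impl Default for Buffer {
    fn default() -> Self {
        Buffer([0u32; 64])
    }
}

impl AsRef<[u32]> for Buffer {
    fn as_ref(&self) -> &[u32] {
        &self.0
    }
}

fn main() {
    // Identical size (and alignment) to the bare array.
    assert_eq!(core::mem::size_of::<Buffer>(), core::mem::size_of::<[u32; 64]>());
    let b = Buffer::default();
    assert_eq!(b.as_ref().len(), 64);
}
```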
Tests are now passing.
Ran benchmarks on my machine (i7-8700K overclocked to 4.8 GHz):
```
$ cargo +nightly --version
cargo 1.56.0-nightly (b51439fd8 2021-08-09)
```
Current master (0.7.3):

```
     Running unittests (target/release/deps/chacha12-b35bb60eb267f274)
test bench1_10     ... bench:          9 ns/iter (+/- 0)     = 1111 MB/s
test bench2_100    ... bench:         50 ns/iter (+/- 2)     = 2000 MB/s
test bench3_1000   ... bench:        397 ns/iter (+/- 9)     = 2518 MB/s
test bench4_10000  ... bench:      3,889 ns/iter (+/- 168)   = 2571 MB/s
test bench5_100000 ... bench:     38,739 ns/iter (+/- 1,431) = 2581 MB/s

     Running unittests (target/release/deps/chacha20-c3750fdb2e6a6143)
test bench1_10     ... bench:         12 ns/iter (+/- 0)     =  833 MB/s
test bench2_100    ... bench:         72 ns/iter (+/- 3)     = 1388 MB/s
test bench3_1000   ... bench:        614 ns/iter (+/- 30)    = 1628 MB/s
test bench4_10000  ... bench:      5,959 ns/iter (+/- 244)   = 1678 MB/s
test bench5_100000 ... bench:     59,545 ns/iter (+/- 1,724) = 1679 MB/s

     Running unittests (target/release/deps/chacha8-f8e1d7fb0cf442ec)
test bench1_10     ... bench:          8 ns/iter (+/- 0)     = 1250 MB/s
test bench2_100    ... bench:         38 ns/iter (+/- 1)     = 2631 MB/s
test bench3_1000   ... bench:        295 ns/iter (+/- 8)     = 3389 MB/s
test bench4_10000  ... bench:      2,844 ns/iter (+/- 108)   = 3516 MB/s
test bench5_100000 ... bench:     28,393 ns/iter (+/- 2,068) = 3521 MB/s
```
This PR:

```
     Running unittests (target/release/deps/chacha12-b35bb60eb267f274)
test bench1_10     ... bench:          8 ns/iter (+/- 0)     = 1250 MB/s
test bench2_100    ... bench:         37 ns/iter (+/- 2)     = 2702 MB/s
test bench3_1000   ... bench:        285 ns/iter (+/- 12)    = 3508 MB/s
test bench4_10000  ... bench:      2,691 ns/iter (+/- 137)   = 3716 MB/s
test bench5_100000 ... bench:     26,804 ns/iter (+/- 1,187) = 3730 MB/s

     Running unittests (target/release/deps/chacha20-c3750fdb2e6a6143)
test bench1_10     ... bench:          9 ns/iter (+/- 0)     = 1111 MB/s
test bench2_100    ... bench:         52 ns/iter (+/- 3)     = 1923 MB/s
test bench3_1000   ... bench:        432 ns/iter (+/- 23)    = 2314 MB/s
test bench4_10000  ... bench:      4,126 ns/iter (+/- 133)   = 2423 MB/s
test bench5_100000 ... bench:     41,191 ns/iter (+/- 1,258) = 2427 MB/s

     Running unittests (target/release/deps/chacha8-f8e1d7fb0cf442ec)
test bench1_10     ... bench:          8 ns/iter (+/- 0)     = 1250 MB/s
test bench2_100    ... bench:         31 ns/iter (+/- 1)     = 3225 MB/s
test bench3_1000   ... bench:        211 ns/iter (+/- 12)    = 4739 MB/s
test bench4_10000  ... bench:      1,978 ns/iter (+/- 101)   = 5055 MB/s
test bench5_100000 ... bench:     19,835 ns/iter (+/- 759)   = 5041 MB/s
```
Looks like you've answered your question, but you'd better ask @kazcw.
We switch to a 4-block buffer for the combined SSE2 / AVX2 backend, which allows the AVX2 backend to process them together, while the SSE2 backend continues to process one block at a time.
The AVX2 backend is refactored to enable interleaving the instructions per pair of blocks, for better ILP.
Closes #262.
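As a scalar illustration of the ILP point (the real backend interleaves AVX2 vector instructions, not `u32` operations, and the function names here are hypothetical): each ChaCha quarter-round is a serial dependency chain, but the quarter-rounds of two different blocks are independent, so alternating their steps lets an out-of-order CPU overlap them.

```rust
/// One ChaCha quarter-round on a single block's words.
fn quarter_round(a: &mut u32, b: &mut u32, c: &mut u32, d: &mut u32) {
    *a = a.wrapping_add(*b); *d = (*d ^ *a).rotate_left(16);
    *c = c.wrapping_add(*d); *b = (*b ^ *c).rotate_left(12);
    *a = a.wrapping_add(*b); *d = (*d ^ *a).rotate_left(8);
    *c = c.wrapping_add(*d); *b = (*b ^ *c).rotate_left(7);
}

/// The same quarter-round applied to two independent blocks, with the
/// steps of the two dependency chains interleaved for better ILP.
fn quarter_round_x2(
    (a0, b0, c0, d0): (&mut u32, &mut u32, &mut u32, &mut u32),
    (a1, b1, c1, d1): (&mut u32, &mut u32, &mut u32, &mut u32),
) {
    *a0 = a0.wrapping_add(*b0);        *a1 = a1.wrapping_add(*b1);
    *d0 = (*d0 ^ *a0).rotate_left(16); *d1 = (*d1 ^ *a1).rotate_left(16);
    *c0 = c0.wrapping_add(*d0);        *c1 = c1.wrapping_add(*d1);
    *b0 = (*b0 ^ *c0).rotate_left(12); *b1 = (*b1 ^ *c1).rotate_left(12);
    *a0 = a0.wrapping_add(*b0);        *a1 = a1.wrapping_add(*b1);
    *d0 = (*d0 ^ *a0).rotate_left(8);  *d1 = (*d1 ^ *a1).rotate_left(8);
    *c0 = c0.wrapping_add(*d0);        *c1 = c1.wrapping_add(*d1);
    *b0 = (*b0 ^ *c0).rotate_left(7);  *b1 = (*b1 ^ *c1).rotate_left(7);
}

fn main() {
    let (mut x, mut y) = ([1u32, 2, 3, 4], [5u32, 6, 7, 8]);
    let (mut x2, mut y2) = (x, y);

    // Sequential: one block at a time.
    {
        let [a, b, c, d] = &mut x;
        quarter_round(a, b, c, d);
    }
    {
        let [a, b, c, d] = &mut y;
        quarter_round(a, b, c, d);
    }

    // Interleaved: both blocks' steps alternated.
    let ([a0, b0, c0, d0], [a1, b1, c1, d1]) = (&mut x2, &mut y2);
    quarter_round_x2((a0, b0, c0, d0), (a1, b1, c1, d1));

    // The interleaved version computes exactly the same results.
    assert_eq!(x, x2);
    assert_eq!(y, y2);
}
```

The reordering is safe precisely because no operation on block 0 reads or writes block 1's words; the interleaving only changes instruction scheduling, not results.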