Closed — str4d closed this pull request 3 years ago.
I originally tried interleaving the AVX2 operations for pairs of blocks with a macro, but the macro was getting rather complex, so I switched to the approach in this PR (adding methods to `StateWord`). It turns out that approach requires an MSRV bump (I didn't realise `chacha20` wasn't on 1.51 yet). I can switch back to the macro-based approach if desired (in which case I'd probably move to one of the array-macro crate dependencies, unless there's something else in the RustCrypto crate ecosystem I can use for this).
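For illustration, here is a minimal scalar sketch of the method-based approach. The real `StateWord` in this PR wraps SIMD vectors (`__m128i`/`__m256i`), so the lane types and method bodies below are stand-ins, not the PR's actual code:

```rust
/// Hypothetical scalar stand-in for the SIMD-backed `StateWord`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct StateWord([u32; 4]);

impl StateWord {
    /// Lane-wise wrapping addition, as used in the ChaCha quarter-round.
    fn add(self, rhs: Self) -> Self {
        StateWord([
            self.0[0].wrapping_add(rhs.0[0]),
            self.0[1].wrapping_add(rhs.0[1]),
            self.0[2].wrapping_add(rhs.0[2]),
            self.0[3].wrapping_add(rhs.0[3]),
        ])
    }

    /// Lane-wise XOR.
    fn xor(self, rhs: Self) -> Self {
        StateWord([
            self.0[0] ^ rhs.0[0],
            self.0[1] ^ rhs.0[1],
            self.0[2] ^ rhs.0[2],
            self.0[3] ^ rhs.0[3],
        ])
    }

    /// Lane-wise left rotation by a constant amount.
    fn rotate_left(self, n: u32) -> Self {
        StateWord([
            self.0[0].rotate_left(n),
            self.0[1].rotate_left(n),
            self.0[2].rotate_left(n),
            self.0[3].rotate_left(n),
        ])
    }
}

fn main() {
    let a = StateWord([1, 2, 3, 4]);
    let b = StateWord([10, 20, 30, 40]);
    assert_eq!(a.add(b), StateWord([11, 22, 33, 44]));
    assert_eq!(a.xor(a), StateWord([0, 0, 0, 0]));
    assert_eq!(StateWord([1, 0, 0, 0]).rotate_left(16), StateWord([0x10000, 0, 0, 0]));
}
```

The advantage over a macro is that each round operation reads as ordinary method calls, at the cost of requiring const-generics-era language features (hence the MSRV question).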
I'd say it's fine to bump MSRV. 1.51 is now 6 months old, which is plenty of time IMO.
This PR eliminates the performance difference between `chacha20` and `c2-chacha` for a 2 GB test when compiled with `+avx2`, and significantly reduces the gap in autodetect mode: https://github.com/str4d/rage/issues/57#issuecomment-907709912
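For context on autodetect mode, a minimal sketch of runtime backend selection using the standard library's `is_x86_feature_detected!` macro (the RustCrypto crates use the `cpufeatures` crate for this; `backend_name` is a hypothetical helper for illustration):

```rust
// Pick the best available backend at runtime, falling back to the
// portable ("soft") implementation on CPUs or targets without SIMD.
fn backend_name() -> &'static str {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return "avx2";
        } else if is_x86_feature_detected!("sse2") {
            return "sse2";
        }
    }
    "soft"
}

fn main() {
    // Which backend the dispatcher would pick on this machine.
    println!("selected backend: {}", backend_name());
}
```

Compiling with `-C target-feature=+avx2` skips this dispatch entirely, which is why the autodetect path carries some residual overhead.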
The remaining issue is that for the `rng` feature, the doubled buffer size means the `Results` type is now `[u32; 64]`, which doesn't impl `Default` 😢
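The underlying limitation is that core's `Default` impls for arrays only cover lengths up to 32, so `[u32; 64]` can't derive it. One workaround is a wrapper with a manual impl (`Results64` is a hypothetical name for illustration, not the crate's actual type):

```rust
/// Wrapper around the doubled results buffer, solely so we can
/// implement `Default` ourselves (std doesn't provide it for
/// arrays longer than 32 elements).
struct Results64([u32; 64]);

impl Default for Results64 {
    fn default() -> Self {
        Results64([0u32; 64])
    }
}

fn main() {
    let r = Results64::default();
    assert!(r.0.iter().all(|&w| w == 0));
    assert_eq!(r.0.len(), 64);
}
```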
> This PR eliminates the performance difference...

Awesome! 🎉
> The remaining issue is that for the `rng` feature...

Wow, that's really annoying. I was curious what `rand_chacha` did here, and it looks like they completely abandoned the AVX2 backend?
Relevant PR: https://github.com/rust-random/rand/pull/931
cc @dhardy
Edit: never mind, it still uses `ppv-lite86` and its AVX2 backend.
> I was curious what `rand_chacha` did here

It uses a `#[repr(transparent)]` wrapper type: https://github.com/rust-random/rand/blob/ee1aacd257d0e0bdbf27342c07e04270465e09c5/rand_chacha/src/chacha.rs#L26-L44
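A sketch of that trick, with `Buffer` as a hypothetical name: `#[repr(transparent)]` guarantees the wrapper has the same layout as the bare array, so it can carry trait impls (like `Default`) without changing how the bytes are stored or passed around:

```rust
/// Newtype with the same memory layout as the wrapped array,
/// guaranteed by #[repr(transparent)], but free to carry impls.
#[repr(transparent)]
struct Buffer([u32; 64]);

impl Default for Buffer {
    fn default() -> Self {
        Buffer([0u32; 64])
    }
}

impl AsRef<[u32]> for Buffer {
    fn as_ref(&self) -> &[u32] {
        &self.0
    }
}

fn main() {
    // Identical size (and alignment) to the bare array.
    assert_eq!(core::mem::size_of::<Buffer>(), core::mem::size_of::<[u32; 64]>());
    let b = Buffer::default();
    assert_eq!(b.as_ref().len(), 64);
}
```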
Tests are now passing.
Ran benchmarks on my machine (i7-8700K overclocked to 4.8 GHz):
```
$ cargo +nightly --version
cargo 1.56.0-nightly (b51439fd8 2021-08-09)
```
Current master (0.7.3):

```
     Running unittests (target/release/deps/chacha12-b35bb60eb267f274)
test bench1_10     ... bench:          9 ns/iter (+/- 0)     = 1111 MB/s
test bench2_100    ... bench:         50 ns/iter (+/- 2)     = 2000 MB/s
test bench3_1000   ... bench:        397 ns/iter (+/- 9)     = 2518 MB/s
test bench4_10000  ... bench:      3,889 ns/iter (+/- 168)   = 2571 MB/s
test bench5_100000 ... bench:     38,739 ns/iter (+/- 1,431) = 2581 MB/s

     Running unittests (target/release/deps/chacha20-c3750fdb2e6a6143)
test bench1_10     ... bench:         12 ns/iter (+/- 0)     =  833 MB/s
test bench2_100    ... bench:         72 ns/iter (+/- 3)     = 1388 MB/s
test bench3_1000   ... bench:        614 ns/iter (+/- 30)    = 1628 MB/s
test bench4_10000  ... bench:      5,959 ns/iter (+/- 244)   = 1678 MB/s
test bench5_100000 ... bench:     59,545 ns/iter (+/- 1,724) = 1679 MB/s

     Running unittests (target/release/deps/chacha8-f8e1d7fb0cf442ec)
test bench1_10     ... bench:          8 ns/iter (+/- 0)     = 1250 MB/s
test bench2_100    ... bench:         38 ns/iter (+/- 1)     = 2631 MB/s
test bench3_1000   ... bench:        295 ns/iter (+/- 8)     = 3389 MB/s
test bench4_10000  ... bench:      2,844 ns/iter (+/- 108)   = 3516 MB/s
test bench5_100000 ... bench:     28,393 ns/iter (+/- 2,068) = 3521 MB/s
```
This PR:

```
     Running unittests (target/release/deps/chacha12-b35bb60eb267f274)
test bench1_10     ... bench:          8 ns/iter (+/- 0)     = 1250 MB/s
test bench2_100    ... bench:         37 ns/iter (+/- 2)     = 2702 MB/s
test bench3_1000   ... bench:        285 ns/iter (+/- 12)    = 3508 MB/s
test bench4_10000  ... bench:      2,691 ns/iter (+/- 137)   = 3716 MB/s
test bench5_100000 ... bench:     26,804 ns/iter (+/- 1,187) = 3730 MB/s

     Running unittests (target/release/deps/chacha20-c3750fdb2e6a6143)
test bench1_10     ... bench:          9 ns/iter (+/- 0)     = 1111 MB/s
test bench2_100    ... bench:         52 ns/iter (+/- 3)     = 1923 MB/s
test bench3_1000   ... bench:        432 ns/iter (+/- 23)    = 2314 MB/s
test bench4_10000  ... bench:      4,126 ns/iter (+/- 133)   = 2423 MB/s
test bench5_100000 ... bench:     41,191 ns/iter (+/- 1,258) = 2427 MB/s

     Running unittests (target/release/deps/chacha8-f8e1d7fb0cf442ec)
test bench1_10     ... bench:          8 ns/iter (+/- 0)     = 1250 MB/s
test bench2_100    ... bench:         31 ns/iter (+/- 1)     = 3225 MB/s
test bench3_1000   ... bench:        211 ns/iter (+/- 12)    = 4739 MB/s
test bench4_10000  ... bench:      1,978 ns/iter (+/- 101)   = 5055 MB/s
test bench5_100000 ... bench:     19,835 ns/iter (+/- 759)   = 5041 MB/s
```
Looks like you've answered your question, but you'd better ask @kazcw.
We switch to a 4-block buffer for the combined SSE2 / AVX2 backend, which allows the AVX2 backend to process them together, while the SSE2 backend continues to process one block at a time.
The AVX2 backend is refactored to enable interleaving the instructions per pair of blocks, for better ILP.
Closes #262.
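As a scalar illustration of the ILP point (the real backend interleaves AVX2 vector instructions, not `u32` operations, and the function names here are hypothetical): each ChaCha quarter-round is a serial dependency chain, but the quarter-rounds of two different blocks are independent, so alternating their steps lets an out-of-order CPU overlap them.

```rust
/// One ChaCha quarter-round on a single block's words.
fn quarter_round(a: &mut u32, b: &mut u32, c: &mut u32, d: &mut u32) {
    *a = a.wrapping_add(*b); *d = (*d ^ *a).rotate_left(16);
    *c = c.wrapping_add(*d); *b = (*b ^ *c).rotate_left(12);
    *a = a.wrapping_add(*b); *d = (*d ^ *a).rotate_left(8);
    *c = c.wrapping_add(*d); *b = (*b ^ *c).rotate_left(7);
}

/// The same quarter-round applied to two independent blocks, with the
/// steps of the two dependency chains interleaved for better ILP.
fn quarter_round_x2(
    (a0, b0, c0, d0): (&mut u32, &mut u32, &mut u32, &mut u32),
    (a1, b1, c1, d1): (&mut u32, &mut u32, &mut u32, &mut u32),
) {
    *a0 = a0.wrapping_add(*b0);        *a1 = a1.wrapping_add(*b1);
    *d0 = (*d0 ^ *a0).rotate_left(16); *d1 = (*d1 ^ *a1).rotate_left(16);
    *c0 = c0.wrapping_add(*d0);        *c1 = c1.wrapping_add(*d1);
    *b0 = (*b0 ^ *c0).rotate_left(12); *b1 = (*b1 ^ *c1).rotate_left(12);
    *a0 = a0.wrapping_add(*b0);        *a1 = a1.wrapping_add(*b1);
    *d0 = (*d0 ^ *a0).rotate_left(8);  *d1 = (*d1 ^ *a1).rotate_left(8);
    *c0 = c0.wrapping_add(*d0);        *c1 = c1.wrapping_add(*d1);
    *b0 = (*b0 ^ *c0).rotate_left(7);  *b1 = (*b1 ^ *c1).rotate_left(7);
}

fn main() {
    let (mut x, mut y) = ([1u32, 2, 3, 4], [5u32, 6, 7, 8]);
    let (mut x2, mut y2) = (x, y);

    // Sequential: one block at a time.
    {
        let [a, b, c, d] = &mut x;
        quarter_round(a, b, c, d);
    }
    {
        let [a, b, c, d] = &mut y;
        quarter_round(a, b, c, d);
    }

    // Interleaved: both blocks' steps alternated.
    let ([a0, b0, c0, d0], [a1, b1, c1, d1]) = (&mut x2, &mut y2);
    quarter_round_x2((a0, b0, c0, d0), (a1, b1, c1, d1));

    // The interleaved version computes exactly the same results.
    assert_eq!(x, x2);
    assert_eq!(y, y2);
}
```

The reordering is safe precisely because no operation on block 0 reads or writes block 1's words; the interleaving only changes instruction scheduling, not results.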