RustCrypto / stream-ciphers

Collection of stream cipher algorithms
264 stars 50 forks source link

chacha20: Add full NEON backend. #310

Closed codahale closed 1 year ago

codahale commented 2 years ago

I ported the NEON implementation of ChaCha from Crypto++ (public domain) to Rust with aarch64 intrinsics for a significant performance boost.

Observed performance changes on an Apple M1 Air:

 name                   chacha-soft ns/iter  chacha-neon ns/iter  diff ns/iter   diff %  speedup
 chacha12_bench1_16b    34 (470 MB/s)        28 (571 MB/s)                  -6  -17.65%   x 1.21
 chacha12_bench2_256b   477 (536 MB/s)       132 (1939 MB/s)              -345  -72.33%   x 3.61
 chacha12_bench3_1kib   1,897 (539 MB/s)     503 (2035 MB/s)            -1,394  -73.48%   x 3.77
 chacha12_bench4_16kib  30,811 (531 MB/s)    7,914 (2070 MB/s)         -22,897  -74.31%   x 3.89
 chacha20_bench1_16b    51 (313 MB/s)        47 (340 MB/s)                  -4   -7.84%   x 1.09
 chacha20_bench2_256b   777 (329 MB/s)       212 (1207 MB/s)              -565  -72.72%   x 3.67
 chacha20_bench3_1kib   3,088 (331 MB/s)     821 (1247 MB/s)            -2,267  -73.41%   x 3.76
 chacha20_bench4_16kib  50,251 (326 MB/s)    13,001 (1260 MB/s)        -37,250  -74.13%   x 3.87
 chacha8_bench1_16b     26 (615 MB/s)        19 (842 MB/s)                  -7  -26.92%   x 1.37
 chacha8_bench2_256b    335 (764 MB/s)       92 (2782 MB/s)               -243  -72.54%   x 3.64
 chacha8_bench3_1kib    1,328 (771 MB/s)     344 (2976 MB/s)              -984  -74.10%   x 3.86
 chacha8_bench4_16kib   21,184 (773 MB/s)    5,371 (3050 MB/s)         -15,813  -74.65%   x 3.94

Closes #287.

I’m not entirely certain I’ve got the flag/feature/cpuid token stuff right, and would appreciate any guidance about that. At this point the best I’ve got is that it very definitely works on my machine, an M1 Air. Also, no idea how one runs GitHub Actions on a aarch64/NEON platform. QEMU?

codahale commented 2 years ago

Welp, didn’t realize that vreinterpretq_u64_u32 was nightly-only. I guess feel free to revive this when/if stable SIMD lands. ☹️

tarcieri commented 2 years ago

no idea how one runs GitHub Actions on a aarch64/NEON platform. QEMU?

As it were we have another PR open to add ARM support to the keccak crate which covers a bunch of this: https://github.com/RustCrypto/sponges/pull/23

You can use cross. Here's a example: https://github.com/RustCrypto/block-ciphers/blob/4334b85/.github/workflows/aes.yml#L222-L249

Unfortunately there's no M1-specific solution until this lands: https://github.com/github/roadmap/issues/528

Welp, didn’t realize that vreinterpretq_u64_u32 was nightly-only.

FWIW we'd be fine with a nightly-only feature. We have similar nightly-only ARM features in the aes and polyval crates which wrap ARMv8 hardware intrinsics supported by M1s.

I guess feel free to revive this when/if stable SIMD lands.

Sure looking forward to the eventual stabilization of core::simd and the day we can have a portable SIMD implementation of ChaCha.

codahale commented 2 years ago

FWIW, this code compiles and passes tests on my M1 Air using stable-aarch64-apple-darwin 1.64.0. I don’t know what to make of that. Any suggestions?

tarcieri commented 2 years ago

Also regarding this specifically:

...vreinterpretq_u64_u32 was nightly-only.

There are various stable workarounds, such as using transmute, pointer casts, or core::slice::from_raw_parts.

I'm guessing that's not the only nightly-only NEON intrinsics support you need, though.

tarcieri commented 2 years ago

this code compiles and passes tests on my M1 Air using stable-aarch64-apple-darwin 1.64.0

Maybe all the intrinsics you need are stabilized on 1.64?

If that's the case, you can just feature-gate support to prevent MSRV breakages.

It'd be good to know the actual MSRV of the feature.

codahale commented 2 years ago

Ok, looks like vreinterpretq_u64_u32 landed in 1.61. I re-added the neon feature to handle the MSRV issue and added target_feature gates for the backend.

codahale commented 2 years ago

Enabled CI (using the existing but commented-out cross matrix item) and updated the README.

I think this is good to go. Let me know if anything else needs changing.

tarcieri commented 2 years ago

Cool, will try to review this week sometime

tarcieri commented 1 year ago

This otherwise seems like a fairly straightforward port of chacha_simd and I'm fine to merge it if we can figure out how it should be gated.

tarcieri commented 1 year ago

Going to merge this and follow up on how it should be gated

Mygod commented 1 year ago

Could you make a new release with this PR?

tarcieri commented 1 year ago

Unfortunately I haven’t added the gating I suggested which should be implemented prior to a release