RustCrypto / stream-ciphers

Collection of stream cipher algorithms

chacha20: Add NEON implementation for aarch64 #274

Closed (str4d closed this 3 years ago)

str4d commented 3 years ago

Processes four blocks in parallel. Adapted from the SUPERCOP dolbeau backend (public domain).
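
To illustrate what "four blocks in parallel" means, here is a minimal portable sketch, not the PR's NEON code and not the SUPERCOP source. Each state word is held as four lanes, one lane per block, so every add, xor, and rotate advances four blocks at once; a NEON backend would keep each `Lanes` value in a 128-bit vector register instead. The names `Lanes`, `quarter_round_x4`, and `chacha_blocks_x4` are hypothetical, and the state layout (32-bit counter in word 12, 96-bit nonce in words 13 to 15) is assumed from RFC 8439 rather than taken from this crate.

```rust
/// One ChaCha state word for four independent blocks (one lane per block).
type Lanes = [u32; 4];

/// "expand 32-byte k" as four little-endian words.
const CONSTANTS: [u32; 4] = [0x6170_7865, 0x3320_646e, 0x7962_2d32, 0x6b20_6574];

/// Lane-wise wrapping addition.
fn add(a: Lanes, b: Lanes) -> Lanes {
    std::array::from_fn(|i| a[i].wrapping_add(b[i]))
}

/// Lane-wise xor followed by a left rotation by `n` bits.
fn xor_rotl(a: Lanes, b: Lanes, n: u32) -> Lanes {
    std::array::from_fn(|i| (a[i] ^ b[i]).rotate_left(n))
}

/// The ChaCha quarter round, applied to four blocks at once.
fn quarter_round_x4(s: &mut [Lanes; 16], a: usize, b: usize, c: usize, d: usize) {
    s[a] = add(s[a], s[b]);
    s[d] = xor_rotl(s[d], s[a], 16);
    s[c] = add(s[c], s[d]);
    s[b] = xor_rotl(s[b], s[c], 12);
    s[a] = add(s[a], s[b]);
    s[d] = xor_rotl(s[d], s[a], 8);
    s[c] = add(s[c], s[d]);
    s[b] = xor_rotl(s[b], s[c], 7);
}

/// Computes the blocks for `counter`, `counter + 1`, `counter + 2`, `counter + 3`.
/// `rounds` is 8, 12, or 20. The result is indexed as `[block][word]`.
pub fn chacha_blocks_x4(
    key: &[u32; 8],
    counter: u32,
    nonce: &[u32; 3],
    rounds: usize,
) -> [[u32; 16]; 4] {
    // Broadcast the shared words to all lanes; only the counter differs per lane.
    let mut init = [[0u32; 4]; 16];
    for i in 0..4 {
        init[i] = [CONSTANTS[i]; 4];
    }
    for i in 0..8 {
        init[4 + i] = [key[i]; 4];
    }
    init[12] = std::array::from_fn(|lane| counter.wrapping_add(lane as u32));
    for i in 0..3 {
        init[13 + i] = [nonce[i]; 4];
    }

    let mut s = init;
    for _ in 0..rounds / 2 {
        // Column round.
        quarter_round_x4(&mut s, 0, 4, 8, 12);
        quarter_round_x4(&mut s, 1, 5, 9, 13);
        quarter_round_x4(&mut s, 2, 6, 10, 14);
        quarter_round_x4(&mut s, 3, 7, 11, 15);
        // Diagonal round.
        quarter_round_x4(&mut s, 0, 5, 10, 15);
        quarter_round_x4(&mut s, 1, 6, 11, 12);
        quarter_round_x4(&mut s, 2, 7, 8, 13);
        quarter_round_x4(&mut s, 3, 4, 9, 14);
    }

    // Add the initial state back in and transpose from [word][lane] to [block][word].
    std::array::from_fn(|lane| {
        std::array::from_fn(|w| s[w][lane].wrapping_add(init[w][lane]))
    })
}
```

Because the lanes never interact, block 1 of a call starting at counter `n` equals block 0 of a call starting at `n + 1`. That independence is what lets a SIMD backend process four blocks with one instruction stream.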

Placed behind a new nightly feature flag, as aarch64 SIMD intrinsics are not yet stable.
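
As a rough illustration of this kind of gating, a backend can be chosen at compile time from a Cargo feature plus the target architecture. The feature name `nightly` matches the flag passed in the benchmark command later in this thread; the function `backend_name` is hypothetical and only stands in for whatever module selection the crate actually performs.

```rust
// Hypothetical helper: reports which backend the cfg gates below select.
// With the `nightly` feature enabled on aarch64, the NEON path is compiled in.
#[cfg(all(feature = "nightly", target_arch = "aarch64"))]
pub fn backend_name() -> &'static str {
    "neon"
}

// Every other configuration (feature disabled, or a different target
// architecture) falls back to the portable software implementation.
#[cfg(not(all(feature = "nightly", target_arch = "aarch64")))]
pub fn backend_name() -> &'static str {
    "soft"
}
```

Exactly one of the two definitions survives compilation, so callers never need to know which one was chosen.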

str4d commented 3 years ago

Benchmark environment (ODROID-N2):

$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              6
On-line CPU(s) list: 0-5
Thread(s) per core:  1
Core(s) per socket:  3
Socket(s):           2
Vendor ID:           ARM
Model:               4
Model name:          Cortex-A53
Stepping:            r0p4
CPU max MHz:         1896.0000
CPU min MHz:         100.0000
BogoMIPS:            48.00
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32
$ cargo +nightly --version
cargo 1.56.0-nightly (f559c109c 2021-08-26)

Current master:

$ cargo +nightly bench -p chacha20
     Running unittests (target/release/deps/chacha12-206bfb5dd069506d)
test bench1_10     ... bench:          57 ns/iter (+/- 0) = 175 MB/s
test bench2_100    ... bench:         465 ns/iter (+/- 2) = 215 MB/s
test bench3_1000   ... bench:       5,773 ns/iter (+/- 16) = 173 MB/s
test bench4_10000  ... bench:      58,764 ns/iter (+/- 179) = 170 MB/s
test bench5_100000 ... bench:     588,846 ns/iter (+/- 2,404) = 169 MB/s

     Running unittests (target/release/deps/chacha20-8b7c91f66bd44380)
test bench1_10     ... bench:          74 ns/iter (+/- 0) = 135 MB/s
test bench2_100    ... bench:         634 ns/iter (+/- 2) = 157 MB/s
test bench3_1000   ... bench:       7,472 ns/iter (+/- 6) = 133 MB/s
test bench4_10000  ... bench:      75,752 ns/iter (+/- 91) = 132 MB/s
test bench5_100000 ... bench:     758,864 ns/iter (+/- 888) = 131 MB/s

     Running unittests (target/release/deps/chacha8-f812fddd0a2f7553)
test bench1_10     ... bench:          50 ns/iter (+/- 0) = 200 MB/s
test bench2_100    ... bench:         380 ns/iter (+/- 1) = 263 MB/s
test bench3_1000   ... bench:       4,918 ns/iter (+/- 4) = 203 MB/s
test bench4_10000  ... bench:      50,220 ns/iter (+/- 88) = 199 MB/s
test bench5_100000 ... bench:     503,601 ns/iter (+/- 1,495) = 198 MB/s

This PR:

$ cargo +nightly bench --features nightly -p chacha20
     Running unittests (target/release/deps/chacha12-92e41f0468c669c0)
test bench1_10     ... bench:          54 ns/iter (+/- 1) = 185 MB/s
test bench2_100    ... bench:         264 ns/iter (+/- 1) = 378 MB/s
test bench3_1000   ... bench:       2,130 ns/iter (+/- 4) = 469 MB/s
test bench4_10000  ... bench:      20,222 ns/iter (+/- 31) = 494 MB/s
test bench5_100000 ... bench:     204,457 ns/iter (+/- 332) = 489 MB/s

     Running unittests (target/release/deps/chacha20-7fc4cc5d1c2fd949)
test bench1_10     ... bench:          66 ns/iter (+/- 1) = 151 MB/s
test bench2_100    ... bench:         373 ns/iter (+/- 1) = 268 MB/s
test bench3_1000   ... bench:       3,226 ns/iter (+/- 15) = 309 MB/s
test bench4_10000  ... bench:      31,221 ns/iter (+/- 87) = 320 MB/s
test bench5_100000 ... bench:     314,620 ns/iter (+/- 684) = 317 MB/s

     Running unittests (target/release/deps/chacha8-218e8a820d75dd02)
test bench1_10     ... bench:          49 ns/iter (+/- 0) = 204 MB/s
test bench2_100    ... bench:         209 ns/iter (+/- 1) = 478 MB/s
test bench3_1000   ... bench:       1,582 ns/iter (+/- 3) = 632 MB/s
test bench4_10000  ... bench:      14,735 ns/iter (+/- 15) = 678 MB/s
test bench5_100000 ... bench:     149,421 ns/iter (+/- 304) = 669 MB/s
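
To put the two runs side by side, the sketch below computes the speedup for the largest benchmark (`bench5_100000`) from the ns/iter figures above. It works out to roughly 3.4x for chacha8, 2.9x for chacha12, and 2.4x for chacha20. The helper names are hypothetical; `throughput_mb_s` uses an integer formula that reproduces the MB/s values printed in the output.

```rust
/// Throughput in MB/s using truncating integer arithmetic:
/// bytes per iteration * 1000 / nanoseconds per iteration.
fn throughput_mb_s(bytes_per_iter: u64, ns_per_iter: u64) -> u64 {
    bytes_per_iter * 1000 / ns_per_iter
}

/// How many times faster the new run is than the baseline.
fn speedup(baseline_ns: u64, new_ns: u64) -> f64 {
    baseline_ns as f64 / new_ns as f64
}

/// (variant, master ns/iter, PR ns/iter) for bench5_100000,
/// copied from the benchmark output above.
const BENCH5_100000: [(&str, u64, u64); 3] = [
    ("chacha8", 503_601, 149_421),
    ("chacha12", 588_846, 204_457),
    ("chacha20", 758_864, 314_620),
];

/// One summary line per variant, e.g. "name: 1.23x (100 -> 123 MB/s)".
pub fn report() -> Vec<String> {
    BENCH5_100000
        .iter()
        .map(|&(name, master_ns, pr_ns)| {
            format!(
                "{}: {:.2}x ({} -> {} MB/s)",
                name,
                speedup(master_ns, pr_ns),
                throughput_mb_s(100_000, master_ns),
                throughput_mb_s(100_000, pr_ns),
            )
        })
        .collect()
}
```

The 10-byte benchmarks barely change, which is consistent with a backend that only pays off once several 64-byte blocks are produced per call.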

str4d commented 3 years ago

This backend should also work for target_arch = "arm", but that would require cpufeatures to support checking for NEON at runtime, plus a fallback to a 4-block wrapper around the soft impl for CPUs without NEON.
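
A minimal sketch of the two pieces mentioned above, under stated assumptions: `Backend`, `neon_detected`, `select_backend`, and `soft_blocks_x4` are hypothetical names, and the single-block soft implementation is passed in as a closure rather than taken from the crate. On aarch64 the check uses the standard library's `is_aarch64_feature_detected!` macro; for every other target, including 32-bit arm, the sketch simply reports `false` as a placeholder for the runtime check that cpufeatures would need to provide.

```rust
/// Backends a dispatcher could choose between (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Neon,
    Soft,
}

/// Runtime NEON check on aarch64, via the standard library.
#[cfg(target_arch = "aarch64")]
fn neon_detected() -> bool {
    std::arch::is_aarch64_feature_detected!("neon")
}

/// Placeholder for all other targets: always selects the fallback.
#[cfg(not(target_arch = "aarch64"))]
fn neon_detected() -> bool {
    false
}

/// Picks the NEON backend when it is available, otherwise the soft one.
pub fn select_backend() -> Backend {
    if neon_detected() {
        Backend::Neon
    } else {
        Backend::Soft
    }
}

/// 4-block wrapper around a single-block soft implementation: calls
/// `soft_block` for four consecutive counters and concatenates the results,
/// so it can sit behind the same four-block interface as the NEON path.
pub fn soft_blocks_x4<F>(mut soft_block: F, counter: u32) -> [u8; 256]
where
    F: FnMut(u32) -> [u8; 64],
{
    let mut out = [0u8; 256];
    for (i, chunk) in out.chunks_exact_mut(64).enumerate() {
        chunk.copy_from_slice(&soft_block(counter.wrapping_add(i as u32)));
    }
    out
}
```

With this shape, callers always request four blocks at a time and the dispatcher decides once whether they come from SIMD code or from four sequential soft-impl calls.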