Closed by str4d 3 years ago
Benchmark environment (ODROID-N2):
$ lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Thread(s) per core: 1
Core(s) per socket: 3
Socket(s): 2
Vendor ID: ARM
Model: 4
Model name: Cortex-A53
Stepping: r0p4
CPU max MHz: 1896.0000
CPU min MHz: 100.0000
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32
$ cargo +nightly --version
cargo 1.56.0-nightly (f559c109c 2021-08-26)
Current master:
$ cargo +nightly bench -p chacha20
Running unittests (target/release/deps/chacha12-206bfb5dd069506d)
test bench1_10 ... bench: 57 ns/iter (+/- 0) = 175 MB/s
test bench2_100 ... bench: 465 ns/iter (+/- 2) = 215 MB/s
test bench3_1000 ... bench: 5,773 ns/iter (+/- 16) = 173 MB/s
test bench4_10000 ... bench: 58,764 ns/iter (+/- 179) = 170 MB/s
test bench5_100000 ... bench: 588,846 ns/iter (+/- 2,404) = 169 MB/s
Running unittests (target/release/deps/chacha20-8b7c91f66bd44380)
test bench1_10 ... bench: 74 ns/iter (+/- 0) = 135 MB/s
test bench2_100 ... bench: 634 ns/iter (+/- 2) = 157 MB/s
test bench3_1000 ... bench: 7,472 ns/iter (+/- 6) = 133 MB/s
test bench4_10000 ... bench: 75,752 ns/iter (+/- 91) = 132 MB/s
test bench5_100000 ... bench: 758,864 ns/iter (+/- 888) = 131 MB/s
Running unittests (target/release/deps/chacha8-f812fddd0a2f7553)
test bench1_10 ... bench: 50 ns/iter (+/- 0) = 200 MB/s
test bench2_100 ... bench: 380 ns/iter (+/- 1) = 263 MB/s
test bench3_1000 ... bench: 4,918 ns/iter (+/- 4) = 203 MB/s
test bench4_10000 ... bench: 50,220 ns/iter (+/- 88) = 199 MB/s
test bench5_100000 ... bench: 503,601 ns/iter (+/- 1,495) = 198 MB/s
This PR:
$ cargo +nightly bench --features nightly -p chacha20
Running unittests (target/release/deps/chacha12-92e41f0468c669c0)
test bench1_10 ... bench: 54 ns/iter (+/- 1) = 185 MB/s
test bench2_100 ... bench: 264 ns/iter (+/- 1) = 378 MB/s
test bench3_1000 ... bench: 2,130 ns/iter (+/- 4) = 469 MB/s
test bench4_10000 ... bench: 20,222 ns/iter (+/- 31) = 494 MB/s
test bench5_100000 ... bench: 204,457 ns/iter (+/- 332) = 489 MB/s
Running unittests (target/release/deps/chacha20-7fc4cc5d1c2fd949)
test bench1_10 ... bench: 66 ns/iter (+/- 1) = 151 MB/s
test bench2_100 ... bench: 373 ns/iter (+/- 1) = 268 MB/s
test bench3_1000 ... bench: 3,226 ns/iter (+/- 15) = 309 MB/s
test bench4_10000 ... bench: 31,221 ns/iter (+/- 87) = 320 MB/s
test bench5_100000 ... bench: 314,620 ns/iter (+/- 684) = 317 MB/s
Running unittests (target/release/deps/chacha8-218e8a820d75dd02)
test bench1_10 ... bench: 49 ns/iter (+/- 0) = 204 MB/s
test bench2_100 ... bench: 209 ns/iter (+/- 1) = 478 MB/s
test bench3_1000 ... bench: 1,582 ns/iter (+/- 3) = 632 MB/s
test bench4_10000 ... bench: 14,735 ns/iter (+/- 15) = 678 MB/s
test bench5_100000 ... bench: 149,421 ns/iter (+/- 304) = 669 MB/s
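The MB/s column in the listings above is consistent with an integer-truncated `bytes * 1000 / ns`. That formula is inferred from the numbers, not stated in the PR. A minimal sketch for reproducing that column and comparing two runs follows; the helper names `mb_per_s` and `speedup` are hypothetical.

```rust
/// Throughput in MB/s from a payload size and a per-iteration time.
/// Integer division truncates, which matches the figures in the listings
/// (for example 10 bytes at 57 ns gives 175).
fn mb_per_s(bytes: u64, ns_per_iter: u64) -> u64 {
    bytes * 1000 / ns_per_iter
}

/// Ratio of the baseline time to the new time for the same benchmark.
fn speedup(baseline_ns: u64, new_ns: u64) -> f64 {
    baseline_ns as f64 / new_ns as f64
}
```

For example, `speedup(758_864, 314_620)` compares the two `chacha20` `bench5_100000` timings.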
This backend should also work for `target_arch = "arm"`, but that requires support in `cpufeatures` for checking NEON support at runtime, and then a fallback to a 4-block wrapper around the soft impl.
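The runtime check and fallback described above could take roughly the following shape. This is only a sketch under stated assumptions: it uses the standard library's `is_aarch64_feature_detected!` macro as a stand-in for the `cpufeatures` support that does not exist yet for 32-bit ARM, and `Backend`, `select_backend`, `SingleBlock`, and `SoftWrapperX4` are hypothetical names, not items from the crate.

```rust
/// Which keystream backend to use (hypothetical names for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Backend {
    Neon,
    SoftX4,
}

/// Picks the NEON backend when the CPU reports NEON at runtime, and the
/// software fallback otherwise. Assumption: std's aarch64 detection macro
/// stands in for the missing 32-bit ARM detection.
fn select_backend() -> Backend {
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Backend::Neon;
        }
    }
    Backend::SoftX4
}

/// A software backend that produces one 64-byte block at a time.
trait SingleBlock {
    /// Writes the keystream block for `counter` into `out` (64 bytes).
    fn generate(&mut self, counter: u32, out: &mut [u8]);
}

/// Wraps a single-block backend so that it exposes the same
/// four-blocks-per-call interface as the NEON backend.
struct SoftWrapperX4<B>(B);

impl<B: SingleBlock> SoftWrapperX4<B> {
    fn generate4(&mut self, counter: u32, out: &mut [u8; 256]) {
        // Four consecutive counters, one 64-byte chunk each.
        for (i, chunk) in out.chunks_exact_mut(64).enumerate() {
            self.0.generate(counter.wrapping_add(i as u32), chunk);
        }
    }
}
```

With a wrapper like this, the calling code can always work in 256-byte units and does not need to know which backend was selected.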
Processes four blocks in parallel. Adapted from the SUPERCOP `dolbeau` backend (public domain). Placed behind a new `nightly` feature flag, as aarch64 SIMD intrinsics are not yet stable.
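To illustrate what processing four blocks in parallel can look like, here is a portable sketch that uses `[u32; 4]` arrays in place of NEON registers. The layout is an assumption for illustration (each vector holds the same state word from four consecutive blocks) and is not confirmed to be the layout this PR uses. All function names are hypothetical.

```rust
/// "expand 32-byte k" as four little-endian words.
const SIGMA: [u32; 4] = [0x6170_7865, 0x3320_646e, 0x7962_2d32, 0x6b20_6574];

/// Quarter-round indices for one double round: four columns, then four diagonals.
const ROUND_INDICES: [[usize; 4]; 8] = [
    [0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15],
    [0, 5, 10, 15], [1, 6, 11, 12], [2, 7, 8, 13], [3, 4, 9, 14],
];

fn initial_state(key: &[u32; 8], counter: u32, nonce: &[u32; 3]) -> [u32; 16] {
    let mut s = [0u32; 16];
    s[..4].copy_from_slice(&SIGMA);
    s[4..12].copy_from_slice(key);
    s[12] = counter;
    s[13..].copy_from_slice(nonce);
    s
}

/// Scalar reference: one block at a time.
fn block(key: &[u32; 8], counter: u32, nonce: &[u32; 3], rounds: usize) -> [u32; 16] {
    let init = initial_state(key, counter, nonce);
    let mut s = init;
    for _ in 0..rounds / 2 {
        for [a, b, c, d] in ROUND_INDICES {
            s[a] = s[a].wrapping_add(s[b]); s[d] = (s[d] ^ s[a]).rotate_left(16);
            s[c] = s[c].wrapping_add(s[d]); s[b] = (s[b] ^ s[c]).rotate_left(12);
            s[a] = s[a].wrapping_add(s[b]); s[d] = (s[d] ^ s[a]).rotate_left(8);
            s[c] = s[c].wrapping_add(s[d]); s[b] = (s[b] ^ s[c]).rotate_left(7);
        }
    }
    for i in 0..16 {
        s[i] = s[i].wrapping_add(init[i]);
    }
    s
}

/// Four 32-bit lanes; stands in for a NEON `uint32x4_t` in this sketch.
type Lanes = [u32; 4];

fn add(a: Lanes, b: Lanes) -> Lanes {
    std::array::from_fn(|i| a[i].wrapping_add(b[i]))
}

fn xor_rotl(a: Lanes, b: Lanes, n: u32) -> Lanes {
    std::array::from_fn(|i| (a[i] ^ b[i]).rotate_left(n))
}

/// Computes the blocks for `counter..counter + 4` with lane-wise operations.
fn four_blocks(key: &[u32; 8], counter: u32, nonce: &[u32; 3], rounds: usize) -> [[u32; 16]; 4] {
    let base = initial_state(key, counter, nonce);
    // Broadcast every state word to all lanes; only the counter differs per lane.
    let mut init: [Lanes; 16] = std::array::from_fn(|w| [base[w]; 4]);
    init[12] = std::array::from_fn(|lane| counter.wrapping_add(lane as u32));
    let mut s = init;
    for _ in 0..rounds / 2 {
        for [a, b, c, d] in ROUND_INDICES {
            s[a] = add(s[a], s[b]); s[d] = xor_rotl(s[d], s[a], 16);
            s[c] = add(s[c], s[d]); s[b] = xor_rotl(s[b], s[c], 12);
            s[a] = add(s[a], s[b]); s[d] = xor_rotl(s[d], s[a], 8);
            s[c] = add(s[c], s[d]); s[b] = xor_rotl(s[b], s[c], 7);
        }
    }
    // Add the initial state and transpose back to one array per block.
    std::array::from_fn(|lane| std::array::from_fn(|w| s[w][lane].wrapping_add(init[w][lane])))
}
```

With this layout every quarter-round step becomes one lane-wise add, XOR, or rotate, so no shuffling between lanes is needed until the final transpose. A real NEON backend would replace `add` and `xor_rotl` with intrinsics.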