RustCrypto / stream-ciphers

Collection of stream cipher algorithms
255 stars 49 forks source link

No performance improvements with `-Ctarget-feature=+neon` on `arm64` #360

Closed Slixe closed 3 months ago

Slixe commented 3 months ago

Hey there, thanks for this amazing repo!

I'm currently using chacha20 crate, unfortunately, on arm64 CPU I'm getting poor performances with and without -Ctarget-feature=+neon.

test chacha12_bench1_16b   ... bench:          34.15 ns/iter (+/- 1.03) = 470 MB/s
test chacha12_bench2_256b  ... bench:         417.99 ns/iter (+/- 16.83) = 613 MB/s
test chacha12_bench3_1kib  ... bench:       1,653.14 ns/iter (+/- 39.99) = 619 MB/s
test chacha12_bench4_16kib ... bench:      26,417.67 ns/iter (+/- 919.50) = 620 MB/s
test chacha20_bench1_16b   ... bench:          48.31 ns/iter (+/- 0.44) = 333 MB/s
test chacha20_bench2_256b  ... bench:         704.38 ns/iter (+/- 53.47) = 363 MB/s
test chacha20_bench3_1kib  ... bench:       2,710.32 ns/iter (+/- 109.84) = 377 MB/s
test chacha20_bench4_16kib ... bench:      43,389.69 ns/iter (+/- 2,237.09) = 377 MB/s
test chacha8_bench1_16b    ... bench:          26.71 ns/iter (+/- 2.61) = 615 MB/s
test chacha8_bench2_256b   ... bench:         284.31 ns/iter (+/- 12.14) = 901 MB/s
test chacha8_bench3_1kib   ... bench:       1,121.32 ns/iter (+/- 50.64) = 913 MB/s
test chacha8_bench4_16kib  ... bench:      17,929.43 ns/iter (+/- 378.40) = 913 MB/s

Am I missing something ? Thanks in advance :)

newpavlov commented 3 months ago

With chacha20 v0.9 you need to also provide the chacha20_force_neon configuration flag described in the docs (note that it's NOT a crate feature). In v0.10 it probably will work as expected without providing this configuration flag (though we haven't implemented it yet).

Slixe commented 3 months ago

Mea culpa for missing that! I was using a clone version (so v0.10.0-pre) and expected to work directly.

I just followed the README.md from https://crates.io/crates/chacha20

With chacha20_force_neon I get the following now:

test chacha12_bench1_16b   ... bench:          16.14 ns/iter (+/- 0.73) = 1000 MB/s
test chacha12_bench2_256b  ... bench:         111.41 ns/iter (+/- 3.24) = 2306 MB/s
test chacha12_bench3_1kib  ... bench:         440.84 ns/iter (+/- 11.52) = 2327 MB/s
test chacha12_bench4_16kib ... bench:       7,054.65 ns/iter (+/- 344.24) = 2322 MB/s
test chacha20_bench1_16b   ... bench:          29.74 ns/iter (+/- 1.23) = 551 MB/s
test chacha20_bench2_256b  ... bench:         187.54 ns/iter (+/- 4.40) = 1368 MB/s
test chacha20_bench3_1kib  ... bench:         746.00 ns/iter (+/- 6.39) = 1372 MB/s
test chacha20_bench4_16kib ... bench:      11,934.89 ns/iter (+/- 130.76) = 1372 MB/s
test chacha8_bench1_16b    ... bench:          11.72 ns/iter (+/- 0.46) = 1454 MB/s
test chacha8_bench2_256b   ... bench:          73.39 ns/iter (+/- 2.77) = 3506 MB/s
test chacha8_bench3_1kib   ... bench:         294.11 ns/iter (+/- 9.41) = 3482 MB/s
test chacha8_bench4_16kib  ... bench:       4,701.54 ns/iter (+/- 167.79) = 3485 MB/s

I'm asking maybe too much but, can't the feature be directly detected and enabled ?

newpavlov commented 3 months ago

Adding compile-time detection should be relatively easy and we should do it for v0.10 (we essentially just need to tweak our cfgs a bit). This should be sufficient in practice for AArch64 as discussed here.

tarcieri commented 3 months ago

It was previously nightly-only but looks like it works on stable now, so yeah, we should be able to enable it unconditionally on aarch64 targets

Slixe commented 3 months ago

Great to hear! Well, I have everything I need, thanks a lot for your hard work guys, I close the issue.

tarcieri commented 3 months ago

Here's a PR to enable it by default: https://github.com/RustCrypto/stream-ciphers/pull/361