sha2: Add aarch64 backends for SHA2.

codahale commented 1 year ago

Adds NEON-enabled backends for SHA2 on aarch64.

Eliminates the need for the asm feature on aarch64 to get good SHA-{224, 256} performance, and provides a big performance boost for SHA-512 (about 2.6x in the benchmarks below), which didn’t benefit from the asm feature on aarch64.

Before:

test sha256_10    ... bench:          27 ns/iter (+/- 0) = 370 MB/s
test sha256_100   ... bench:         278 ns/iter (+/- 3) = 359 MB/s
test sha256_1000  ... bench:       2,747 ns/iter (+/- 24) = 364 MB/s
test sha256_10000 ... bench:      27,392 ns/iter (+/- 293) = 365 MB/s
test sha512_10    ... bench:          17 ns/iter (+/- 0) = 588 MB/s
test sha512_100   ... bench:         164 ns/iter (+/- 7) = 609 MB/s
test sha512_1000  ... bench:       1,650 ns/iter (+/- 28) = 606 MB/s
test sha512_10000 ... bench:      16,533 ns/iter (+/- 1,540) = 604 MB/s

After:

test sha256_10    ... bench:           4 ns/iter (+/- 0) = 2500 MB/s
test sha256_100   ... bench:          46 ns/iter (+/- 0) = 2173 MB/s
test sha256_1000  ... bench:         424 ns/iter (+/- 6) = 2358 MB/s
test sha256_10000 ... bench:       4,190 ns/iter (+/- 31) = 2386 MB/s
test sha512_10    ... bench:           6 ns/iter (+/- 0) = 1666 MB/s
test sha512_100   ... bench:          65 ns/iter (+/- 0) = 1538 MB/s
test sha512_1000  ... bench:         636 ns/iter (+/- 5) = 1572 MB/s
test sha512_10000 ... bench:       6,311 ns/iter (+/- 68) = 1584 MB/s

(Benchmarks run on my M2 Air laptop, unplugged, on my kitchen table.)
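For reading the numbers above: the MB/s column is consistent with bytes per iteration times 1000, divided by ns/iter, truncated to an integer. The small sketch below reproduces that arithmetic and the resulting speedups; the function names `throughput_mb_s` and `speedup` are made up for illustration and are not part of the crate or the bench harness.

```rust
/// Throughput in MB/s from a bench line: bytes hashed per iteration
/// divided by nanoseconds per iteration, scaled by 1000 and truncated.
/// (Hypothetical helper, named here only for illustration.)
fn throughput_mb_s(bytes: u64, ns_per_iter: u64) -> u64 {
    bytes * 1000 / ns_per_iter
}

/// Speedup factor between two timings of the same workload.
/// (Hypothetical helper, named here only for illustration.)
fn speedup(before_ns: u64, after_ns: u64) -> f64 {
    before_ns as f64 / after_ns as f64
}
```

With the 10,000-byte rows, `throughput_mb_s(10_000, 4_190)` gives 2386 as listed, `speedup(27_392, 4_190)` is about 6.5 for SHA-256, and `speedup(16_533, 6_311)` is about 2.6 for SHA-512.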

tarcieri commented 1 year ago

Neat! I hadn't thought about using inline ASM as a sort of "polyfill" for using unstable intrinsics on stable Rust before.

This is something we should consider doing elsewhere that we use unstable aarch64 intrinsics, such as in the aes and polyval crates. Then, when the intrinsics are stabilized, we can delete the inline ASM and switch to the intrinsics.
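A minimal sketch of the polyfill pattern, assuming a plain `add` instruction as a stand-in for the SHA instructions (this is not the code from the PR): the inline asm lives behind a small function with the signature the intrinsic would have, so the body can later be swapped for the intrinsic without touching callers. The name `add_u64` and the portable fallback are hypothetical and exist only so the example builds on any target.

```rust
/// aarch64 path: one instruction wrapped in a function, the way an
/// intrinsic would be. `add` stands in for an instruction whose
/// intrinsic is not yet stable.
#[cfg(target_arch = "aarch64")]
fn add_u64(mut a: u64, b: u64) -> u64 {
    use core::arch::asm;
    // SAFETY: a register-to-register `add` touches only the two named
    // operands; it reads no memory, uses no stack, and leaves the
    // condition flags alone, which is what the options declare.
    unsafe {
        asm!(
            "add {0}, {0}, {1}",
            inout(reg) a,
            in(reg) b,
            options(pure, nomem, nostack, preserves_flags)
        );
    }
    a
}

/// Portable fallback so the sketch compiles on other targets.
#[cfg(not(target_arch = "aarch64"))]
fn add_u64(a: u64, b: u64) -> u64 {
    a.wrapping_add(b)
}
```

Callers just use `add_u64(2, 3)`. Declaring `pure` and `nomem` tells the compiler the block has no side effects, so it is free to reorder or deduplicate calls much as it would with a real intrinsic.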

newpavlov commented 1 year ago

Update the cross version to fix the CI failures.

codahale commented 1 year ago

> Update the cross version to fix the CI failures.

What should I bump it to?

newpavlov commented 1 year ago

> What should I bump it to?

1.59

newpavlov commented 1 year ago

Can you compare performance after adding the options? The resulting assembly is somewhat different. The number of instructions is the same, so hopefully it's only reordering, which may even improve performance.

codahale commented 1 year ago

No real difference.

Before:

test sha256_10    ... bench:           4 ns/iter (+/- 0) = 2500 MB/s
test sha256_100   ... bench:          46 ns/iter (+/- 2) = 2173 MB/s
test sha256_1000  ... bench:         421 ns/iter (+/- 5) = 2375 MB/s
test sha256_10000 ... bench:       4,155 ns/iter (+/- 44) = 2406 MB/s
test sha512_10    ... bench:           6 ns/iter (+/- 0) = 1666 MB/s
test sha512_100   ... bench:          65 ns/iter (+/- 0) = 1538 MB/s
test sha512_1000  ... bench:         636 ns/iter (+/- 15) = 1572 MB/s
test sha512_10000 ... bench:       6,317 ns/iter (+/- 54) = 1583 MB/s

After:

test sha256_10    ... bench:           4 ns/iter (+/- 0) = 2500 MB/s
test sha256_100   ... bench:          46 ns/iter (+/- 4) = 2173 MB/s
test sha256_1000  ... bench:         423 ns/iter (+/- 10) = 2364 MB/s
test sha256_10000 ... bench:       4,179 ns/iter (+/- 63) = 2392 MB/s
test sha512_10    ... bench:           6 ns/iter (+/- 0) = 1666 MB/s
test sha512_100   ... bench:          65 ns/iter (+/- 0) = 1538 MB/s
test sha512_1000  ... bench:         636 ns/iter (+/- 12) = 1572 MB/s
test sha512_10000 ... bench:       6,324 ns/iter (+/- 400) = 1581 MB/s

newpavlov commented 1 year ago

Thank you!

codahale commented 1 year ago

Now that this is merged, what’s the remaining work to drop the sha2_asm dependency?

newpavlov commented 1 year ago

We would need to migrate the x86-64 implementation from it to inline asm. IIRC it's still a bit faster than our software fallback.

codahale commented 1 year ago

I ask b/c I was running some related benchmarks on a GCE n2-standard-4 with Ice Lake and noticed there wasn’t much of a difference:

With asm:

$ RUSTFLAGS="-C target-cpu=native" cargo +nightly bench --features=asm

test sha256_10    ... bench:           9 ns/iter (+/- 0) = 1111 MB/s
test sha256_100   ... bench:          81 ns/iter (+/- 0) = 1234 MB/s
test sha256_1000  ... bench:         719 ns/iter (+/- 5) = 1390 MB/s
test sha256_10000 ... bench:       7,126 ns/iter (+/- 64) = 1403 MB/s
test sha512_10    ... bench:          21 ns/iter (+/- 0) = 476 MB/s
test sha512_100   ... bench:         201 ns/iter (+/- 1) = 497 MB/s
test sha512_1000  ... bench:       1,836 ns/iter (+/- 8) = 544 MB/s
test sha512_10000 ... bench:      17,769 ns/iter (+/- 61) = 562 MB/s

Without asm:

$ RUSTFLAGS="-C target-cpu=native" cargo +nightly bench

test sha256_10    ... bench:           9 ns/iter (+/- 0) = 1111 MB/s
test sha256_100   ... bench:          81 ns/iter (+/- 1) = 1234 MB/s
test sha256_1000  ... bench:         718 ns/iter (+/- 4) = 1392 MB/s
test sha256_10000 ... bench:       7,116 ns/iter (+/- 20) = 1405 MB/s
test sha512_10    ... bench:          21 ns/iter (+/- 0) = 476 MB/s
test sha512_100   ... bench:         201 ns/iter (+/- 1) = 497 MB/s
test sha512_1000  ... bench:       1,836 ns/iter (+/- 8) = 544 MB/s
test sha512_10000 ... bench:      17,819 ns/iter (+/- 100) = 561 MB/s

Definitely within the margin of error.

Maybe on a different CPU?

newpavlov commented 1 year ago

You are getting results for the SHA-NI and AVX2 backends (the asm backend is treated as a replacement for the software backend, thus it has lower priority). On my laptop after I disabled them I get:

// without asm:
test sha256_10    ... bench:          33 ns/iter (+/- 0) = 303 MB/s
test sha256_100   ... bench:         321 ns/iter (+/- 2) = 311 MB/s
test sha256_1000  ... bench:       3,131 ns/iter (+/- 6) = 319 MB/s
test sha256_10000 ... bench:      31,227 ns/iter (+/- 69) = 320 MB/s
test sha512_10    ... bench:          21 ns/iter (+/- 0) = 476 MB/s
test sha512_100   ... bench:         207 ns/iter (+/- 0) = 483 MB/s
test sha512_1000  ... bench:       2,017 ns/iter (+/- 5) = 495 MB/s
test sha512_10000 ... bench:      20,077 ns/iter (+/- 64) = 498 MB/s

// with asm:
test sha256_10    ... bench:          28 ns/iter (+/- 3) = 357 MB/s
test sha256_100   ... bench:         274 ns/iter (+/- 7) = 364 MB/s
test sha256_1000  ... bench:       2,671 ns/iter (+/- 23) = 374 MB/s
test sha256_10000 ... bench:      26,693 ns/iter (+/- 348) = 374 MB/s
test sha512_10    ... bench:          20 ns/iter (+/- 0) = 500 MB/s
test sha512_100   ... bench:         184 ns/iter (+/- 2) = 543 MB/s
test sha512_1000  ... bench:       1,809 ns/iter (+/- 14) = 552 MB/s
test sha512_10000 ... bench:      18,032 ns/iter (+/- 177) = 554 MB/s
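The priority order described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the crate's actual dispatch code: the `Backend` enum, `pick_backend`, and `sha256_hardware_available` are hypothetical names, the asm feature is modelled as a plain `bool`, and only SHA-256 is covered (the AVX2 backend is left out).

```rust
/// Hypothetical backend labels, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    /// CPU SHA instructions (SHA-NI on x86-64, SHA2 extension on aarch64).
    Hardware,
    /// External assembly, enabled by the `asm` feature.
    Asm,
    /// Portable software implementation.
    Soft,
}

/// Hardware support wins; asm only replaces the software fallback.
fn pick_backend(hardware_available: bool, asm_feature: bool) -> Backend {
    if hardware_available {
        Backend::Hardware
    } else if asm_feature {
        Backend::Asm
    } else {
        Backend::Soft
    }
}

/// Runtime detection using the standard library's feature macros.
#[cfg(target_arch = "x86_64")]
fn sha256_hardware_available() -> bool {
    std::arch::is_x86_feature_detected!("sha")
        && std::arch::is_x86_feature_detected!("sse4.1")
}

#[cfg(target_arch = "aarch64")]
fn sha256_hardware_available() -> bool {
    std::arch::is_aarch64_feature_detected!("sha2")
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn sha256_hardware_available() -> bool {
    false
}
```

On a CPU with SHA extensions, `pick_backend(sha256_hardware_available(), true)` and `pick_backend(sha256_hardware_available(), false)` both return `Backend::Hardware`, which is why toggling the asm feature made no measurable difference in the Ice Lake run.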

codahale commented 1 year ago

Ah, gotcha. Thanks for clearing that up.

RustCrypto / hashes

sha2: Add aarch64 backends for SHA2. #490