Implement much faster sha256 and sha512.

0xdeafbeef commented 3 years ago

I took sha256 and sha512 variants from linux sources. On AMD Ryzen 9 5900HS comparing

cargo bench

with

RUSTFLAGS=-Ctarget-feature=+avx2,+aes cargo bench

gives such results:

sha256                  time:   [31.047 ns 31.065 ns 31.083 ns]                    
                        change: [-79.294% -79.275% -79.257%] (p = 0.00 < 0.05)
                        Performance has improved.

sha512                  time:   [135.58 ns 135.79 ns 136.01 ns]                   
                        change: [-34.078% -33.749% -33.500%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

Closes https://github.com/RustCrypto/asm-hashes/issues/5

tarcieri commented 3 years ago

Looks very interesting, thanks! Left some notes.

0xdeafbeef commented 3 years ago

BTW, is Cargo.lock required?

tarcieri commented 3 years ago

BTW, is Cargo.lock required?

What do you mean by that?

0xdeafbeef commented 3 years ago

BTW, is Cargo.lock required?

What do you mean by that? Why Cargo.lock is kept in library? To pin cc version?

tarcieri commented 3 years ago

@0xdeafbeef it makes the build deterministic, which makes it easier to spot problems arising from particular dependency changes.

It's something we do across the board, although perhaps there are repos like this one which it makes less sense for.

tarcieri commented 3 years ago

@0xdeafbeef did you say you compared the core::arch intrinsics version for SHA-NI to the ASM?

If they're the same speed (which is what I'd expect), then it probably doesn't make sense to include ASM SHA-NI support as we already have that case covered in pure Rust.

0xdeafbeef commented 3 years ago

@0xdeafbeef did you say you compared the core::arch intrinsics version for SHA-NI to the ASM?

If they're the same speed (which is what I'd expect), then it probably doesn't make sense to include ASM SHA-NI support as we already have that case covered in pure Rust.

Speed is the same. I think we should include it because if somebody uses asm feature, then he'll get much slower implementation then without it.

tarcieri commented 3 years ago

Since we already have the intrinsic code in the sha2 crate, we can detect the sha extension there and use it if available, only then falling back onto the asm if it isn't available, i.e. SHA-NI intrinsics should be a higher precedence than asm, which AFAIK is how it already works.

Otherwise, there is duplication of the feature across the sha2 and sha2-asm crates.

tarcieri commented 3 years ago

Hmm, build failure seems unrelated I think?

tarcieri commented 3 years ago

@0xdeafbeef can you rebase? I think #42 should've taken care of the build failures.

newpavlov commented 3 years ago

BTW could you also compare performance of the AVX2 based assembly with the intrinsics-based implementation from RustCrypto/hashes#312?

0xdeafbeef commented 3 years ago

asm

test bench1_10    ... bench:          20 ns/iter (+/- 2) = 500 MB/s
test bench2_100   ... bench:         164 ns/iter (+/- 10) = 609 MB/s
test bench3_1000  ... bench:       1,451 ns/iter (+/- 135) = 689 MB/s
test bench4_10000 ... bench:      14,165 ns/iter (+/- 1,319) = 705 MB/s

intrinsic

running 4 tests
test bench1_10    ... bench:          20 ns/iter (+/- 5) = 500 MB/s
test bench2_100   ... bench:         162 ns/iter (+/- 10) = 617 MB/s
test bench3_1000  ... bench:       1,408 ns/iter (+/- 159) = 710 MB/s
test bench4_10000 ... bench:      13,448 ns/iter (+/- 838) = 743 MB/s

Force soft.

running 4 tests
test bench1_10    ... bench:          23 ns/iter (+/- 4) = 434 MB/s
test bench2_100   ... bench:         196 ns/iter (+/- 23) = 510 MB/s
test bench3_1000  ... bench:       1,926 ns/iter (+/- 144) = 519 MB/s
test bench4_10000 ... bench:      18,350 ns/iter (+/- 1,070) = 544 MB/s

I think that asm version is not needed anymore. Good job, @Rexagon!

0xdeafbeef commented 3 years ago

After pinning to the same core asm

running 4 tests
test bench1_10    ... bench:          19 ns/iter (+/- 0) = 526 MB/s
test bench2_100   ... bench:         152 ns/iter (+/- 3) = 657 MB/s
test bench3_1000  ... bench:       1,339 ns/iter (+/- 28) = 746 MB/s
test bench4_10000 ... bench:      13,041 ns/iter (+/- 343) = 766 MB/s

intrinsic

running 4 tests
test bench1_10    ... bench:          19 ns/iter (+/- 0) = 526 MB/s
test bench2_100   ... bench:         148 ns/iter (+/- 3) = 675 MB/s
test bench3_1000  ... bench:       1,276 ns/iter (+/- 30) = 783 MB/s
test bench4_10000 ... bench:      12,420 ns/iter (+/- 275) = 805 MB/s

@newpavlov should I close pr?

newpavlov commented 3 years ago

Hm, I am not 100% sure. Some may prefer the assembly implementation from reliability point of view, since with an intrinsics-based implementation we at the mercy of the compiler and in some cases achieved performance can be brittle. From another point of view, people usually expect that an assembly implementation is faster than a "software" one.

@tarcieri What do you think?

tarcieri commented 3 years ago

Yeah, it's definitely a tradeoff. I think the biggest risk is actually miscompilation (see e.g. https://github.com/rust-lang/rust/issues/79865).

That said I'd weakly be in favor of an all-intrinsics approach if performance is comparable to assembly. I think that better fits the philosophy of "Rust Crypto", and unless there are big performance wins with ASM it's probably best avoided, at least within the crates we maintain.

A pure Rust approach solves a lot of problems, especially relating to portability. Relevant: https://github.com/RustCrypto/hashes/issues/315

newpavlov commented 3 years ago

I also lean towards the stance "assembly impls only for sufficient performance improvements", so I guess we can close this PR.

@0xdeafbeef Thank you for you contribution (at the very least I think it was a trigger for the AVX2 impl) and sorry this PR ended like this!

RustCrypto / asm-hashes

Implement much faster sha256 and sha512. #41