Closed 0xdeafbeef closed 3 years ago
Looks very interesting, thanks! Left some notes.
BTW, is Cargo.lock required?
What do you mean by that?
Why is Cargo.lock kept in a library? To pin the cc version?
@0xdeafbeef it makes the build deterministic, which makes it easier to spot problems arising from particular dependency changes.
It's something we do across the board, although perhaps there are repos like this one which it makes less sense for.
@0xdeafbeef did you say you compared the core::arch intrinsics version for SHA-NI to the ASM?
If they're the same speed (which is what I'd expect), then it probably doesn't make sense to include ASM SHA-NI support as we already have that case covered in pure Rust.
The speed is the same. I think we should include it, because otherwise anybody who enables the asm feature will get a much slower implementation than without it.
Since we already have the intrinsics code in the sha2 crate, we can detect the sha extension there and use it if available, falling back to the asm only if it isn't, i.e. SHA-NI intrinsics should take precedence over asm, which AFAIK is how it already works.
Otherwise, there is duplication of the feature across the sha2 and sha2-asm crates.
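For readers following along, the precedence described above can be sketched roughly as follows. This is only an illustration, not the actual sha2 internals: the Backend enum, select_backend and detect_sha_ni are hypothetical names, and the asm_enabled flag stands in for the asm cargo feature. The only real API used is std's is_x86_feature_detected! macro.

```rust
/// Hypothetical backend list, for illustration only (not sha2's real types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    /// SHA-NI through `core::arch` intrinsics.
    ShaNi,
    /// Hand-written assembly (what the `asm` feature would enable).
    Asm,
    /// Portable pure-Rust implementation.
    Soft,
}

/// Precedence rule: SHA-NI intrinsics first, then asm, then software.
/// Kept separate from CPU detection so it can be checked on any machine.
pub fn select_backend(has_sha_ni: bool, asm_enabled: bool) -> Backend {
    if has_sha_ni {
        Backend::ShaNi
    } else if asm_enabled {
        Backend::Asm
    } else {
        Backend::Soft
    }
}

/// Runtime check for the x86 SHA extensions.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn detect_sha_ni() -> bool {
    std::is_x86_feature_detected!("sha")
}

/// On other architectures the extension is never available.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn detect_sha_ni() -> bool {
    false
}
```

Calling select_backend(detect_sha_ni(), true) would then pick the intrinsics on a SHA-NI capable CPU even with asm enabled, which is the behaviour argued for here.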
Hmm, build failure seems unrelated I think?
@0xdeafbeef can you rebase? I think #42 should've taken care of the build failures.
BTW could you also compare performance of the AVX2 based assembly with the intrinsics-based implementation from RustCrypto/hashes#312?
asm
test bench1_10 ... bench: 20 ns/iter (+/- 2) = 500 MB/s
test bench2_100 ... bench: 164 ns/iter (+/- 10) = 609 MB/s
test bench3_1000 ... bench: 1,451 ns/iter (+/- 135) = 689 MB/s
test bench4_10000 ... bench: 14,165 ns/iter (+/- 1,319) = 705 MB/s
intrinsic
running 4 tests
test bench1_10 ... bench: 20 ns/iter (+/- 5) = 500 MB/s
test bench2_100 ... bench: 162 ns/iter (+/- 10) = 617 MB/s
test bench3_1000 ... bench: 1,408 ns/iter (+/- 159) = 710 MB/s
test bench4_10000 ... bench: 13,448 ns/iter (+/- 838) = 743 MB/s
Force soft.
running 4 tests
test bench1_10 ... bench: 23 ns/iter (+/- 4) = 434 MB/s
test bench2_100 ... bench: 196 ns/iter (+/- 23) = 510 MB/s
test bench3_1000 ... bench: 1,926 ns/iter (+/- 144) = 519 MB/s
test bench4_10000 ... bench: 18,350 ns/iter (+/- 1,070) = 544 MB/s
I think the asm version is not needed anymore. Good job, @Rexagon!
After pinning to the same core:
asm
running 4 tests
test bench1_10 ... bench: 19 ns/iter (+/- 0) = 526 MB/s
test bench2_100 ... bench: 152 ns/iter (+/- 3) = 657 MB/s
test bench3_1000 ... bench: 1,339 ns/iter (+/- 28) = 746 MB/s
test bench4_10000 ... bench: 13,041 ns/iter (+/- 343) = 766 MB/s
intrinsic
running 4 tests
test bench1_10 ... bench: 19 ns/iter (+/- 0) = 526 MB/s
test bench2_100 ... bench: 148 ns/iter (+/- 3) = 675 MB/s
test bench3_1000 ... bench: 1,276 ns/iter (+/- 30) = 783 MB/s
test bench4_10000 ... bench: 12,420 ns/iter (+/- 275) = 805 MB/s
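As a sanity check on the numbers above, the MB/s column appears to be plain integer arithmetic on the ns/iter column. The sketch below assumes each bench hashes 10, 100, 1000 or 10000 bytes per iteration (inferred from the bench names, not stated in the thread); throughput_mb_s and speedup_percent are hypothetical helper names.

```rust
/// Throughput in MB/s using truncating integer arithmetic:
/// bytes per iteration * 1000 / nanoseconds per iteration.
/// Returns None if `ns_per_iter` is zero.
pub fn throughput_mb_s(bytes_per_iter: u64, ns_per_iter: u64) -> Option<u64> {
    (bytes_per_iter * 1000).checked_div(ns_per_iter)
}

/// Relative speedup of a new timing over an old one, in percent.
/// Positive values mean the new timing is faster.
pub fn speedup_percent(old_ns: u64, new_ns: u64) -> f64 {
    (old_ns as f64 / new_ns as f64 - 1.0) * 100.0
}
```

With the pinned-core bench4_10000 figures, throughput_mb_s(10_000, 13_041) gives 766 and throughput_mb_s(10_000, 12_420) gives 805, matching the output, and speedup_percent(13_041, 12_420) comes out at about 5%, so the intrinsics version is at least on par with the asm here.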
@newpavlov should I close the PR?
Hm, I am not 100% sure. Some may prefer the assembly implementation from a reliability point of view, since with an intrinsics-based implementation we are at the mercy of the compiler, and in some cases the achieved performance can be brittle. On the other hand, people usually expect an assembly implementation to be faster than a "software" one.
@tarcieri What do you think?
Yeah, it's definitely a tradeoff. I think the biggest risk is actually miscompilation (see e.g. https://github.com/rust-lang/rust/issues/79865).
That said I'd weakly be in favor of an all-intrinsics approach if performance is comparable to assembly. I think that better fits the philosophy of "Rust Crypto", and unless there are big performance wins with ASM it's probably best avoided, at least within the crates we maintain.
A pure Rust approach solves a lot of problems, especially relating to portability. Relevant: https://github.com/RustCrypto/hashes/issues/315
I also lean towards the stance "assembly impls only for sufficient performance improvements", so I guess we can close this PR.
@0xdeafbeef Thank you for your contribution (at the very least I think it was a trigger for the AVX2 impl) and sorry this PR ended like this!
I took the sha256 and sha512 variants from the Linux sources. On an AMD Ryzen 9 5900HS, comparing
with
gives such results:
Closes https://github.com/RustCrypto/asm-hashes/issues/5