BLAKE3-team / BLAKE3

the official Rust and C implementations of the BLAKE3 cryptographic hash function
Apache License 2.0
5.23k stars 354 forks source link

Any more optimizing space with intel SIMD #137

Open mayujsw opened 4 years ago

mayujsw commented 4 years ago

BLAKE3 performance is really so impressive, just wondering that besides current AVX512, whether there is any more optimizing space based on intel SIMD, like existing cryptographic related instructions. Thanks

oconnor663 commented 3 years ago

Not that I'm aware of in common use cases, but Samuel is the expert on this. In my head, the known areas of missing optimizations are:

oconnor663 commented 3 years ago

I should add that while "produce 1 GB of output" is a very rare use case today, once it's fully optimized it might be tempting for someone to use BLAKE3 as a stream cipher, and maybe then it would become less rare :)

sharifib commented 2 years ago

FYI, the missing SIMD output optimizations seem to be a bottleneck in applications like LtHash

LtHash takes a set of arbitrarily long elements as input, and produces a 2KB hash value as output. Two LtHash outputs can be “added” by breaking up each output into 16-bit chunks and performing component-wise vector addition modulo 2^16

Basically, individual item hashers produce 2KB outputs (4KB in LtHash32 which uses 32-bit chunks) which are treated as vectors to combine them homomorphically. For LtHash32 benchmarking, I've measured over 90% of the total runtime being spent in the output loop.

oconnor663 commented 2 years ago

@sharifib until these XOF optimizations get added, a reasonable workaround could be to use BLAKE3 to generate a regular 32-byte hash, and to then use that hash as a stream key for ChaCha20 on the output side. Any fast ChaCha implementation already has the SIMD optimizations we're discussing here. This would also give you the option of using ChaCha12 or ChaCha8, depending on your opinion about their security margins. BLAKE has around the same security margin in this use case as a hypothetical "ChaCha14".

On Wed, Jun 8, 2022 at 8:58 AM sharifib @.***> wrote:

FYI, the missing SIMD output optimizations seem to be a bottleneck in applications like LtHash https://engineering.fb.com/2019/03/01/security/homomorphic-hashing/#:~:text=LtHash%20takes%20a%20set%20of%20arbitrarily%20long%20elements%20as%20input%2C%20and%20produces%20a%202KB%20hash%20value%20as%20output.%20Two%20LtHash%20outputs%20can%20be%20%E2%80%9Cadded%E2%80%9D%20by%20breaking%20up%20each%20output%20into%2016%2Dbit%20chunks%20and%20performing%20component%2Dwise%20vector%20addition%20modulo%20216.

LtHash takes a set of arbitrarily long elements as input, and produces a 2KB hash value as output. Two LtHash outputs can be “added” by breaking up each output into 16-bit chunks and performing component-wise vector addition modulo 2^16

Basically, individual item hashers produce 2KB outputs (4KB in LtHash32 which uses 32-bit chunks) which are treated as vectors to combine them homomorphically. For LtHash32 benchmarking, I've measured over 90% of the total runtime being spent in the output loop.

— Reply to this email directly, view it on GitHub https://github.com/BLAKE3-team/BLAKE3/issues/137#issuecomment-1150104656, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGSGBDLXFVLFNJGAMKDUKTVOC7L7ANCNFSM4UG637OA . You are receiving this because you commented.Message ID: @.***>