Any more optimizing space with intel SIMD

mayujsw commented 4 years ago

BLAKE3's performance is really impressive. I'm just wondering whether, beyond the current AVX-512 support, there is any more room for optimization with Intel SIMD, for example by using existing cryptography-related instructions. Thanks

oconnor663 commented 3 years ago

Not that I'm aware of in common use cases, but Samuel is the expert on this. In my head, the known areas of missing optimizations are:

The C implementation is not multithreaded.
The Rust implementation supports multithreading, but currently it works best in an all-at-once approach, like hashing an entire memory-mapped file in one go. An interleaved-from-the-front approach is also possible, with all the worker threads taking chunks from the start of the input in an alternating fashion, but we haven't implemented it yet. That would probably require some extra memory allocation, but it would solve some thrashing issues on spinning disks, and it might be easier than memory mapping for some applications to deploy. If/when we do implement this, it will probably become b3sum's default behavior, assuming the performance is close enough.
Extended outputs don't take full advantage of SIMD. When you hash a 1 GB file, we use whatever max-width SIMD vectors your processor supports, but if you want to produce 1 GB of output, those same optimizations don't get used. Instead the implementation will just call the one-block compression function in a loop. There's no particular barrier to vectorizing output just like we've done with input, but it's a very rare use case, and we haven't gotten around to it.
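
To make the last point concrete: each 64-byte block of extended output depends only on the final state and a block counter, so the blocks are independent of one another and could be computed several at a time. Below is a minimal sketch of that structure, not the real thing. It uses keyed BLAKE2b from Python's `hashlib` as a stand-in for the BLAKE3 compression function (so the bytes it produces are not BLAKE3 output), and the names `output_block`, `xof_serial`, and `xof_batched` are hypothetical.

```python
import hashlib

BLOCK_LEN = 64  # BLAKE3 produces extended output in 64-byte blocks


def output_block(root_state: bytes, counter: int) -> bytes:
    """Stand-in for the one-block compression function (NOT real BLAKE3).

    The only property that matters here is that the block is a function
    of (root_state, counter) and nothing else.
    """
    return hashlib.blake2b(
        counter.to_bytes(8, "little"), key=root_state, digest_size=BLOCK_LEN
    ).digest()


def xof_serial(root_state: bytes, length: int) -> bytes:
    """The loop described above: one block per call, one call at a time."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += output_block(root_state, counter)
        counter += 1
    return bytes(out[:length])


def xof_batched(root_state: bytes, length: int, lanes: int = 8) -> bytes:
    """Same output, produced `lanes` blocks per step.

    In a SIMD implementation each batch would be a single vectorized
    compression call; here it is just a list comprehension.
    """
    n_blocks = -(-length // BLOCK_LEN)  # ceiling division
    out = bytearray()
    for start in range(0, n_blocks, lanes):
        stop = min(start + lanes, n_blocks)
        out += b"".join(output_block(root_state, c) for c in range(start, stop))
    return bytes(out[:length])
```

Because no block depends on the previous one, `xof_serial` and `xof_batched` return identical bytes for any `lanes` value, which is what makes vectorizing the output side possible.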

oconnor663 commented 3 years ago

I should add that while "produce 1 GB of output" is a very rare use case today, once it's fully optimized it might be tempting for someone to use BLAKE3 as a stream cipher, and maybe then it would become less rare :)

sharifib commented 2 years ago

FYI, the missing SIMD output optimizations seem to be a bottleneck in applications like LtHash (https://engineering.fb.com/2019/03/01/security/homomorphic-hashing/):

LtHash takes a set of arbitrarily long elements as input, and produces a 2KB hash value as output. Two LtHash outputs can be “added” by breaking up each output into 16-bit chunks and performing component-wise vector addition modulo 2^16.

Basically, individual item hashers produce 2KB outputs (4KB in LtHash32 which uses 32-bit chunks) which are treated as vectors to combine them homomorphically. For LtHash32 benchmarking, I've measured over 90% of the total runtime being spent in the output loop.
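
For reference, the combining step described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and little-endian chunk encoding is an assumption, not something stated in this thread. The per-element 2KB (or 4KB) inputs would come from the hash's extended output, which is the part that dominates the runtime.

```python
def _combine(a: bytes, b: bytes, chunk_bits: int, sign: int) -> bytes:
    """Component-wise a + sign*b modulo 2**chunk_bits over fixed-width chunks."""
    if chunk_bits % 8 != 0:
        raise ValueError("chunk_bits must be a multiple of 8")
    width = chunk_bits // 8
    if len(a) != len(b) or len(a) % width != 0:
        raise ValueError("inputs must have equal length, a multiple of the chunk width")
    mask = (1 << chunk_bits) - 1
    out = bytearray(len(a))
    for i in range(0, len(a), width):
        # Assumption: chunks are encoded little-endian.
        x = int.from_bytes(a[i:i + width], "little")
        y = int.from_bytes(b[i:i + width], "little")
        out[i:i + width] = ((x + sign * y) & mask).to_bytes(width, "little")
    return bytes(out)


def lthash_add(a: bytes, b: bytes, chunk_bits: int = 16) -> bytes:
    """Add two LtHash-style vectors (16-bit chunks by default, 32 for LtHash32)."""
    return _combine(a, b, chunk_bits, +1)


def lthash_sub(a: bytes, b: bytes, chunk_bits: int = 16) -> bytes:
    """Remove b's contribution from a; the inverse of lthash_add."""
    return _combine(a, b, chunk_bits, -1)
```

Addition is commutative and invertible, so elements can be added to or removed from the set hash in any order.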

oconnor663 commented 2 years ago

@sharifib until these XOF optimizations get added, a reasonable workaround could be to use BLAKE3 to generate a regular 32-byte hash, and to then use that hash as a stream key for ChaCha20 on the output side. Any fast ChaCha implementation already has the SIMD optimizations we're discussing here. This would also give you the option of using ChaCha12 or ChaCha8, depending on your opinion about their security margins. BLAKE3 has around the same security margin in this use case as a hypothetical "ChaCha14".
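
A rough sketch of the shape of that workaround, for illustration only. It is a plain, unoptimized ChaCha block function following the RFC 8439 state layout, with a hypothetical `expand_seed` helper on top; a real deployment would use an existing SIMD-optimized ChaCha library instead. The 32-byte `seed` is assumed to be the regular BLAKE3 hash of the element, computed elsewhere, and the all-zero nonce is an assumption that is only reasonable because each seed is used as a key for exactly one output stream.

```python
import struct

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & MASK32


def _quarter_round(s: list, a: int, b: int, c: int, d: int) -> None:
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = _rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = _rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = _rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = _rotl32(s[b] ^ s[c], 7)


def chacha_block(key: bytes, counter: int, nonce: bytes, rounds: int = 20) -> bytes:
    """One 64-byte ChaCha block (32-byte key, 32-bit counter, 12-byte nonce)."""
    if len(key) != 32 or len(nonce) != 12 or rounds % 2 != 0:
        raise ValueError("need a 32-byte key, 12-byte nonce, even round count")
    state = list(struct.unpack("<4I", b"expand 32-byte k"))
    state += struct.unpack("<8I", key)
    state.append(counter & MASK32)
    state += struct.unpack("<3I", nonce)
    w = state[:]
    for _ in range(rounds // 2):
        # Column round.
        _quarter_round(w, 0, 4, 8, 12)
        _quarter_round(w, 1, 5, 9, 13)
        _quarter_round(w, 2, 6, 10, 14)
        _quarter_round(w, 3, 7, 11, 15)
        # Diagonal round.
        _quarter_round(w, 0, 5, 10, 15)
        _quarter_round(w, 1, 6, 11, 12)
        _quarter_round(w, 2, 7, 8, 13)
        _quarter_round(w, 3, 4, 9, 14)
    return struct.pack("<16I", *((x + y) & MASK32 for x, y in zip(w, state)))


def expand_seed(seed: bytes, length: int, rounds: int = 20) -> bytes:
    """Hypothetical helper: stretch a 32-byte seed to `length` bytes of keystream."""
    nonce = bytes(12)  # assumption: fixed nonce, one stream per seed
    n_blocks = -(-length // 64)  # ceiling division
    stream = b"".join(chacha_block(seed, i, nonce, rounds) for i in range(n_blocks))
    return stream[:length]
```

For an LtHash-style use, `expand_seed(seed, 2048)` would take the place of the 2KB extended output, and passing `rounds=12` or `rounds=8` selects the reduced-round variants mentioned above. Note that the resulting bytes differ from BLAKE3's own extended output, so both sides of any comparison have to use the same construction.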

BLAKE3-team / BLAKE3

Any more optimizing space with intel SIMD #137