Open mayujsw opened 4 years ago
Not that I'm aware of in common use cases, but Samuel is the expert on this. In my head, the known areas of missing optimizations are:
b3sum
's default behavior, assuming the performance is close enough.I should add that while "produce 1 GB of output" is a very rare use case today, once it's fully optimized it might be tempting for someone to use BLAKE3 as a stream cipher, and maybe then it would become less rare :)
FYI, the missing SIMD output optimizations seem to be a bottleneck in applications like LtHash
LtHash takes a set of arbitrarily long elements as input, and produces a 2KB hash value as output. Two LtHash outputs can be “added” by breaking up each output into 16-bit chunks and performing component-wise vector addition modulo 2^16
Basically, individual item hashers produce 2KB outputs (4KB in LtHash32 which uses 32-bit chunks) which are treated as vectors to combine them homomorphically. For LtHash32 benchmarking, I've measured over 90% of the total runtime being spent in the output loop.
@sharifib until these XOF optimizations get added, a reasonable workaround could be to use BLAKE3 to generate a regular 32-byte hash, and to then use that hash as a stream key for ChaCha20 on the output side. Any fast ChaCha implementation already has the SIMD optimizations we're discussing here. This would also give you the option of using ChaCha12 or ChaCha8, depending on your opinion about their security margins. BLAKE has around the same security margin in this use case as a hypothetical "ChaCha14".
On Wed, Jun 8, 2022 at 8:58 AM sharifib @.***> wrote:
FYI, the missing SIMD output optimizations seem to be a bottleneck in applications like LtHash https://engineering.fb.com/2019/03/01/security/homomorphic-hashing/#:~:text=LtHash%20takes%20a%20set%20of%20arbitrarily%20long%20elements%20as%20input%2C%20and%20produces%20a%202KB%20hash%20value%20as%20output.%20Two%20LtHash%20outputs%20can%20be%20%E2%80%9Cadded%E2%80%9D%20by%20breaking%20up%20each%20output%20into%2016%2Dbit%20chunks%20and%20performing%20component%2Dwise%20vector%20addition%20modulo%20216.
LtHash takes a set of arbitrarily long elements as input, and produces a 2KB hash value as output. Two LtHash outputs can be “added” by breaking up each output into 16-bit chunks and performing component-wise vector addition modulo 2^16
Basically, individual item hashers produce 2KB outputs (4KB in LtHash32 which uses 32-bit chunks) which are treated as vectors to combine them homomorphically. For LtHash32 benchmarking, I've measured over 90% of the total runtime being spent in the output loop.
— Reply to this email directly, view it on GitHub https://github.com/BLAKE3-team/BLAKE3/issues/137#issuecomment-1150104656, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGSGBDLXFVLFNJGAMKDUKTVOC7L7ANCNFSM4UG637OA . You are receiving this because you commented.Message ID: @.***>
BLAKE3 performance is really so impressive, just wondering that besides current AVX512, whether there is any more optimizing space based on intel SIMD, like existing cryptographic related instructions. Thanks