riptl opened this issue 9 months ago
Hashing many very short (< 128 byte) messages is an important use-case for Merkle trees and proof-of-work, both of which are major compute bottlenecks in transparent SNARKs (e.g. FRI-, Ligero-, or Basefold-like constructions).
It looks like the docs-hidden `Platform::hash_many` API can support this? I will try it.
@recmo Please note that the `hash_many` API is used to hash multiple chunks, not multiple messages. Internally, `hash_many` distributes chunks across SIMD lanes for high byte-per-cycle throughput. A chunk is a leaf node in the BLAKE3 hash tree; messages smaller than 1024 bytes have exactly one chunk. However, chunks are made up of a variable number of 64-byte blocks (`ceil(data_sz / 64)`). Finally, `hash_many` requires that each provided chunk has the same number of blocks.
So, `hash_many` will work for your use-case only if each message in your batch has the same number of 64-byte blocks (`ceil(message_sz / 64)`). Because this is a lower-level API, you'd need to figure out the flags and counter arguments yourself from the BLAKE3 paper.
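Given that same-block-count constraint, a caller can still batch arbitrary small messages by bucketing them on block count first. A minimal pure-std sketch of that bucketing step (illustrative only; `block_count` and `group_by_blocks` are hypothetical helpers, not part of the blake3 crate, and the empty-message-is-one-block convention follows BLAKE3's padding of the empty input):

```rust
use std::collections::BTreeMap;

/// Number of 64-byte blocks in a single-chunk (<= 1024 byte) message,
/// per the ceil(message_sz / 64) formula above. An empty message still
/// occupies one block. (Hypothetical helper, not a blake3 crate API.)
fn block_count(message_len: usize) -> usize {
    std::cmp::max(1, (message_len + 63) / 64)
}

/// Group message indices by block count, so that each group could then be
/// handed to a same-block-count batch primitive such as `hash_many`.
fn group_by_blocks(messages: &[&[u8]]) -> BTreeMap<usize, Vec<usize>> {
    let mut groups: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for (i, m) in messages.iter().enumerate() {
        groups.entry(block_count(m.len())).or_default().push(i);
    }
    groups
}

fn main() {
    // 128 B -> 2 blocks, 140 B -> 3 blocks, 64 B -> 1 block, empty -> 1 block
    let msgs: Vec<&[u8]> = vec![&[0u8; 128], &[0u8; 140], &[0u8; 64], &[]];
    for (blocks, idxs) in &group_by_blocks(&msgs) {
        println!("{} block(s): messages {:?}", blocks, idxs);
    }
}
```

Each resulting group satisfies the equal-block-count requirement, at the cost of one batch dispatch per distinct block count.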
My BLAKE3 rewrite linked above is more flexible: its `hash_many` function supports lane masking, allowing you to hash chunks with different numbers of blocks, e.g. a 128-byte message and a 140-byte message in the same batch (costly under AVX2 but effectively zero cost under AVX-512). Happy to dust it off and get it up to speed if needed.
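The lane-masking idea can be modelled in a few lines. This is a toy scalar model of the per-block-step schedule, not the actual SIMD kernel; `lane_masks` is a hypothetical name:

```rust
/// Toy model of lane masking: given per-lane block counts, return for each
/// block step the set of lanes still active at that step. Under AVX-512,
/// masked-off lanes are effectively free; under AVX2 the whole batch still
/// runs for max(block_counts) steps, so divergent lanes waste throughput.
fn lane_masks(block_counts: &[usize]) -> Vec<Vec<bool>> {
    let steps = block_counts.iter().copied().max().unwrap_or(0);
    (0..steps)
        .map(|t| block_counts.iter().map(|&b| t < b).collect())
        .collect()
}

fn main() {
    // A 128-byte message (2 blocks) alongside a 140-byte message (3 blocks):
    // lane 0 goes inactive after step 1, lane 1 runs through step 2.
    for (t, mask) in lane_masks(&[2, 3]).iter().enumerate() {
        println!("step {}: {:?}", t, mask);
    }
}
```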
cc @zookozcash: is there any renewed interest in supporting batched hashing of messages with different sizes?
Problem
The current blake3 crate leaves a lot of single-core performance on the table for message sizes below 8 KiB.
Namely, it doesn't SIMD-parallelize hashing for small messages.
As a PoC, I've rewritten a BLAKE3 scheduler from scratch with a modified AVX2 backend: https://github.com/firedancer-io/firedancer/tree/ripatel/fd_blake3/src/ballet/blake3
When hashing many independent 2 KiB messages concurrently, my implementation achieves 25 Gbps, while the C implementation achieves ~7 Gbps.
I would like to contribute back my changes to this official library. My code is Apache-2.0 licensed, so feel free to copy from it.
Suggested Changes
There are three major pieces required:
- A batched scheduler, which needs `log2(chunk_cnt) * simd_degree * 32` bytes of working space per hash state. The algorithm I came up with is unfortunately much more complex than the elegant stack-based one in the paper.
- A public batch API along the lines of `fn blake3_multi(messages: &[&[u8]]) -> Vec<[u8; 32]>`
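To make the working-space formula concrete, here is a hedged sketch that plugs in typical SIMD widths (8 lanes for AVX2, 16 for AVX-512); `scratch_bytes` is a hypothetical helper and assumes `chunk_cnt` is a power of two:

```rust
/// Working-space estimate from the formula above:
/// log2(chunk_cnt) * simd_degree * 32 bytes per hash state.
/// (Hypothetical helper; assumes chunk_cnt is a power of two so
/// log2 is exact. 32 is the size of one chaining value in bytes.)
fn scratch_bytes(chunk_cnt: usize, simd_degree: usize) -> usize {
    debug_assert!(chunk_cnt.is_power_of_two());
    let levels = usize::BITS as usize - 1 - chunk_cnt.leading_zeros() as usize;
    levels * simd_degree * 32
}

fn main() {
    // 64 chunks (a 64 KiB tree), AVX2 (8 lanes): 6 * 8 * 32 = 1536 bytes
    println!("AVX2:    {} bytes", scratch_bytes(64, 8));
    // Same tree, AVX-512 (16 lanes): 6 * 16 * 32 = 3072 bytes
    println!("AVX-512: {} bytes", scratch_bytes(64, 16));
}
```

Either way the footprint stays in the low kilobytes, so the extra scheduler state is cheap relative to the throughput gain.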