Consider multi-message batching

riptl commented 9 months ago

Problem

The current blake3 crate leaves a lot of single core performance on the table for message sizes below 8 KiB.

Namely, it doesn't SIMD parallelize hashing for small messages.

As a PoC, I've rewritten a BLAKE3 scheduler from scratch with a modified AVX2 backend: https://github.com/firedancer-io/firedancer/tree/ripatel/fd_blake3/src/ballet/blake3

When hashing many independent 2 KiB messages concurrently, my implementation does 25 Gbps, while the C implementation does ~7 Gbps.

I would like to contribute back my changes to this official library. My code is Apache-2.0 licensed, so feel free to copy from it.

Suggested Changes

There are three major pieces required:

Adapt the SIMD backends to do work off each lane independently, namely:
- Support lane masking (if one AVX lane finishes hashing before another does) -- Currently, all lanes are always active
- Support independent chunk counters and flags -- the current backend assumes a contiguous range of chunks or parents
Adapt the scheduler to dispatch operations from in-flight hash calcuations to the SIMD backend concurrently
- I'm not sure if the hash tree scheduling algorithm proposed in the BLAKE3 paper is capable of doing so.
- When queueing operations for an in-flight hash state, and we're unable to have enough chunks to hash in parallel to meet the SIMD degree, the algorithm should yield to the next in-flight hash state, before actually starting to hash.
- I've rewritten the scheduler from scratch, but it requires log2(chunk_cnt) * simd_degree * 32 working space per hash state. The algorithm I came up with is unfortunately much more complex than the elegant stack-based one in the paper.

Adapt the high-level API to tell the scheduler if there are multiple in-flight hash states

The simplest way is a new function call: fn blake3_multi(messages: &[&[u8]]) -> Vec<[u8; 32]>

Another way is to use thread-local storage to keep track of streaming operations on the current thread

s1 := Blake3::new()
s2 := Blake3::new()
s1.append("abcd");  // registers this append operation as a thread-local
s2.append("1234"): 
hash2 := s2.fini();  // finds that s1 is also queued via thread-locals, so hashes both s1 and s2
hash1 := s1.fini();  // no-op! the result is already available

recmo commented 2 months ago

Hashing many very short (< 128 byte) messages is a very important use-case for Merkle-trees and Proof-of-Work, both are major compute bottlenecks in transparent SNARKs (e.g. FRI/Ligero/Basefold like).

It looks like the docs-hidden api Platform::hash_many can support this? I will try this.

ripatel-fd commented 2 months ago

@recmo Please note the hash_many API is used to hash multiple chunks, not multiple messages. Internally, hash_many distributes chunks across SIMD lanes for high byte-per-cycle throughput. A chunk is a leaf node in the BLAKE3 hash tree. Messages smaller than 1024 bytes have exactly one chunk. However, chunks are made up of a variable number of blocks (ceil(data_sz/64)). Finally, hash_many requires that each provided chunk has the same number of blocks.

So, hash_many will work for your use-case only if each message in your batch has the same number of 64-byte blocks (ceil(message_sz/64)). Because this is a lower-level API, you'd need to figure out the flags and counter arguments yourself from the BLAKE3 paper.

My BLAKE3 rewrite linked above is more flexible:

My equivalent of the hash_many function supports lane masking, allowing you to hash chunks with different number of blocks, e.g. if you have a 128 byte message and a 140 byte message in the same batch (costly under AVX2 but effectively zero cost under AVX512)
It provides a high-level API and generic scheduler for batch hashing for arbitrary message sizes where you don't have to hand-specify flags, making it useful for >1024 byte messages (the scheduling logic makes it a bit slower)

Happy to dust it off and get it up to speed if needed.

ripatel-fd commented 2 months ago

cc @zookozcash is there any renewed interest in supporting batched hashing of messages with different sizes?

BLAKE3-team / BLAKE3

Consider multi-message batching #386