BLAKE3-team / BLAKE3

the official Rust and C implementations of the BLAKE3 cryptographic hash function
Apache License 2.0
4.71k stars 315 forks source link

Consider multi-message batching #386

Open riptl opened 4 months ago

riptl commented 4 months ago

Problem

The current blake3 crate leaves a lot of single core performance on the table for message sizes below 8 KiB.

Namely, it doesn't SIMD parallelize hashing for small messages.

As a PoC, I've rewritten a BLAKE3 scheduler from scratch with a modified AVX2 backend: https://github.com/firedancer-io/firedancer/tree/ripatel/fd_blake3/src/ballet/blake3

When hashing many independent 2 KiB messages concurrently, my implementation does 25 Gbps, while the C implementation does ~7 Gbps.

image

I would like to contribute back my changes to this official library. My code is Apache-2.0 licensed, so feel free to copy from it.

Suggested Changes

There are three major pieces required:

  1. Adapt the SIMD backends to do work off each lane independently, namely:
    • Support lane masking (if one AVX lane finishes hashing before another does) -- Currently, all lanes are always active
    • Support independent chunk counters and flags -- the current backend assumes a contiguous range of chunks or parents
  2. Adapt the scheduler to dispatch operations from in-flight hash calcuations to the SIMD backend concurrently
    • I'm not sure if the hash tree scheduling algorithm proposed in the BLAKE3 paper is capable of doing so.
    • When queueing operations for an in-flight hash state, and we're unable to have enough chunks to hash in parallel to meet the SIMD degree, the algorithm should yield to the next in-flight hash state, before actually starting to hash.
    • I've rewritten the scheduler from scratch, but it requires log2(chunk_cnt) * simd_degree * 32 working space per hash state. The algorithm I came up with is unfortunately much more complex than the elegant stack-based one in the paper.
  3. Adapt the high-level API to tell the scheduler if there are multiple in-flight hash states
    • The simplest way is a new function call: fn blake3_multi(messages: &[&[u8]]) -> Vec<[u8; 32]>
    • Another way is to use thread-local storage to keep track of streaming operations on the current thread
      s1 := Blake3::new()
      s2 := Blake3::new()
      s1.append("abcd");  // registers this append operation as a thread-local
      s2.append("1234"): 
      hash2 := s2.fini();  // finds that s1 is also queued via thread-locals, so hashes both s1 and s2
      hash1 := s1.fini();  // no-op! the result is already available