BLAKE3-team / BLAKE3-specs

The BLAKE3 paper: specifications, analysis, and design rationale
https://blake3.io

How can I tell if BLAKE3 is multi-threaded? #5

Closed · SteveBattista closed this issue 3 years ago

SteveBattista commented 3 years ago

Dear team, I'm working on a simple tool to test the speed of BLAKE3 against the SHA implementations. I have the code saved to GitHub. While the code is just as fast as b3sum, htop is not showing that multiple threads are being engaged. Could it be that it is simply that fast? Could you spend a second looking at my Cargo.toml to make sure that I have the proper features for multi-threading? Thank you in advance.

https://github.com/SteveBattista/hash_test

oconnor663 commented 3 years ago

Your code is single-threaded. I can see this because you're calling blake3::Hasher::update here. If you take a look at the update docs, you'll read:

This method is always single-threaded. For multi-threading support, see update_with_join below.

Note that the degree of SIMD parallelism that update can use is limited by the size of this input buffer. The 8 KiB buffer currently used by std::io::copy is enough to leverage AVX2, for example, but not enough to leverage AVX-512. A 16 KiB buffer is large enough to leverage all currently supported SIMD instruction sets.

The first line there is your answer. But the second part is also very relevant to your use case. It looks like you're using a 512-byte buffer, which is too small to take advantage of most of our SIMD optimizations.

So when you say "the code is just as fast as b3sum", that's not what I'd expect you to see. This could be for a few different reasons:

Speaking of running benchmarks in a loop, I notice that here you're running each hash function once, back-to-back. Everything depends on the details of what you're doing, but I expect that's very unlikely to give you meaningful, stable results. For example, in the common case where you're hashing a file that's not in cache yet, the first function in your loop is going to look artificially slow, because its performance is capped by disk read speed, while the subsequent loop iterations on the same file are not. Even if you avoid that issue, there are lots of sources of CPU performance noise that you have to worry about, and you'll find most benchmark suites average over many runs to try to account for that. A sketch of that approach follows.

SteveBattista commented 3 years ago

Thank you so much for the tip. I think I have fixed all of the issues. I'm getting about 2.8 GB/s on my development machine (specs are in the README).

oconnor663 commented 3 years ago

These numbers look plausible, yes. Note that your CPU does not support AVX2 (or the much more recent AVX-512), so you can expect this long-input benchmark measurement to double or triple on a more recent machine.