BLAKE3-team / BLAKE3-specs

The BLAKE3 paper: specifications, analysis, and design rationale
https://blake3.io

How can I tell if BLAKE3 is multi-threaded? #5

Closed · SteveBattista closed this issue 3 years ago

SteveBattista commented 3 years ago

Dear team, I'm working on a simple tool to test the speed of BLAKE3 against the SHA implementations. I have the code saved to GitHub. While the code is just as fast as b3sum, htop is not showing that multiple threads are being engaged. Could it be that it is simply that fast? Could you spend a second looking at my Cargo.toml to make sure that I have the proper features for multi-threading? Thank you in advance.

https://github.com/SteveBattista/hash_test

oconnor663 commented 3 years ago

Your code is single-threaded. I can see this because you're calling blake3::Hasher::update here. If you take a look at the update docs, you'll read:

This method is always single-threaded. For multi-threading support, see update_with_join below.

Note that the degree of SIMD parallelism that update can use is limited by the size of this input buffer. The 8 KiB buffer currently used by std::io::copy is enough to leverage AVX2, for example, but not enough to leverage AVX-512. A 16 KiB buffer is large enough to leverage all currently supported SIMD instruction sets.

The first line there is your answer. But the second part is also very relevant to your use case. It looks like you're using a 512-byte buffer, which is too small to take advantage of most of our SIMD optimizations.

So when you say "the code is just as fast as b3sum", that's not what I'd expect you to see. This could be for a few different reasons:

Speaking of running benchmarks in a loop, I notice that here you're running each hash function once, back-to-back. Everything depends on the details of what you're doing, but I expect that's very unlikely to give you meaningful, stable results. For example, in the common case where you're hashing a file that's not in cache yet, the first function in your loop is going to look artificially slow, because its performance is capped by disk read speed, while the subsequent loop iterations on the same file are not. Even if you avoid that issue, there are lots of sources of CPU performance noise that you have to worry about, and you'll find most benchmark suites average over many runs to try to account for that. A sketch of that approach follows.

SteveBattista commented 3 years ago

Thank you so much for the tip. I think I have fixed all of the issues. I'm getting about 2.8 GB/s on my development machine (specs are in the README).

oconnor663 commented 3 years ago

These numbers look plausible, yes. Note that your CPU does not support AVX2 (or the much more recent AVX-512), so you can expect this long-input benchmark measurement to double or triple on a more recent machine.