Alg0ritmus opened this issue 2 years ago
You can control the number of threads with the RAYON_NUM_THREADS environment variable, or you can use Rayon's ThreadPoolBuilder API in code (see the sketch below). If you're using b3sum on the command line, it provides a --num-threads option, and you might want to take a look at how b3sum implements that option.

Speaking of doing weird things with the output offset :) you might also be interested in this recent paper: https://eprint.iacr.org/2022/283. There's now a security note about it.
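For example, something like this runs update_rayon inside a fixed-size pool. (A minimal sketch, assuming the blake3 crate with its rayon feature enabled; the 4-thread pool is just an example.)

```rust
use blake3::Hasher;

fn main() {
    let input = vec![0u8; 1 << 20]; // example input, ~1 MiB of zeros

    // Build a private Rayon pool with exactly 4 threads. Rayon calls made
    // inside pool.install(...) run on this pool instead of the global one.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .build()
        .unwrap();

    let hash = pool.install(|| {
        let mut hasher = Hasher::new();
        hasher.update_rayon(&input);
        hasher.finalize()
    });

    println!("{}", hash);
}
```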
Thank you so much, this really helped me. Hopefully I'll be able to deal with the threads. BTW, one side question: is it "normal" that the PyO3 binding gives me better performance than the Rust implementation? (Not a big difference, but overall PyO3 does better most of the time.)
That's an interesting question, and I've seen the same thing from time to time. The Python bindings wrap the Rust implementation, so there shouldn't really be any way for them to be faster, but there are lots of little quirks that can show up in benchmarks. Of course, all Rust code needs to be compiled with --release
for benchmarking. There are also cases where Python will do more caching than you might think. But I suspect the culprit is often inscrutable memory alignment effects, especially if you're looking into the second or third decimal place of performance.
Note that BLAKE3's own benchmark suite (cargo +nightly bench
) randomizes input memory offsets for this reason. Comparing those with benchmarks that read from "whatever offset the allocator happened to give" can be quirky.
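The idea is roughly this. (A simplified sketch using the rand crate for illustration; this is the concept, not the actual benchmark code.)

```rust
use rand::Rng;

fn hash_at_random_offset(len: usize) -> blake3::Hash {
    let mut rng = rand::thread_rng();
    let slack = 64; // room to slide the start of the input around
    let buf = vec![0u8; len + slack];
    // Pick a different starting offset each call, so measured throughput
    // doesn't silently depend on the allocator's alignment choices.
    let offset = rng.gen_range(0..slack);
    blake3::hash(&buf[offset..offset + len])
}
```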
Also another major source of quirkiness is TurboBoost and its related CPU downclocking behaviors. If one benchmark heats up the CPU more for unrelated reasons, e.g. shorter pauses between iterations, that can lead to more downclocking and lower measured performance. Disabling TurboBoost is almost always a good idea, at least as a point of comparison, and especially when you're trying to nail down suspicious quirks.
Well, I have to say that I did all my measurements with Turbo Boost enabled, so that's probably the problem. Thanks!
Another issue that applies only to incremental hashing (so probably not your benchmarks) is using a too-small buffer size, as documented in the "performance note" here. The implementation can only parallelize over input that it can see, and in a buffered .update()
loop that means the size of the buffer rather than the size of the whole input. Unfortunately, some common APIs like std::io::copy
use a buffer size that's too small to take advantage of the latest and greatest SIMD instructions on x86. Where it applies, though, this tends to produce large throughput differences, like a factor of two or more; it's not something that would explain ±1% differences.
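If you ever do incremental hashing, something like this avoids the small-buffer pitfall. (A sketch; the 64 KiB buffer is just an example size, check the crate docs for current guidance.)

```rust
use std::fs::File;
use std::io::Read;

fn hash_file(path: &str) -> std::io::Result<blake3::Hash> {
    let mut file = File::open(path)?;
    let mut hasher = blake3::Hasher::new();
    // A buffer this small would limit how much input each update() call
    // can see; 64 KiB gives the SIMD/parallel code more to work with.
    let mut buf = vec![0u8; 64 * 1024];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.update(&buf[..n]);
    }
    Ok(hasher.finalize())
}
```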
I disabled TurboBoost, but the results are basically the same:

Python (Hasher.update(file, multithreading=True)): 0.148s (best of 5)
Rust (update_rayon(file), under --release): 0.186s (best of 5)

The file is ~1GB.
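On the Rust side the measurement is roughly this shape. (A simplified sketch of the best-of-5 loop, assuming the blake3 crate with its rayon feature; not the exact code from my repo.)

```rust
use std::time::{Duration, Instant};

// Run the multithreaded hash n times and keep the best (smallest) time.
fn best_of(n: usize, input: &[u8]) -> Duration {
    (0..n)
        .map(|_| {
            let start = Instant::now();
            let mut hasher = blake3::Hasher::new();
            hasher.update_rayon(input);
            let _hash = hasher.finalize();
            start.elapsed()
        })
        .min()
        .expect("n must be > 0")
}
```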
That's a pretty big difference. Can I run the benchmarks myself and take a look? Maybe put them in a Gist or similar?
Sure, here's my repo. The test file can be downloaded from the link in the readme.
When I run that myself, here's what I see with TB off, best of 10:
Rust: 119.75ms
Python: 118.64ms
Similar results with TB on, too. So Python is consistently a hair faster, not sure why, but it's definitely not the ~25% difference you're seeing. Not sure what else it could be. (Are you on ARM? AArch64? Is there any chance one side is activating NEON and the other isn't?)
Actually no, I'm using an Intel i5-8250U. (I need to remove the neon feature from my Cargo.toml.) I really don't know what the problem could be.
I don't see any significant difference on a similar machine to that i5-8250U. If anything, Python loses out most of the time by a tiny bit.
multithreading=True requires the older 0.2.1 blake3-py. Maybe the difference is between whatever <= 0.2.1 did and what 0.3.1 does?
I've tried small benchmarks with v0.2.1 (the one I used before) and v0.3.1, and the results are very similar:

v0.2.1: 0.1398s (best of 5, tested again)
v0.3.1: 0.1394s (best of 5)
Rust: 0.1423s (best of 10)

The difference is even smaller now (I really don't know why), but Python is still doing better most of the time.
Hi, I'm currently writing my thesis and I'd like to get answers to a few questions. Thanks for any help.

1) Is there a convenient way (method) to hash data using a particular number of threads, like an update_rayon() that takes a thread count? (e.g. hashing data using just 2 threads, 4 threads, ...)

2) What is the limiting factor for the input/output size (2^64 bytes)? At first I thought it was the counter parameters (t0, t1), but I guess that's not the case.

3) In the paper (https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf), in section 5.3 (SIMD, first approach), you mention that the diagonalization step applies a rotation of rows. My question is: is it a right or a left rotation?