BLAKE3-team / BLAKE3

the official Rust and C implementations of the BLAKE3 cryptographic hash function

I need to clarify some things #241

Open · Alg0ritmus opened this issue 2 years ago

Alg0ritmus commented 2 years ago

Hi, I'm currently writing my thesis and I'd like to know the answers to a few questions. Thanks for any help.

1) Is there a convenient way (method) to hash data using a particular number of threads, like a version of update_rayon() that takes the number of threads? (e.g. hashing data using just 2 threads, 4 threads, ...)

2) What is the limiting factor for the input/output size (2^64 bytes)? At first I thought it was the parameter (t0, t1), but I guess that's not the case.

3) In the paper (https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf), in section 5.3 (SIMD, first approach), you mention that the diagonalization step applies a rotation to the rows. My question is: is it a right or a left rotation?

oconnor663 commented 2 years ago
  1. BLAKE3 uses the Rayon threadpool, and Rayon provides several different ways to configure the number of threads used or to run things inside a custom pool. The simplest way is to set the RAYON_NUM_THREADS environment variable. Or you can use their ThreadPoolBuilder API in code (see the first sketch below this list). If you're using b3sum on the command line, it provides a --num-threads option, and you might want to take a look at how b3sum implements that option.
  2. There isn't any particular check stopping you from consuming or producing more than 2^64 bytes. Part of the story is that it's unreasonably expensive and/or pointless for anyone to do this in practice, so it doesn't much matter what happens. (True UB wouldn't be ok, but crashing safely or producing an inconsistent/garbage answer are valid options.) One interesting case is on the output side, where a perverse caller might seek to 2^64 bytes and then keep incrementing the internal counter. Since that's a block counter and not a byte counter, you can actually go to something like 2^70 bytes in practice. Rust's default arithmetic behavior means that would eventually, after an extraordinary amount of CPU work, panic in debug mode or wrap around in release mode. This sort of thing is sensitive to implementation details, and it tends to change under refactoring and e.g. in optimized assembly implementations, so we document that the behavior is unspecified. At a high level, even if we wanted to guarantee some specific behavior in these cases, it's likely that implementations in other languages would have different behavior, and callers should avoid relying on any of this.
  3. To be clear, "rotation" in the diagonalization step acts on the four words in each row, which is different from the (rightward) "rotation" operation in the G function that acts on the bits within words. So in the diagonalization step, a rotation of 3 positions to the right is the same as 1 position to the left, and you can think about it either way (see the second sketch below). The important thing is just that the four diagonals are arranged in columns, and there's actually some interesting flexibility related to which columns they wind up in. See this lengthy comment on an in-progress branch of mine, as well as the BLAKE2 discussion that it links to. (Side note: @veorq likes to joke about how BLAKE/2/3's bit rotations go right but ChaCha's go left. Believe it or not, this was originally an accident.)
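
On point 1, here's a minimal sketch of both approaches, assuming the blake3 crate with its rayon feature enabled (the pool size and input are made up for illustration):

```rust
// Cargo.toml (assumed): blake3 = { version = "1", features = ["rayon"] }, rayon = "1"

fn main() {
    let data = vec![0u8; 1 << 20]; // 1 MiB of zeros, a stand-in for real input

    // Option A: use Rayon's global pool, whose size can be controlled
    // with the RAYON_NUM_THREADS environment variable.
    let mut hasher = blake3::Hasher::new();
    hasher.update_rayon(&data);
    println!("global pool: {}", hasher.finalize());

    // Option B: pin the work to a custom pool with exactly 2 threads.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(2)
        .build()
        .unwrap();
    let hash = pool.install(|| {
        let mut hasher = blake3::Hasher::new();
        hasher.update_rayon(&data);
        hasher.finalize()
    });
    println!("2-thread pool: {}", hash);
}
```

And on point 3, a toy scalar illustration of the row rotation. In the real SIMD code each "row" is a vector and the rotation is a lane shuffle, so take this only as a sketch of the indexing:

```rust
/// Rotate the four words of one row of the 4x4 state left by `n` positions.
/// Rotating left by n is the same as rotating right by 4 - n.
fn rotate_row_left(state: &mut [u32; 16], row: usize, n: usize) {
    state[row * 4..row * 4 + 4].rotate_left(n);
}

fn diagonalize(state: &mut [u32; 16]) {
    rotate_row_left(state, 1, 1); // == rotate right by 3
    rotate_row_left(state, 2, 2); // == rotate right by 2
    rotate_row_left(state, 3, 3); // == rotate right by 1
}

fn main() {
    // Fill the state with its own indices so the word movement is visible.
    let mut state: [u32; 16] = core::array::from_fn(|i| i as u32);
    diagonalize(&mut state);
    // The first column now holds the main diagonal: 0, 5, 10, 15.
    assert_eq!([state[0], state[4], state[8], state[12]], [0, 5, 10, 15]);
}
```
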
oconnor663 commented 2 years ago

Speaking of doing weird things with the output offset :) you might also be interested in this recent paper: https://eprint.iacr.org/2022/283. There's now a security note about it.
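
For anyone following along, the "output offset" is a position in BLAKE3's extendable output stream. Here's a minimal sketch of seeking around in it with the Rust crate's OutputReader, which implements std::io::Seek; the input and the offset are arbitrary:

```rust
use std::io::{Read, Seek, SeekFrom};

fn main() {
    let mut hasher = blake3::Hasher::new();
    hasher.update(b"example input");
    let mut reader = hasher.finalize_xof();

    // Read 32 bytes from the start of the output stream...
    let mut start = [0u8; 32];
    reader.read_exact(&mut start).unwrap();

    // ...then jump to a huge byte offset and read again. The offset is a
    // u64 of bytes, backed internally by the block counter discussed above.
    reader.seek(SeekFrom::Start(1u64 << 40)).unwrap();
    let mut far = [0u8; 32];
    reader.read_exact(&mut far).unwrap();
    assert_ne!(start, far);
}
```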

Alg0ritmus commented 2 years ago

Thank you so much, this really helped me. Hopefully I'll be able to deal with the threads. BTW, one side question: is it "normal" that the PyO3 binding gives me better performance than the Rust implementation? (Not a big difference, but overall PyO3 does better most of the time.)

oconnor663 commented 2 years ago

That's an interesting question, and I've seen the same thing from time to time. The Python bindings wrap the Rust implementation, so there shouldn't really be any way for them to be faster, but there are lots of little quirks that can show up in benchmarks. Of course, all Rust code needs to be compiled with --release for benchmarking. There are also cases where Python will do more caching than you might think. But I suspect the culprit is often inscrutable memory alignment effects, especially if you're looking into the second or third decimal place of performance.

oconnor663 commented 2 years ago

Note that BLAKE3's own benchmark suite (cargo +nightly bench) randomizes input memory offsets for this reason. Comparing those with benchmarks that read from "whatever offset the allocator happened to give" can be quirky.
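
To illustrate the idea (this is not the benchmark suite's actual code, just a sketch with an ad-hoc xorshift to pick offsets):

```rust
use std::time::Instant;

fn main() {
    const LEN: usize = 1 << 20; // 1 MiB per measurement
    const SLACK: usize = 64; // allow start offsets anywhere within a cache line
    let buf = vec![0xabu8; LEN + SLACK];

    let mut seed: u64 = 0x9e3779b97f4a7c15; // xorshift state, just for offsets
    for _ in 0..5 {
        seed ^= seed << 13;
        seed ^= seed >> 7;
        seed ^= seed << 17;
        let offset = (seed as usize) % SLACK;

        // Hash the same amount of data from a different offset each run,
        // so no single alignment gets "lucky".
        let start = Instant::now();
        let hash = blake3::hash(&buf[offset..offset + LEN]);
        println!("offset {offset:2}: {:?} ({hash})", start.elapsed());
    }
}
```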

Another major source of quirkiness is TurboBoost and its related CPU downclocking behaviors. If one benchmark heats up the CPU more for unrelated reasons (e.g. shorter pauses between iterations), that can lead to more downclocking and lower measured performance. Disabling TurboBoost is almost always a good idea, at least as a point of comparison, and especially when you're trying to nail down suspicious quirks.

Alg0ritmus commented 2 years ago

Well, I have to say that I did all measurements with TurboBoost enabled, so that's probably the problem. Thanks!

oconnor663 commented 2 years ago

Another issue, which applies only to incremental hashing (so probably not your benchmarks), is using a too-small buffer size, as documented in the "performance note" here. The implementation can only parallelize over input that it can see, and in a buffered .update() loop that means the size of the buffer rather than the size of the whole input (see the sketch below). Unfortunately, some common APIs like std::io::copy use a buffer size that's too small to take advantage of the latest and greatest SIMD instructions on x86. Where it applies, though, this tends to produce large throughput differences, like a factor of two or more; it's not something that would explain ±1% differences.
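
A minimal sketch of the buffered-loop shape in question; the 1 MiB buffer size and the file path are placeholder choices:

```rust
use std::fs::File;
use std::io::{self, Read};

fn hash_file(path: &str) -> io::Result<blake3::Hash> {
    let mut file = File::open(path)?;
    let mut hasher = blake3::Hasher::new();
    // The implementation can only use SIMD/parallelism across the bytes in
    // each update() call, so a tiny buffer (e.g. std::io::copy's 8 KiB
    // default) caps throughput. 1 MiB is an arbitrary "plenty big" choice.
    let mut buf = vec![0u8; 1 << 20];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.update(&buf[..n]);
    }
    Ok(hasher.finalize())
}

fn main() -> io::Result<()> {
    println!("{}", hash_file("testfile.bin")?); // hypothetical ~1 GB input
    Ok(())
}
```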

Alg0ritmus commented 2 years ago

I disabled TurboBoost, but the results are basically the same:

Python impl. (Hasher.update(file, multithreading=True)): 0.148s (best of 5)
Rust impl. (update_rayon(file), under --release):        0.186s (best of 5)

The file is ~1 GB.

oconnor663 commented 2 years ago

That's a pretty big difference. Can you share the benchmarks so I can run them myself and take a look? Maybe put them in a Gist or similar?

Alg0ritmus commented 2 years ago

Sure, here's my git. The test file can be downloaded from the link (in the readme).

oconnor663 commented 2 years ago

When I run that myself, here's what I see with TB off, best of 10:

Rust:   119.75ms
Python: 118.64ms

Similar results with TB on, too. So on my machine Python is consistently a hair faster, but definitely not the ~25% difference you're seeing. Not sure what it could be. (Are you on ARM? AArch64? Is there a chance one side is activating NEON and the other side isn't?)

Alg0ritmus commented 2 years ago

Actually no, I'm using an Intel i5-8250U. (I need to remove the neon feature from Cargo.toml.) I really don't know what the problem could be.

sneves commented 2 years ago

I don't see any significant difference on a similar machine to that i5-8250U. If anything, Python loses out most of the time by a tiny bit.

multithreading=True requires the older 0.2.1 blake3-py. Maybe the difference is between whatever <= 0.2.1 did and what 0.3.1 does?

Alg0ritmus commented 2 years ago

I've run some little benchmarks with v0.2.1 (the one I used before) and v0.3.1, and the results are very similar:

v0.2.1: 0.1398s (best of 5, tested again)
v0.3.1: 0.1394s (best of 5)
Rust:   0.1423s (best of 10)

The difference is even smaller now (I really don't know why), but Python is still doing better most of the time.