BLAKE3-team / BLAKE3

the official Rust and C implementations of the BLAKE3 cryptographic hash function
Apache License 2.0
5.02k stars 341 forks source link

Proposal: Simplify/improve Hash::PartialEq impls and remove constant_time_eq crate #417

Open silverstillisntgold opened 3 weeks ago

silverstillisntgold commented 3 weeks ago

Hello. I'm currently working on a small project for hashing (and subsequently verifying) entire file directories using blake3. As part of validating hashed file results, I need to use Hash::eq quite a lot, and while trying to micro-optimize everything I discovered that the current Hash::eq is pretty bad, and I think it can be better. As far as I can tell, constant_time_eq doesn't do anything special, but if there's a specific quality of it's comparison that blake3 relies on that makes a proposal like this unfeasible than I'm sorry to be a pain. Here's the asm currently generated by blake3. It seems like it's doing what's fundamentally a pretty simple operation in quite a complicated way. constant_time_eq

All that Hash::eq really needs to do (again, as far as I understand) is ensure all 256 bits of one Hash are equal to all 256 bits of another, so I wrote a naive AVX-based eq() to do so. custom

Then I dug into the rust lib and discovered that they have a specialized PartialEq implementation for same-size arrays. It uses internal intrinsics to tell the compiler to directly compare bytes, which generates an even better eq(). std_impl

It's also extremely simple to implement, with no conditional compilation, external crates, or hand-written asm required, since it just calls the built-in eq() and defers to the rust compiler to generate asm based on the target arch. In this case I was targeting x86-64-v3, but this works equally well for v2 (see godbolt link below) and v1.

The only potential issue I can think of with this approach is alignment. From everything I've been able to find online and from what little testing I've done, modern CPUs (I use a relatively old R7 2700 for testing) with AVX and greater don't seem to care about aligned vs. unaligned all that much. But older CPUs and non x86 cores might perform poorly. If this proposal were to be implemented, it might be beneficial to make Hash #[repr(align(32))], but I'm not sure if this could be considered a backwards-compatible change per semver. Since people using blake3 who have internal Structs holding Hash instances may see Struct sizes change, depending on how Struct members are arranged. Not sure how you'd want to handle that or if it's even worth considering but figured I'd mention it. It may be enough to leave alignment alone and let the compiler make it's own decisions based on the target arch.

Here's the godbolt if you're interested in seeing how different platforms respond to the change. Note that x86-64-v2 and v1 generate two additional loads when Hash isn't manually aligned to 32-bytes.

If this is something you'd be fine with I can make a PR including comments explaining usage of rust-provided eq() over alternatives.

BurningEnlightenment commented 3 weeks ago

As far as I can tell, constant_time_eq doesn't do anything special, but if there's a specific quality of it's comparison that blake3 relies on

Given that the description of constant_time_eq isn't particularly… extensive, I'll point you to the patch which introduced an equivalent function to the linux kernel: torvalds/linux@6bf37e5aa90f18baf5acf4874bca505dd667c37f . The TL;DR is that in an "online" scenario the timing of the comparison must not leak information about the (potentially secret) value being compared.

Now you might be inclined to point out that the generated AVX (or even SSE2) instructions only use constant-time-safe bit operations. However, switch the godbolt target to armv8 and you will see lots of ccmp being generated, i.e. the CPU can skip the subsequent comparisons.

oconnor663 commented 3 weeks ago

The requirement for a constant-time comparison here mostly concerns keyed_hash, which has the same security needs as standard HMAC. It's the same reason that for example hmac.compare_digest and secrets.compare_digest from Python both document their constant-time guarantees. The problem is that the first step in verifying an HMAC is computing the correct HMAC for the same input, and if it turns out verification fails and the message was forged, then that correct HMAC must remain secret. (Otherwise you become an HMAC "oracle" for your attacker.) If you use non-constant-time tests during the verification, it turns out that that leaks information about how close the wrong MAC was to the right one. See this post and this CryptoPals challenge for more.

I'm not surprised that constant_time_eq isn't as optimized as it could be. We could even write our own, for example doing some sort of atomic function pointer swap to make it efficient to call it at runtime without paying any extra branches for CPU feature detection. (It would still cost an indirect function call.) But I wouldn't want to worry about any of that unless we had a benchmark that showed it really mattered. My starting assumption is that the cost of invoking the compression function even once is going to completely dwarf the cost of this whole comparison.