cessen / ropey

A utf8 text rope for manipulating and editing large texts.

fix Hash implementation #66

Closed by pascalkuthe 1 year ago

pascalkuthe commented 2 years ago

The current Hash implementation of Ropey violates the std API contract and therefore breaks with most non-default Hashers.

Ropey's current Hash implementation simply hashes all chunks in a loop. However, this means that the chunk boundaries are encoded into the hash. From the documentation of Hasher:

This trait provides no guarantees about how the various `write_*` methods are defined and implementations of `Hash` should not assume that they work one way or another. You cannot assume, for example, that a `write_u32` call is equivalent to four calls of `write_u8`. Nor can you assume that adjacent write calls are merged, so it's possible, for example, that

```rust
# fn foo(hasher: &mut impl std::hash::Hasher) {
hasher.write(&[1, 2]);
hasher.write(&[3, 4, 5, 6]);
# }
```

and

```rust
# fn foo(hasher: &mut impl std::hash::Hasher) {
hasher.write(&[1, 2, 3, 4]);
hasher.write(&[5, 6]);
# }
```

end up producing different hashes.

Thus to produce the same hash value, `Hash` implementations must ensure for equivalent items that exactly the same sequence of calls is made -- the same methods with the same parameters in the same order.

This means that the same write calls must happen no matter where the chunk boundaries are. The tests did not catch this because the default Hasher does not exploit this property. However, most other Hashers with weaker DoS-resistance guarantees (ahash, fxhash, fnvhash) do exploit it, and the default Hasher may also start to exploit it at any point in the future (std would not consider that a breaking change). When such a Hasher is used with Ropey, the guarantee `rope1 == rope2 => hash(rope1) == hash(rope2)` no longer holds. This leads to HashMaps with the same key present multiple times (or failed lookups when the key does exist).
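To see the breakage concretely, here is a minimal standalone demonstration (not code from this PR; it assumes the `rustc-hash` crate as a stand-in for fxhash):

```rust
use std::hash::Hasher;

use rustc_hash::FxHasher;

fn main() {
    // Same bytes, split across `write` calls differently -- exactly what
    // happens when the same text lands on different chunk boundaries.
    let mut a = FxHasher::default();
    a.write(b"he");
    a.write(b"llo");

    let mut b = FxHasher::default();
    b.write(b"hello");

    // FxHasher mixes each `write` call as a separate unit, so these two
    // values differ in practice despite hashing identical bytes.
    println!("{:#x} vs {:#x}", a.finish(), b.finish());
}
```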

The simplest fix is to call write_u8 for each byte of each chunk. However, this is extremely slow, because Hashers usually heavily optimize batch operations. For ahash, for example, it is at least 16 times slower, because ahash always operates on 16 bytes at once.
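As an illustration, the naive approach would look something like this sketch (hypothetical code, not the PR's; it only assumes ropey's public `chunks()` iterator):

```rust
use std::hash::Hasher;

// Correct but slow: feed the hasher one byte at a time, so chunk
// boundaries can no longer influence the sequence of write calls.
fn hash_per_byte<H: Hasher>(rope: &ropey::Rope, state: &mut H) {
    for chunk in rope.chunks() {
        for &byte in chunk.as_bytes() {
            // One write_u8 per byte defeats the hasher's batch fast paths.
            state.write_u8(byte);
        }
    }
}
```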

Instead, I have created a Hash implementation that uses a temporary buffer to divide the bytes of the rope into slices of length BYTES_MIN (except the last slice, which may be smaller) and passes those to the Hasher. This causes a bit of overhead, as some bytes need to be copied at the chunk boundaries. However, the overhead should be much smaller than calling write_u8 repeatedly and should only degrade performance slightly.
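A minimal sketch of that scheme (hypothetical code; `BLOCK_SIZE` stands in for BYTES_MIN, and the actual PR differs in its details):

```rust
use std::hash::Hasher;

const BLOCK_SIZE: usize = 64; // stand-in for BYTES_MIN

// Accumulate rope bytes into a fixed-size buffer so the hasher always sees
// BLOCK_SIZE-byte writes (plus one final short write), independent of where
// the chunk boundaries fall.
fn hash_blocked<H: Hasher>(rope: &ropey::Rope, state: &mut H) {
    let mut buf = [0u8; BLOCK_SIZE];
    let mut len = 0;
    for chunk in rope.chunks() {
        let mut bytes = chunk.as_bytes();
        while !bytes.is_empty() {
            // Copy as much as fits into the staging buffer.
            let n = (BLOCK_SIZE - len).min(bytes.len());
            buf[len..len + n].copy_from_slice(&bytes[..n]);
            len += n;
            bytes = &bytes[n..];
            // Flush only full blocks; partial blocks wait for more bytes.
            if len == BLOCK_SIZE {
                state.write(&buf);
                len = 0;
            }
        }
    }
    if len > 0 {
        state.write(&buf[..len]); // final partial block
    }
}
```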

pascalkuthe commented 2 years ago

Thanks for the review. I won't have time to get into this too much today, but I will try to address your comments and perform some benchmarks later this week.

cessen commented 2 years ago

Yeah, no rush! I've actually come down with Covid (I tried to be pretty careful at the conference, masking up and everything, but nothing's 100%). So my brain isn't at its best at the moment anyway, and I'll likely be getting to things a bit slowly for the next week-ish as I recover.

cessen commented 2 years ago

Thinking about it more, I actually have a somewhat specific implementation in mind for this, because I've implemented code like this many, many times before (ironically, internally to hashers, to avoid exactly this issue). But it's not reasonable to expect you to hit a narrow target like that via progressive comments in a code review.

So I think what probably makes the most sense is to accept this PR, and then I can do my own implementation afterwards if I have the energy.

~However, I would still like the fuzz tests to be moved to property tests, since the issue this addresses is a lot closer to a logic error than a safety issue.~ (I'll revisit this when my head is on straight again.)

cessen commented 1 year ago

So, I think what I'm going to do (if it's okay with you) is actually just redo this PR myself, but nabbing some of your unit tests. I was initially hesitant to do that because I didn't want to steal credit away from your work (which is very appreciated!). But I'll just make sure to credit you in the commit message(s). Moreover, I definitely plan to merge your other PRs once they pass review, of which #70 in particular is quite substantial.

Anyway, my rationale behind just re-doing it is:

* I'll basically be reimplementing the Hash impl anyway.

* I don't think we actually need any fuzzing/property-esque tests for this _particular_ bug. Just normal unit tests are fine, to ensure we don't regress in the future.

* Those tests can be improved by adding some private/doc-hidden APIs to Ropey to allow building a Rope with specific chunks, which I think will be useful for testing other things as well. That way we can directly construct the test cases we're looking for, rather than trying to indirectly coax Ropey into hopefully constructing things with the chunk boundaries we want, which isn't reliable into the future anyway.

So I'll just be taking care of that all at once. That okay with you @pascalkuthe?

pascalkuthe commented 1 year ago

> So, I think what I'm going to do (if it's okay with you) is actually just redo this PR myself, but nabbing some of your unit tests. I was initially hesitant to do that because I didn't want to steal credit away from your work (which is very appreciated!). But I'll just make sure to credit you in the commit message(s). Moreover, I definitely plan to merge your other PRs once they pass review, of which #70 in particular is quite substantial.
>
> Anyway, my rationale behind just re-doing it is:
>
> * I'll basically be reimplementing the Hash impl anyway.
>
> * I don't think we actually need any fuzzing/property-esque tests for this _particular_ bug. Just normal unit tests are fine, to ensure we don't regress in the future.
>
> * Those tests can be improved by adding some private/doc-hidden APIs to Ropey to allow building a Rope with specific chunks, which I think will be useful for testing other things as well. That way we can directly construct the test cases we're looking for, rather than trying to indirectly coax Ropey into hopefully constructing things with the chunk boundaries we want, which isn't reliable into the future anyway.
>
> So I'll just be taking care of that all at once. That okay with you @pascalkuthe?

Sure, go ahead. It would be nice if you put some kind of credit into a commit, but I mostly just care that the bug gets fixed in a performant manner. Please also test with either ahash or fxhash, as the performance differences are much more visible there than with siphash.

archseer commented 1 year ago

> But I'll just make sure to credit you in the commit message(s).

If you put a `Co-authored-by: Name Here <email@org>` trailer in the commit message, GitHub will correctly attribute it to both users :)

cessen commented 1 year ago

Done in fef5be9c9587974584bafa40c54a375ce6fbbc9a.

I also added benchmarks with the default hasher, fnv, and fxhash at various text sizes. Unsurprisingly (to me, at least), fnv is the slowest. People think of fnv as a "fast" hash, but it works per-byte which actually makes it pretty slow unless you're using extremely small input sizes.

To test overhead, I also implemented a bogus version that just passes the first chunk to the hasher and then stops. For the benches with very small inputs, this ends up being (incidentally) correct, but with as low overhead as possible. Compared to that, the impl I've ended up with has about 15% performance overhead on tiny inputs, for all the tested hashes, which doesn't seem too bad to me. That overhead should be notably less on larger inputs due to amortization, though I haven't tested yet to confirm.

In any case, it seems at least reasonably efficient.

pascalkuthe commented 1 year ago

> Done in fef5be9.
>
> I also added benchmarks with the default hasher, fnv, and fxhash at various text sizes. Unsurprisingly (to me, at least), fnv is the slowest. People think of fnv as a "fast" hash, but it works per-byte which actually makes it pretty slow unless you're using extremely small input sizes.
>
> To test overhead, I also implemented a bogus version that just passes the first chunk to the hasher and then stops. For the benches with very small inputs, this ends up being (incidentally) correct, but with as low overhead as possible. Compared to that, the impl I've ended up with has about 15% performance overhead on tiny inputs, for all the tested hashes, which doesn't seem too bad to me. That overhead should be notably less on larger inputs due to amortization, though I haven't tested yet to confirm.
>
> In any case, it seems at least reasonably efficient.

Awesome, thanks for implementing a fix so quickly! Yeah, fnv is not a great hasher. fxhash is generally faster than siphash, though (ahash is more robust against hash collisions at the cost of being ever so slightly slower than fxhash when not compiling with the AES CPU feature, so I generally use that). The performance numbers sound good. How did you end up with a 256-byte buffer size? I would have expected larger buffers to be faster, or did that not make a difference when put to the test?

cessen commented 1 year ago

> How did you end up with a 256-byte buffer size?

Just trial-and-error, testing power-of-two sizes. I actually want to revisit that when I can bench on my main machine. I'm currently isolating in my bedroom, so I just have my tiny, slow laptop to use. That means there's also a lot of CPU frequency scaling going on, which makes benchmarking a bit tricky (and is also why I've only tested powers of two so far; more fine-grained changes were too noisy on this machine). So once I can benchmark on my desktop machine, I might tweak things more based on those results.

> I would have expected larger buffers to be faster, or did that not make a difference when put to the test?

Of the sizes I tested, 256 was the fastest. 512 was about on par, but slightly slower. 1024 and larger sharply fell off in performance, and 128 and lower fell off somewhat less sharply.

The approach I implemented can skip data copying (just passing slices directly from the chunks) under certain conditions, one of which is that the chunk has to be at least as large as the block size. So making the block size smaller actually makes the zero-copy code path more likely. My suspicion is that the result comes down to that versus amortizing the loop overhead: 256 is large enough to amortize the loop-and-test overhead fairly well, but small enough to frequently take the zero-copy code path. That's my guess, at least. More benchmarking/testing is still needed to narrow in on the best performance, I think.
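A rough sketch of that fast path (hypothetical code extending the earlier blocked sketch, not the actual implementation):

```rust
use std::hash::Hasher;

const BLOCK_SIZE: usize = 64; // illustrative; the PR settled on 256

// Zero-copy refinement: when nothing is buffered and the chunk still holds
// at least one full block, pass block-sized slices straight to the hasher.
// Copying only happens for the leftovers around chunk boundaries.
fn hash_blocked_zero_copy<H: Hasher>(rope: &ropey::Rope, state: &mut H) {
    let mut buf = [0u8; BLOCK_SIZE];
    let mut len = 0;
    for chunk in rope.chunks() {
        let mut bytes = chunk.as_bytes();
        while !bytes.is_empty() {
            if len == 0 && bytes.len() >= BLOCK_SIZE {
                // Fast path: no buffered leftovers, a full block available.
                state.write(&bytes[..BLOCK_SIZE]);
                bytes = &bytes[BLOCK_SIZE..];
            } else {
                // Slow path: stage bytes until a full block accumulates.
                let n = (BLOCK_SIZE - len).min(bytes.len());
                buf[len..len + n].copy_from_slice(&bytes[..n]);
                len += n;
                bytes = &bytes[n..];
                if len == BLOCK_SIZE {
                    state.write(&buf);
                    len = 0;
                }
            }
        }
    }
    if len > 0 {
        state.write(&buf[..len]); // final partial block
    }
}
```

Either path emits exactly the same sequence of BLOCK_SIZE-byte writes followed by one final partial write, so the hash stays independent of chunk boundaries.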

That does mean that there's likely some relationship between the block size and average chunk size with respect to performance. But it's not clear if it's a linear relationship, so leaving it independent still seems best to me.