Thanks for the review. I won't have time to get into this too much today, but I will try to address your comments and perform some benchmarks later this week.
Yeah, no rush! I've actually come down with Covid (I tried to be pretty careful at the conference, masking up and everything, but nothing's 100%). So my brain isn't at its best at the moment anyway, and I'll likely be getting to things a bit slowly for the next week-ish as I recover.
Thinking about it more, I actually have a somewhat specific implementation in mind for this, because I've implemented code like this many, many times before (ironically, internally to hashers, to avoid exactly this issue). But it's not reasonable to expect you to hit a narrow target like that via progressive comments in a code review.
So I think what probably makes the most sense is to accept this PR, and then I can do my own implementation afterwards if I have the energy.
~However, I would still like the fuzz tests to be moved to property tests, since the issue this addresses is a lot closer to a logic error than a safety issue.~ (I'll revisit this when my head is on straight again.)
So, I think what I'm going to do (if it's okay with you) is actually just redo this PR myself, but nabbing some of your unit tests. I was initially hesitant to do that because I didn't want to steal credit away from your work (which is very appreciated!). But I'll just make sure to credit you in the commit message(s). Moreover, I definitely plan to merge your other PRs once they pass review, of which #70 in particular is quite substantial.
Anyway, my rationale behind just re-doing it is:
* I'll basically be reimplementing the Hash impl anyway.
* I don't think we actually need any fuzzing/property-esque tests for this _particular_ bug. Just normal unit tests are fine, to ensure we don't regress in the future.
* Those tests can be improved by adding some private/doc-hidden APIs to Ropey to allow building a Rope with specific chunks, which I think will be useful for testing other things as well (see the sketch after this list). That way we can directly construct the test cases we're looking for, rather than trying to indirectly coax Ropey into hopefully constructing things with the chunk boundaries we want, which isn't reliable into the future anyway.
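To make that concrete, here's roughly the kind of test such an API would enable. This is only a sketch under assumptions: `_append_chunk` is a hypothetical doc-hidden method name, not an existing Ropey API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical helper: build a rope with exact, caller-chosen chunk
// boundaries via an assumed doc-hidden builder method.
fn rope_from_chunks(chunks: &[&str]) -> ropey::Rope {
    let mut builder = ropey::RopeBuilder::new();
    for chunk in chunks {
        builder._append_chunk(chunk); // hypothetical doc-hidden API
    }
    builder.finish()
}

#[test]
fn hash_ignores_chunk_boundaries() {
    // Same text, deliberately different chunking.
    let a = rope_from_chunks(&["hello ", "world"]);
    let b = rope_from_chunks(&["hel", "lo wor", "ld"]);

    let (mut ha, mut hb) = (DefaultHasher::new(), DefaultHasher::new());
    a.hash(&mut ha);
    b.hash(&mut hb);
    assert_eq!(ha.finish(), hb.finish());
}
```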
So I'll just be taking care of that all at once. That okay with you @pascalkuthe?
Sure, go ahead. It would be nice if you put some kind of credit into a commit, but I mostly just care that the bug gets fixed in a performant manner. Please also test with either ahash or fxhash, as the performance differences are much more visible there than with siphash.
> But I'll just make sure to credit you in the commit message(s).
If you put a `Co-authored-by: Name Here <email@org>` trailer in the commit message, GitHub will correctly attribute it to both users :)
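For example (the subject line, name, and email here are placeholders, not the actual commit):

```
Fix Hash impl to be independent of chunk boundaries

Co-authored-by: Pascal Kuthe <pascal@example.com>
```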
Done in fef5be9c9587974584bafa40c54a375ce6fbbc9a.
I also added benchmarks with the default hasher, fnv, and fxhash at various text sizes. Unsurprisingly (to me, at least), fnv is the slowest. People think of fnv as a "fast" hash, but it works per-byte, which actually makes it pretty slow unless you're using extremely small input sizes.
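For context, the whole algorithm is a per-byte loop; here's a minimal FNV-1a sketch (64-bit, standard constants) showing why it can't batch input the way wider hashers do:

```rust
// FNV-1a, 64-bit: one xor and one multiply per input byte. Simple
// and fine on tiny keys, but inherently byte-at-a-time, so longer
// inputs pay a per-byte cost that batching hashers avoid.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}
```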
To test overhead, I also implemented a bogus version that just passes the first chunk to the hasher and then stops. For the benches with very small inputs, this ends up being (incidentally) correct, but with as low overhead as possible. Compared to that, the impl I've ended up with has about 15% performance overhead on tiny inputs, for all the tested hashes, which doesn't seem too bad to me. That overhead should be notably less on larger inputs due to amortization, though I haven't tested yet to confirm.
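Roughly, that baseline looks like this (a sketch, not the actual benchmark code; the wrapper type is just to make it compilable outside Ropey):

```rust
use std::hash::{Hash, Hasher};

// "Bogus" overhead baseline: hash only the first chunk and stop.
// Incidentally correct when the whole rope fits in one chunk, and
// a lower bound on the cost any real implementation must beat.
struct FirstChunkOnly<'a>(&'a ropey::Rope);

impl<'a> Hash for FirstChunkOnly<'a> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        if let Some(chunk) = self.0.chunks().next() {
            state.write(chunk.as_bytes());
        }
    }
}
```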
In any case, it seems at least reasonably efficient.
Awesome, thanks for implementing a fix so quickly! Yeah, fnv is not a great hasher. Fxhash is generally faster than siphash though (ahash is more robust against hash collisions at the cost of being ever so slightly slower than fxhash when not compiling with the AES CPU feature, so I generally use that). The performance numbers sound good. How did you end up with a 256 buffer size? I would have expected larger buffers to be faster, or did that not make a difference when put to the test?
> How did you end up with a 256 buffer size?
Just trial-and-error, testing power-of-two sizes. I actually want to revisit that when I can bench on my main machine. I'm currently isolating in my bedroom, so I just have my tiny, slow laptop to use. And that means there's also a lot of CPU frequency scaling going on, which makes benchmarking a bit tricky (which is also why I've only tested powers of two so far--more fine-grained changes were too noisy on this machine). So once I can benchmark on my desktop machine, I might tweak things more based on those results.
> I would have expected larger buffers to be faster, or did that not make a difference when put to the test?
Of the sizes I tested, 256 was the fastest. 512 was about on par, but slightly slower. 1024 and larger sharply fell off in performance, and 128 and lower fell off somewhat less sharply.
The approach I implemented can skip data copying (just passing slices directly from the chunks) under certain conditions, one of which is that the chunk has to be larger than or equal to the block size. So making the block size smaller actually makes the zero-copy code path more likely. My suspicion is that the result comes down to that vs. amortizing the loop overhead: 256 is large enough to amortize the loop-and-test overhead fairly well, but small enough to frequently take the zero-copy code path. That's my guess, at least. More benchmarking/testing is still needed to narrow in on the best performance, I think.
That does mean that there's likely some relationship between the block size and average chunk size with respect to performance. But it's not clear if it's a linear relationship, so leaving it independent still seems best to me.
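To illustrate the trade-off, here's a simplified sketch of the per-chunk feeding logic (the function and the `BLOCK_SIZE` name are illustrative, not the exact code in the commit):

```rust
use std::hash::Hasher;

const BLOCK_SIZE: usize = 256; // empirically the fastest size so far

// Every full `write` the hasher sees is exactly BLOCK_SIZE bytes at a
// fixed absolute offset, so the call sequence is identical no matter
// where the rope's chunk boundaries happen to fall.
fn feed_chunk<H: Hasher>(state: &mut H, buffer: &mut Vec<u8>, mut chunk: &[u8]) {
    // Top up a partial block left over from the previous chunk.
    if !buffer.is_empty() {
        let take = (BLOCK_SIZE - buffer.len()).min(chunk.len());
        buffer.extend_from_slice(&chunk[..take]);
        chunk = &chunk[take..];
        if buffer.len() == BLOCK_SIZE {
            state.write(buffer.as_slice());
            buffer.clear();
        }
    }
    // Zero-copy fast path: the buffer is empty, so we're at a block
    // boundary and whole blocks can be fed straight from the chunk.
    // Note this requires chunk.len() >= BLOCK_SIZE, which is why a
    // smaller block size takes this path more often.
    while chunk.len() >= BLOCK_SIZE {
        state.write(&chunk[..BLOCK_SIZE]);
        chunk = &chunk[BLOCK_SIZE..];
    }
    // Stash the tail until the next chunk (or a final short flush).
    buffer.extend_from_slice(chunk);
}
```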
The current `Hash` implementation of Ropey violates the std API contract and therefore breaks with most non-default `Hasher`s.

Ropey's current `Hash` implementation simply hashes all chunks in a loop. However, this means that the chunk boundaries are encoded into the `Hash`. Per the documentation of `Hasher`, no matter what the chunk boundaries are, the same `write` calls must happen. The tests did not catch this because the default `Hasher` does not exploit this property. However, most other `Hasher`s with less strong guarantees about DoS resistance (`ahash`, `fxhash`, `fnvhash`) do exploit this property. The default `Hasher` may also start to exploit this property at any point in the future (and it would not be considered a breaking change by std). When using such a `Hasher` with Ropey, the property `hash(rope1) != hash(rope2) => rope1 != rope2` does not hold anymore. This leads to `HashMap`s with the same key present multiple times (or failed lookups when the key does exist).

The simplest fix for this is simply to call `write_u8` for each byte of the chunk. However, this is extremely slow, because `Hasher`s usually heavily optimize batch operations. For example, for `ahash` this is at least 16 times slower, because it always operates on 16 bytes at once.

Instead, I have created a hash implementation that uses a temporary buffer to divide the bytes of the rope into slices of length `BYTES_MIN` (except the last slice, which may be smaller) and passes those to the `Hasher`. This causes a bit of overhead, as some bytes need to be copied at the chunk boundaries. However, the overhead should be much smaller than calling `write_u8` repeatedly and should only degrade performance slightly.
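A minimal self-contained sketch of this scheme, under assumptions (the `BYTES_MIN` value and the wrapper type are illustrative; the real constant and impl live inside Ropey):

```rust
use std::hash::{Hash, Hasher};

const BYTES_MIN: usize = 64; // illustrative block size

// Re-block the rope's bytes into fixed-length slices so the Hasher
// sees the same sequence of `write` calls regardless of how the rope
// is chunked internally.
struct BlockHashed<'a>(&'a ropey::Rope);

impl<'a> Hash for BlockHashed<'a> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        let mut buf = [0u8; BYTES_MIN];
        let mut len = 0;
        for chunk in self.0.chunks() {
            let mut bytes = chunk.as_bytes();
            while !bytes.is_empty() {
                // Copy into the block buffer; flush on each full block.
                let take = (BYTES_MIN - len).min(bytes.len());
                buf[len..len + take].copy_from_slice(&bytes[..take]);
                len += take;
                bytes = &bytes[take..];
                if len == BYTES_MIN {
                    state.write(&buf);
                    len = 0;
                }
            }
        }
        // The final slice may be shorter than BYTES_MIN.
        if len > 0 {
            state.write(&buf[..len]);
        }
    }
}
```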