Compare hashes when gettoken

jakobnissen commented 2 years ago

Hello, and thanks for developing Dictionaries!

I noticed in the source code for Indices that, when checking for the presence of a key K which hashes to hash H, there is no check if H is in hashes.

In Python's new dict implementation in 3.6, this optimization was mentioned by the contributer. It does incur one extra array index (which can even incur a cache miss), but if the keys are expensive to compare with isequal, potentially a lot of time can be saved. Also, when the key do not match, there is overwhelming probability the hashes will not match either, and the next hash can be compared instead - which may speed up the check even further.

Not sure if you have already tried implementing it, but it may be a low hanging fruit!

andyferris commented 2 years ago

I think the plan would be to do the hash comparison when isequal is more expensive than hash. For Int etc it is not helpful. We’d want to do benchmarks with long-ish strings (perhaps with common prefix?) and see how impactful it is.

jakobnissen commented 2 years ago

I don't think they issue is how expensive hashing is, that needs to be done anyway. It's more whether the extra memory access is faster than the (possibly multiple) isequal comparison(s).

I'll try to make a benchmark with string keys in a week's time to test it. 👍

andyferris commented 2 years ago

Sorry - what I wrote was quite unclear! Yes I meant we should estimate the relative cost of isequal(value1, value2) vs isequal(hash1, hash2). Sometimes the former is the same cost as the latter and, unlike python, Julia has inlined values so avoiding the fetch can be relatively more beneficial than in Python (which would have to fetch the value).

Some test like isbits might be a decent way of figuring out which way to go?

andyferris / Dictionaries.jl

Compare hashes when gettoken #58