Open stgatilov opened 9 months ago
I think a good solution would be to define custom tsl::hash<T>
.
Depending on __GLIBCXX__
(perhaps also check what's going on in libc++), it should either redirect to std::hash<T>
, or call std::hash<T>
and run its output through some hash finalizer.
Then replace std::hash
with tsl::hash
as default template argument in all class declarations.
GCC users will get proper hash function by default. No changes for non-GCC users. And hash function will remain fully tweakable.
Yes, at the time of creating the map, I used the default std::hash
as I didn't wanted to maintain my own hash function or have the repository depends on an external dependency, and adapted the hash in my projects depending on the usage.
Your solution would be a good compromise, I'll check to work on it when I have more time but don't hesitate to create a PR in the meantime if you have the time.
I started working on PR. Thus far I have added a unit test checking that default hash function on uint64_t depends on upper half of the integer.
I decided to check GCC / libstd-c++ to see which hash finalizer is used there. It turns out that no hash finalizer is used at all! They use prime size for hash table and simply hope that "modulo size" reduction is good enough.
For example, this code takes 8 seconds for 50K elements on my machine:
std::unordered_set<uint64_t> set;
set.reserve(NumElements);
size_t p = set.bucket_count();
for (uint64_t i = 0; i < NumElements; i++)
set.insert(i * p);
The same happens on Clang / libc++.
Clearly, this issue is very unlikely to happen unintentionally if prime size is used.
In the PR, I decided to use hash finalizers from MurmurHash2. They are rather well-known and do very few operations. Although 64-bit case looks weird: it seems that output bits in [17..47) range don't depend on the higher bits in the same range.
The newer MurmurHash3 uses one more shift & multiply round in finalizer. Maybe it should be used in 64-bit case? More specifically, the fmix64 function.
I ran the following code built with GCC in Linux VM:
and I got:
So I guess inserting integers divisible by 2^20 takes quadratic time. Moreover, trying to insert 16M values results in a crash. Most likely because
std::hash<uint64_t>(x) = x
on GCC.Note that I used default settings and got no warnings! Awful hash function by default is rather critical issue for people who don't know much about hashing (and would probably do worse trying to implement their own hash function or hash table). And given that TSL interface is very STL-like, I think that's the audience it is targeted at.
A proper hash function usually contains three parts:
uint64_t
to an index in hash table.As usual, C++ standard is not precise enough, and STL is not cross-platform. On MSVC, std::hash performs steps 1 and 2, while std::unordered_set only does step 3. On GCC, std::hash only performs step 1, while std::unordered_set does steps 2 and 3. It means that if you use std::hash directly, then you should run your own hash finalizer. TSL hash table only does step 3, but uses std::hash, meaning that the crucial step 2 is missed.