cocreature / thrill

Thrill - An EXPERIMENTAL Algorithmic Distributed Big Data Batch Processing Framework in C++
http://project-thrill.org
Other
0 stars 0 forks source link

Find a somewhat reasonable hash function #2

Closed cocreature closed 7 years ago

cocreature commented 7 years ago

The hash function is actually quite crucial for the algorithm to work correctly. In particular std::hash is pretty much useless because it hashes numbers to themselves. If you are counting numbers below some range, the first p bits and thereby the index will always be zero so m-1 registers are useless. For now I used val * 2654435761 as the hash function (shamelessly stolen from stackoverflow) but I highly doubt that this is sufficiently good.

We shouldn’t focus on hashing so it probably makes sense to use a relatively simple Hash function. The google paper mentions that they tested MD5, SHA1, SHA256 and MURMUR3 but haven’t found significant differences. So we could either use one of those or if we only want to test counting integers there might be simpler hash functions.

cocreature commented 7 years ago

I switched to a siphash implementation (also used in the Haskell package for hyperloglog) and the results are significantly better. We now seem to have relative errors of about 1-2% which seems about right.

cocreature commented 7 years ago

Closing, we can always revisit this later (and probably should) but for now I think this does the trick.