HowardHinnant / hash_append

Implementation of hash_append proposal

Shouldn't hash_combine use different constants for 32/64 bit size_t? #7

Open IvanPizhenko opened 6 years ago

IvanPizhenko commented 6 years ago

The current code seems to be taken from Boost, but it is likely 32-bit oriented. Shouldn't a different constant be used for 64-bit size_t? https://github.com/HowardHinnant/hash_append/blob/bd892bfcbb1d84b05f87dd32cdb8c47e356c1a86/n3876.h#L34

HowardHinnant commented 6 years ago

This code came from http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf and is used only for comparison purposes with that proposal.

IvanPizhenko commented 6 years ago

Hmm, and is there a way for a non-committee member, i.e. just anyone, to submit such a remark on N3876? Because it really is 32-bit based. The paper Boost based it on, also mentioned in N3876, is http://www.cs.rmit.edu.au/~jz/fulltext/jasist-tch.pdf, and it says on page 7: "The fingerprint generation function should produce as close as possible to a uniform distribution of integers [5]. The minutiae produced lie between two bounds—generally 0 and an arbitrarily high number such as 2^32". So the Boost code was likely written in the era of 32-bit CPUs, and then N3876 blindly copied it.

IvanPizhenko commented 6 years ago

The constant itself comes from the TEA algorithm. If we follow the process of its derivation, as described here: https://softwareengineering.stackexchange.com/a/63599/188077, we must come up with a different constant for each hash length, as follows:
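In short, each constant is floor(2^n / phi) for an n-bit hash, where phi is the golden ratio. A small sketch collecting the values that come up later in this thread (the names are made up for illustration):

#include <cstdint>

// Per-width "magic numbers": floor(2^n / phi), with phi = (1 + sqrt(5)) / 2.
constexpr std::uint16_t hash_magic_16 = 0x9e37U;                // floor(2^16 / phi)
constexpr std::uint32_t hash_magic_32 = 0x9e3779b9U;            // floor(2^32 / phi)
constexpr std::uint64_t hash_magic_64 = 0x9e3779b97f4a7c15ULL;  // floor(2^64 / phi)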

HowardHinnant commented 6 years ago

This: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0814r2.pdf is the latest hash_combine proposal. It will be discussed this coming week in Jacksonville, FL. It looks poised to be standardized. I recommend emailing the author directly.

IvanPizhenko commented 6 years ago

Thank you, Howard. I have looked at the paper, and the "proposed wording" is quite generic and does not suggest any particular implementation, so there is actually no impact on the standard, and the reference example is still just an example and not a final implementation. So it looks like it would be more appropriate to file this against Boost.

NikolausDemmel commented 4 years ago

I'm doing a little research on how to create a simple hash-combine for my project, e.g. to support std::pair keys in std::unordered_map. I'd like to avoid adding an external dependency.

I came across the hash-combine mentioned above

seed ^= std::hash<T>{}(val) + 0x9e3779b9 + (seed<<6) + (seed>>2);

in multiple places, and thus also this issue. I'm mostly interested in a version where size_t is 64 bit, so I'm grateful for the "magic number" for 64-bit hashes posted above. However, it is not clear what should be done with the bit shifts. Can anyone here judge whether the following would be a reasonable hash combiner for 64 bit?

seed ^= std::hash<T>{}(val) + 0x9e3779b97f4a7c15 + (seed<<6) + (seed>>2); 

Should the shifts be different? For my application it doesn't have to be perfect and I'd prefer a simple and fast solution, but I'd like to avoid obvious mistakes.

IvanPizhenko commented 4 years ago

Since 64 bit is twice 32 bit, you can try doubling the shifts, i.e. seed << 12 and seed >> 4.

IvanPizhenko commented 4 years ago

To make it even more portable, maybe the following can be done:

#include <cstdint>
#include <functional>  // std::hash

template <typename T>
inline void hash_combine (std::uint16_t& seed, const T& val)
{
    seed ^= std::hash<T>{}(val) + 0x9e37U + (seed<<3) + (seed>>1);
}

template <typename T>
inline void hash_combine (std::uint32_t& seed, const T& val)
{
    seed ^= std::hash<T>{}(val) + 0x9e3779b9U + (seed<<6) + (seed>>2);
}

template <typename T>
inline void hash_combine (std::uint64_t& seed, const T& val)
{
    seed ^= std::hash<T>{}(val) + 0x9e3779b97f4a7c15LLU + (seed<<12) + (seed>>4);
}

The proper overload will be invoked automatically, depending on the actual size of std::size_t.
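For illustration, a minimal usage sketch for the std::pair-in-std::unordered_map case mentioned above (the names are made up, and it assumes std::size_t is one of these fixed-width typedefs, which, as discussed below, is not guaranteed):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

struct pair_hash
{
    template <typename A, typename B>
    std::size_t operator()(const std::pair<A, B>& p) const
    {
        std::size_t seed = 0;  // picks the overload matching the width of std::size_t
        hash_combine(seed, p.first);
        hash_combine(seed, p.second);
        return seed;
    }
};

// std::unordered_map<std::pair<int, std::string>, double, pair_hash> m;
// m[{42, "answer"}] = 1.0;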

NikolausDemmel commented 4 years ago

Awesome, thanks for your reply. Makes sense to me. It might not be the best possible, but considering we were using a simple xor at some point to combine hashes, this is a step up. I ended up using if constexpr, since there might be some weirdness where size_t is neither uint32_t nor uint64_t.

#include <cstddef>      // std::size_t
#include <functional>   // std::hash
#include <type_traits>  // std::false_type

// to work around static_assert(false, ...)
template <class T>
struct dependent_false : std::false_type {};

template <class T>
inline void hash_combine(std::size_t& seed, const T& value) {
  // Simple hash_combine, see e.g. here:
  // https://github.com/HowardHinnant/hash_append/issues/7
  // Not sure we ever need 32bit, but here it is...
  if constexpr (sizeof(std::size_t) == 4) {
    seed ^= std::hash<T>{}(value) + 0x9e3779b9U + (seed << 6) + (seed >> 2);
  } else if constexpr (sizeof(std::size_t) == 8) {
    seed ^= std::hash<T>{}(value) + 0x9e3779b97f4a7c15LLU + (seed << 12) +
            (seed >> 4);
  } else {
    static_assert(dependent_false<T>::value, "hash_combine not implemented");
  }
}

IvanPizhenko commented 4 years ago

@NikolausDemmel size_t is an unsigned type by definition, so the chances that it fits neither uint16_t, uint32_t, nor uint64_t are very low. Your solution with if constexpr obviously works, but it requires C++17, while the overloads work with earlier standards too.

NikolausDemmel commented 4 years ago

unsigned is clear, but what happened apparently is that size_t was unsigned long and uint64_t was unsigned long long: both 64 bit, but distinct types. Fair point about C++17; that is fine for me, but maybe not in other cases. I guess whoever finds this can choose whatever suits them.
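For example, a quick way to check this on a given platform (illustrative sketch):

#include <cstddef>
#include <cstdint>
#include <type_traits>

// Even when std::size_t and std::uint64_t have the same width (e.g. size_t as
// unsigned long and uint64_t as unsigned long long on some 64-bit targets),
// they can still be distinct types, so an overload taking std::uint64_t& will
// not bind to a std::size_t lvalue.
constexpr bool same_width = sizeof(std::size_t) == sizeof(std::uint64_t);
constexpr bool same_type  = std::is_same<std::size_t, std::uint64_t>::value;
// On such platforms: same_width is true, same_type is false.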

JVApen commented 2 years ago

Is there any update on the fix for this issue?

IvanPizhenko commented 2 years ago

@JVApen This is not a bug, just a suggestion. If you find it useful, you can take n3876.h from here and patch it yourself using the code examples discussed above. Also, since the discussion is already quite old and at the moment none of the code examples discussed above is included in n3876.h, I doubt @HowardHinnant will ever do it.

HowardHinnant commented 2 years ago

That's correct. hash_combine is not part of this proposal, nor do I recommend it. As the readme states, the entirety of this proposal is in one file: hash_append.h. Everything else is used as examples or comparisons.

IvanPizhenko commented 2 years ago

Even though @HowardHinnant doesn't recommend it, I personally still find n3876.h quite useful in cases where it's desirable to avoid Boost, at least until something better comes along in some future C++ standard.

CarloWood commented 2 years ago

Instead of writing code immediately, wouldn't it be better to first come up with a clear, mathematical description of what a hash function should accomplish?

At some point I was in need of a hash_combine for pointers. Note that a pointer is unique almost by definition (unless it points to dynamically allocated memory and you free that allocation again), and it is often used as such, i.e. pointers to global instances are used to assure uniqueness. Therefore there is no need to hash a pointer: it is its own hash. However, when combining two pointers there is quickly a problem, because (assuming pointers to memory allocated from the heap) the distance between the pointers is relatively small (the high bits are equal) and aligned (the low bits are all zero). All entropy is concentrated in a small portion of the available bits (assume 64 bits). If the entropy of those bits is not spread out over the full range, then the hash combine is bad, because that increases the chance of collisions.

What I did was look for collisions between any pair of two pointers from a large collection of pointers to "allocated" data; not really allocated, just all spaced 4 bytes apart with a different offset typical for malloc implementations (i.e., near zero, near 2^32, near 2^64, and somewhere in between). A set of 10,000 such "allocations" already gives rise to 100,000,000 pairs; by collapsing, say, every two bits into one (i.e., treating 11 as 00 and 01 as 10, and so on for the other combinations) it is easy to do brute-force searches.

What I did instead was look at the "distance" between two hash values in terms of the number of differing bits. That is, given two hash values h1 and h2, their distance is pop_count(h1 ^ h2). If h1 and h2 are totally uncorrelated, then that distance behaves the same as if h1 and h2 were pure random numbers. Note that here h1 and h2 are both combined hashes of two "pointers". The version I ended up with gave these statistics: even using 10,000 pointers in a concentrated region of memory and then considering all possible 100,000,000 pairs, all those pairs had a distance distribution that one would expect from pure random 64-bit values.
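A rough sketch of that kind of test, just to illustrate the popcount-distance idea (the combine step and the address pattern are placeholders, not the actual test program; <bit> requires C++20):

#include <bit>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Placeholder combine; substitute the hash_combine under test.
static std::uint64_t combine(std::uint64_t a, std::uint64_t b)
{
  constexpr std::uint64_t m = 0x9e3779b97f4a7c15;  // floor(2^64 / phi)
  return (a * m) ^ std::rotl(b * m, 32);
}

int main()
{
  // Simulated "allocations": addresses spaced 4 bytes apart near an arbitrary heap-like base.
  constexpr std::uint64_t base = 0x0000700000000000;
  std::vector<std::uint64_t> ptrs;
  for (std::uint64_t i = 0; i < 1000; ++i)
    ptrs.push_back(base + 4 * i);

  // Combine every pair of "pointers" into a hash.
  std::vector<std::uint64_t> hashes;
  for (std::size_t i = 0; i < ptrs.size(); ++i)
    for (std::size_t j = i + 1; j < ptrs.size(); ++j)
      hashes.push_back(combine(ptrs[i], ptrs[j]));

  // Distance distribution pop_count(h1 ^ h2) over a sample of hash pairs.
  // For uncorrelated 64-bit values this follows Binomial(64, 1/2), peaking at 32 differing bits.
  std::vector<std::uint64_t> buckets(65, 0);
  for (std::size_t i = 0; i + 1 < hashes.size(); i += 2)
    ++buckets[std::popcount(hashes[i] ^ hashes[i + 1])];

  for (int d = 0; d <= 64; ++d)
    std::printf("%2d differing bits: %llu\n", d, (unsigned long long)buckets[d]);
}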

This is just an example of such reasoning (and testing) though. I don't think that my hash_combine is suitable as a general hash_combine function (because it outputs zero when both inputs are zero; although that is something that could easily be fixed, it DOES show that care has to be taken; it is also not a problem for my case, because it really is only intended for pointers to memory allocations, not nullptr).

MichZia commented 1 year ago

> Therefore there is no need to hash a pointer: it is its own hash.

This is wrong. Pointers tend to be multiples of 8, and hash tables often have a 2^n size, so using bare pointers as keys wastes 7 out of every 8 buckets, causing an average of, say, 4 collisions when a hash table tries to maintain an alloc/size ratio of 2. Also, pointers are not random, especially when they point into array/vector/deque elements. So even more collisions are to be expected.

CarloWood commented 1 year ago

> > Therefore there is no need to hash a pointer: it is its own hash.
>
> This is wrong. Pointers tend to be multiples of 8, and hash tables often have a 2^n size, so using bare pointers as keys wastes 7 out of every 8 buckets, causing an average of, say, 4 collisions when a hash table tries to maintain an alloc/size ratio of 2. Also, pointers are not random, especially when they point into array/vector/deque elements. So even more collisions are to be expected.

Pointers are unique, so there is no reason to apply a hash algorithm to them. While two (different) pointers never collide (they are unequal), their hashes MIGHT collide, so you only worsen the situation. If two pointers are equal, then any hash of them will also be equal, of course. A hash (i.e. std::hash) is also 8 bytes. Hashing a pointer is just nonsense. I have no idea what you think would be colliding. If pointers are different, they are different. There is no collision without hashing. Only hashes collide.

HowardHinnant commented 1 year ago

So the entire point of hash_append is that the client, not this library, decides how to hash pointers (and every other type). This is done by sending the desired Hasher as the first argument to hash_append. And it is Hasher that encodes some hashing algorithm ... not this library (not hash_append).

And even more importantly, Hasher hashes multiple types (including pointers potentially) as if the multiple values formed a contiguous stream of bytes. For example, hashing two 8-byte pointers into an 8-byte hash can't be done with an identity function. Somehow Hasher does this. And it is hash_append's job to present those 16 bytes to Hasher as if they were 16 contiguous bytes, instead of 2 objects that might not even occupy contiguous memory.

This is accomplished by recognizing that hashing algorithms are composed of 3 stages:

  1. Initialization, possibly including seeding.
  2. Updating n bytes of memory into the hashing algorithm's state. n is a run-time value.
  3. Finalization of the algorithm's state into a fixed-size hash.

Steps 1 and 3 are always done once per hash. But step 2 can be done many times in the computation of a single hash. Thus step 2 can be used to input multiple types, with multiple values into a single hash.
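For example, a minimal sketch of a Hasher with this three-stage shape, using an FNV-1a-style algorithm (an illustration only, assuming a 64-bit size_t; not the proposal's reference code):

#include <cstddef>
#include <cstdint>

class fnv1a
{
    std::uint64_t state_ = 14695981039346656037ULL;  // 1. initialization (FNV offset basis)
public:
    using result_type = std::size_t;

    // 2. update: consume n bytes; may be called any number of times.
    void operator()(void const* key, std::size_t len) noexcept
    {
        auto p = static_cast<unsigned char const*>(key);
        for (std::size_t i = 0; i < len; ++i)
            state_ = (state_ ^ p[i]) * 1099511628211ULL;  // FNV prime
    }

    // 3. finalization: produce the fixed-size hash.
    explicit operator result_type() noexcept
    {
        return static_cast<result_type>(state_);
    }
};

// E.g. two pointers hashed as if they were 16 contiguous bytes:
inline std::size_t hash_two_pointers(void const* p1, void const* p2)
{
    fnv1a h;                             // stage 1
    h(&p1, sizeof(p1));                  // stage 2, first 8 bytes
    h(&p2, sizeof(p2));                  // stage 2, next 8 bytes
    return static_cast<std::size_t>(h);  // stage 3
}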

MichZia commented 1 year ago

> Pointers are unique, so there is no reason to apply a hash algorithm to them. While two (different) pointers never collide (they are unequal), their hashes MIGHT collide, so you only worsen the situation. If two pointers are equal, then any hash of them will also be equal, of course. A hash (i.e. std::hash) is also 8 bytes. Hashing a pointer is just nonsense. I have no idea what you think would be colliding. If pointers are different, they are different. There is no collision without hashing. Only hashes collide.

First, hash values are clipped by the size of the hash table, so the uniqueness of the pointer says nothing about collisions. A hash table of 7 entries will have a collision for hash values of 17 vs 24; both of them will go into cell 3.

Most hash implementations use a hash table with a power-of-2 size, because computing the modulo is much quicker (it's a bit mask), and since collision management requires a table significantly bigger than the number of entries, a power of 2 fits the need with no drawbacks.

This makes the choice of identity as the hash function for pointers and integers very bad. As pointers are most of the time multiples of 8, if not 16, due to malloc being conservative about memory alignment, using the identity as the hash function for pointers will cause 7 out of 8 (or 15 out of 16) buckets to be wasted, and often cause an average of 4 (or 8) collisions for every entry.

For integers, identity is also not good, because there is a good chance that the non-random distribution of integers modulo 2^N will cause systematic collisions.

You can only reasonably use identity as a hash if your hash table size is a large prime number, as a recurrence over a large prime is as improbable as the prime value.
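To illustrate the point, a small sketch (hypothetical addresses): with a power-of-two table and an identity "hash", 16-byte-aligned addresses only ever land in every 16th bucket:

#include <cstdint>
#include <cstdio>

int main()
{
    constexpr std::uint64_t table_size = 1024;       // power of two
    constexpr std::uint64_t mask = table_size - 1;   // modulo via bit mask
    for (std::uint64_t i = 0; i < 8; ++i)
    {
        std::uint64_t fake_ptr = 0x7f0000001000 + 16 * i;  // 16-byte-aligned "allocations"
        std::printf("ptr %#llx -> bucket %llu\n",
                    (unsigned long long)fake_ptr,
                    (unsigned long long)(fake_ptr & mask));
    }
    // Buckets printed: 0, 16, 32, 48, ... i.e. 15 out of every 16 buckets are never used.
}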

IvanPizhenko commented 1 year ago

Well, that's a good point; I didn't think about it from this point of view. So yes, we'd better apply a hash function to a pointer too. This can be done by casting it to uintptr_t and hashing that.
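For example, a minimal sketch (note that some standard library implementations define std::hash for integers as the identity, in which case an extra mixing step, e.g. multiplying by the golden-ratio constant above, may still be worthwhile):

#include <cstddef>
#include <cstdint>
#include <functional>

inline std::size_t hash_pointer(const void* p)
{
    auto v = reinterpret_cast<std::uintptr_t>(p);
    return std::hash<std::uintptr_t>{}(v);
}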

CarloWood commented 1 year ago

OK, if you need a (64-bit) hash that is suited to clipping to any (smaller) number of bits, then what you need is to spread the entropy of the input over all bits of the hash, but with an emphasis on the less significant bits, because the less significant a bit is, the larger the chance it will be used. Since pointers have very low entropy in the three or four least significant bits, a good start would be to rotate them 4 bits.

Nevertheless, this wasn't my point. The point of my first post was that it is the nature of the input and how the output is used (clipping, thus) that dictate the requirements for a hash function; only once you understand all that in mathematical terms can you design a hash function.

On the topic of pointers, here is my implementation for hash-combining them:

#include <bit>      // std::rotl (C++20)
#include <cstdint>

// To add more pointers to hash h1.
inline uint64_t pointer_hash_combine(uint64_t h1, void const* p2)
{
  // Magic numbers.
  static constexpr uint64_t m1 = 0x73b7b5e0bd014e8d;    // uint64_t h = 1; for (int n = 0; n < 5; ++n) h ^= h * 0x200000008000208c; cout << hex << h;
  static constexpr uint64_t m2 = 0x9e3779b97f4a7c15;    // https://www.wolframalpha.com/input?i=floor%282%5E64+%2F+golden+ratio%29+in+base+16

  uint64_t i2 = reinterpret_cast<uint64_t>(p2);
#if 0
  // Spread low bits over the full range of 64 bits and combine the two, putting low bits over high bits.
  return (h1 * m1) ^ std::rotl((i2 << 6) + (i2 >> 2) * m2, 32);
#else
  // This works equally well.
  return (h1 * m1) - (i2 * m2);
#endif
}

// This function is suited to calculating a 64-bit hash from two 64-bit pointers to (heap allocated) memory.
inline uint64_t pointer_hash(void const* p1, void const* p2)
{
  return pointer_hash_combine(reinterpret_cast<uint64_t>(p1), p2);
}

// This function is suited to calculate a 64-bit hash from two 64-bit pointers to (heap allocated) memory.
inline uint64_t pointer_hash(void const* p1, void const* p2)
{
  return pointer_hash_combine(reinterpret_cast<uint64_t>(p1), p2);
}

The pointer_hash_combine function at first did some rotl and shifts to emphasize that we expect most entropy to be in certain bits. But the test program, which looks at the 'randomness' of the result of combining two pointers by examining the distribution of the total number of differing bits between two such hashes, told me that simply doing (h1 * m1) - (i2 * m2) works equally well. However, this test did not take clipping into account, so perhaps that has to be re-tested.

PS This assumes that a pointer is 64-bit. So the if constexpr trick still needs to be applied.

MichZia commented 1 year ago

As described in the paper http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html, this repo is dedicated to a way of hashing any keys without the burden of choosing the hashing function and with no need for any hash_combine. Decoupling key definition and hash function is a great idea.

IvanPizhenko commented 1 year ago

hash_combine() has worked well for me so far. When hash_append appears either officially in the STL or in Boost, I'll probably give it a try...

IvanPizhenko commented 1 year ago

But the idea of hash_append itself is good.

trbabb commented 4 months ago

Just a heads up for anyone who comes to this thread looking for the 128-bit hash_combine formula: @IvanPizhenko's constant is incorrect. The correct 128-bit value for 2^128 / phi is 0x9e3779b97f4a7c15f39cc0605cedc834. This can be verified using a Python multiprecision library like mpmath.

IvanPizhenko commented 4 months ago

@trbabb Thank you for providing the 128-bit constant. My constant was specifically for 64 bit.

CarloWood commented 4 months ago

No it wasn't :P. https://github.com/HowardHinnant/hash_append/issues/7#issuecomment-371758166 You gave "128-bit (if we whenever get it): 210 306 068 529 402 873 165 736 369 884 017 287 508 or 0x9e3779b97f4a7c15f39cc0605d396154. (Written a small Java program to compute that:"

Which is equal to 0x9e3779b97f4a7c15f39cc0605cedc834 for the first 100 bits too.