dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Poor hash function in .Net source #14480

Closed: knazeri closed this issue 4 years ago

knazeri commented 9 years ago

The CombineHashCodes method in System.Numerics.HashCodeHelper has a very high collision rate for uniformly distributed data! The method probably first appeared in the System.Web.Util.HashCodeCombiner class and was later used in System.Array, System.Tuple, and dotnet/corefx. Here's an example:

Assuming the CombineHashCodes method's implementation is:

internal static int CombineHashCodes(int h1, int h2)
{
    return (((h1 << 5) + h1) ^ h2);
}

100K records of raw data produce more than 6K duplicate hash values, that is, 6% of a uniformly distributed set:

var data = Enumerable.Range(0, 100000)
    .Select(t => new
    {
        Val1 = t,
        Val2 = 100000 + t
    })
    .GroupBy(t => CombineHashCodes(t.Val1.GetHashCode(), t.Val2.GetHashCode()))
    .Where(t => t.Count() > 1)
    .Select(t => new { Hash = t.Key, Items = t.ToList() })
    .ToList();

I believe there are much better hash functions to use, like the one Josh Bloch suggests in Effective Java, which results in no hash collisions even for 1M records of data in the example above:

internal static int CombineHashCodes(int h1, int h2)
{
    return (17 * 31 + h1) * 31 + h2;
}
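
For reference, the grouping query above can be reused to count colliding groups for this combiner; a quick sketch, assuming int.GetHashCode() returns the value itself (as it does for System.Int32):

// Counting colliding groups for the Bloch-style combiner over 1M pairs,
// reusing the shape of the query above (a quick check, not a benchmark).
int collisions = Enumerable.Range(0, 1000000)
    .GroupBy(t => CombineHashCodes(t.GetHashCode(), (1000000 + t).GetHashCode()))
    .Count(g => g.Count() > 1);
// For this input set the count comes out to zero, matching the claim above.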
ghost commented 9 years ago

:+1:

redknightlois commented 9 years ago

While a bit more involved, and potentially more costly, Thomas Wang's 64-bit to 32-bit hash sounds like what you are after. If you want to combine 2 ints, you load them into a long using a shift (<< 32) and then apply the downscaling hash function.

public int Hash64to32(long key)
{
  // Operate on the unsigned representation so the right shifts are logical,
  // as Wang's original assumes unsigned 64-bit arithmetic.
  ulong k = (ulong)key;
  k = (~k) + (k << 18);
  k = k ^ (k >> 31);
  k = k * 21;
  k = k ^ (k >> 11);
  k = k + (k << 6);
  k = k ^ (k >> 22);
  return (int)k;
}

Shamelessly copied from: https://gist.github.com/badboy/6267743
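
For example, combining two int hash codes h1 and h2 as described above might look like this (a sketch; the variable names are illustrative):

// Packing two 32-bit hash codes into one 64-bit key before downscaling.
// The (uint) cast keeps h2 from sign-extending into the high 32 bits.
long packed = ((long)h1 << 32) | (uint)h2;
int combined = Hash64to32(packed);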

What would make sense, though, is for the JIT to define an intrinsic for this with the fastest code available on the platform.

DenisIstomin commented 8 years ago

I would like to fix this issue. Here is what I found:

  1. The current algorithm, ((h1 << 5) + h1) ^ h2, is known as the Bernstein (djb) hash function.
  2. Another version (described here), "djb2" with the "magic constant" 5381, gives worse results on the test data described above.
  3. The OpenJDK version (in C#: unchecked(31 * h1 + h2)) works much better, and I think it is the right choice; a C# sketch of the pairwise form follows the Java snippet below:
public int hashCode() {
    int hashCode = 1;
    for (E e : this)
        hashCode = 31 * hashCode + (e == null ? 0 : e.hashCode());
    return hashCode;
}
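
A minimal sketch of that pairwise form in C#, reusing the CombineHashCodes name purely for illustration (not the merged change):

// OpenJDK-style combiner expressed as a pairwise method; unchecked lets the
// multiplication wrap instead of throwing under a checked compilation.
internal static int CombineHashCodes(int h1, int h2)
{
    return unchecked(31 * h1 + h2);
}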

Performance is not very different (but I measured a Debug build).

Summary: I think the OpenJDK version of the algorithm is the best candidate, but more performance/functional tests are needed.

@ellismg, @stephentoub, @joshfree, do I need to go further and make a PR?

knazeri commented 8 years ago

The CombineHash method in System.Numerics is very fast and it has a very low collision rate. From the labels assigned to the post, I reckon that would be the accepted PR.

DenisIstomin commented 8 years ago

Compared with the function recently changed here, the function in my PR works faster. Measured with "Microsoft.DotNet.xunit.performance" on a Release build. I don't know why it's faster.

cc @VSadov

jamesqo commented 8 years ago

@KamyarNazeri

The CombineHash method in System.Numerics is very fast and it has a very low collision rate

Really? It looks like for small numbers it will be equivalent to just (h1 << 7) ^ h2. This is going to be easier to create collisions with since it's just multiplying by a power of 2.

knazeri commented 8 years ago

@jamesqo

it's just multiplying by a power of 2

Correct, for numbers less than 2^25 it is basically a simple XOR: (h1 * 128) ^ h2. However, the collision rate is actually very low. Beats me; I tested it against the FNV, JSW, and Bernstein hash algorithms on uniformly distributed data and it does better than all of them. There are probably better test scenarios, but here's a quick sample:

int count = 10000000;
Random rnd = new Random();
double distinct = new HashSet<int>(Enumerable.Range(0, count).Select(t => (t << 7) ^ rnd.Next())).Count;
double rate = ((count - distinct) / count) * 100;

For 10,000,000 records of uniformly distributed integers, the collision rate is somewhere between 0.0001% and 0.001%.

jamesqo commented 8 years ago

@KamyarNazeri I'm still a bit skeptical. For example, if all of the inputs to that function are 0, then no matter how many 0s you feed it, the output will still be 0. Similarly, if h1 is 0 then the result will be h2.

So the following will produce collisions, assuming that was the hash algorithm used for Tuple:

var dictionary = new Dictionary<object, string>();

// Every component hashes to 0, so both tuples collide at hash 0 under that combiner.
dictionary[Tuple.Create<object, object>(null, null)] = "foo";
dictionary[Tuple.Create<object, object, object>(null, null, null)] = "foo";

int age = 1;
dictionary[Tuple.Create(0, age)] = "bar";
dictionary[age] = "bar";

In fact, the algorithm I recently submitted (linked to by @DenisIstomin) suffers from this same problem too. Maybe it would be best to change it to something like this formula: it makes use of a magic constant, so the risk of a special number causing patterns in the hash goes down dramatically. It's apparently used in Boost, and there is a proposal to add it to the C++ stdlib.
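
For illustration, a Boost-style hash_combine adapted to int hash codes might look like the sketch below (the 0x9E3779B9 constant and the << 6 / >> 2 shifts follow Boost's hash_combine; this is an approximation, not a quote of the linked formula):

// Boost-style hash_combine for two 32-bit hash codes. The "golden ratio"
// constant 0x9E3779B9 breaks up patterns such as all-zero or equal inputs;
// unchecked lets the arithmetic wrap rather than throw.
internal static int CombineHashCodes(int h1, int h2)
{
    unchecked
    {
        uint seed = (uint)h1;
        seed ^= (uint)h2 + 0x9E3779B9u + (seed << 6) + (seed >> 2);
        return (int)seed;
    }
}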