dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Proposal: Add System.HashCode to make it easier to generate good hash codes. #19621

Closed jamesqo closed 4 years ago

jamesqo commented 7 years ago

Update 6/16/17: Looking for volunteers

The API shape has been finalized. However, we're still deciding on the best hash algorithm out of a list of candidates to use for the implementation, and we need someone to help us measure the throughput/distribution of each algorithm. If you'd like to take that role up, please leave a comment below and @karelz will assign this issue to you.

Update 6/13/17: Proposal accepted!

Here's the API that was approved by @terrajobst at https://github.com/dotnet/corefx/issues/14354#issuecomment-308190321:

// Will live in the core assembly
// .NET Framework : mscorlib
// .NET Core      : System.Runtime / System.Private.CoreLib
namespace System
{
    public struct HashCode
    {
        public static int Combine<T1>(T1 value1);
        public static int Combine<T1, T2>(T1 value1, T2 value2);
        public static int Combine<T1, T2, T3>(T1 value1, T2 value2, T3 value3);
        public static int Combine<T1, T2, T3, T4>(T1 value1, T2 value2, T3 value3, T4 value4);
        public static int Combine<T1, T2, T3, T4, T5>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5);
        public static int Combine<T1, T2, T3, T4, T5, T6>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7, T8>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7, T8 value8);

        public void Add<T>(T value);
        public void Add<T>(T value, IEqualityComparer<T> comparer);

        [Obsolete("Use ToHashCode to retrieve the computed hash code.", error: true)]
        [EditorBrowsable(EditorBrowsableState.Never)]
        public override int GetHashCode();

        public int ToHashCode();
    }
}

The original text of this proposal follows.

Rationale

Generating a good hash code should not require the use of ugly magic constants and bit twiddling in our code. It should be less tempting to write a bad-but-concise GetHashCode implementation such as

class Person
{
    public override int GetHashCode() => FirstName.GetHashCode() + LastName.GetHashCode();
}
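With the proposed type, the same class needs no manual mixing at all. A sketch using the Combine shape proposed below (property declarations added here for completeness):

```csharp
using System;

class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }

    // HashCode.Combine hashes each argument and mixes the results,
    // so swapping first and last names no longer collides the way
    // simple addition does.
    public override int GetHashCode() => HashCode.Combine(FirstName, LastName);
}
```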

Proposal

We should add a HashCode type to encapsulate hash code creation and avoid forcing devs to get mixed up in the messy details. Here is my proposal, which is based on https://github.com/dotnet/corefx/issues/14354#issuecomment-305019329, with a few minor revisions.

// Will live in the core assembly
// .NET Framework : mscorlib
// .NET Core      : System.Runtime / System.Private.CoreLib
namespace System
{
    public struct HashCode
    {
        public static int Combine<T1>(T1 value1);
        public static int Combine<T1, T2>(T1 value1, T2 value2);
        public static int Combine<T1, T2, T3>(T1 value1, T2 value2, T3 value3);
        public static int Combine<T1, T2, T3, T4>(T1 value1, T2 value2, T3 value3, T4 value4);
        public static int Combine<T1, T2, T3, T4, T5>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5);
        public static int Combine<T1, T2, T3, T4, T5, T6>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7, T8>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7, T8 value8);

        public void Add<T>(T value);
        public void Add<T>(T value, IEqualityComparer<T> comparer);
        public void AddRange<T>(T[] values);
        public void AddRange<T>(T[] values, int index, int count);
        public void AddRange<T>(T[] values, int index, int count, IEqualityComparer<T> comparer);

        [Obsolete("Use ToHashCode to retrieve the computed hash code.", error: true)]
        public override int GetHashCode();

        public int ToHashCode();
    }
}

Remarks

See @terrajobst's comment at https://github.com/dotnet/corefx/issues/14354#issuecomment-305019329 for the goals of this API; all of his remarks are valid. I would like to point out these ones in particular, however:

morganbr commented 7 years ago

The criteria that would help us choose algorithms would be:

  1. Does the algorithm have a good avalanche effect? That is, does every bit of input have a 50% chance of flipping every bit of output? This site has a study of several popular algorithms.
  2. Is the algorithm fast for small inputs? Since HashCode.Combine will generally handle 8 or fewer ints, startup time may be more important than throughput. This site has an interesting set of data to start with. This is also where we may need different answers for different architectures or other pivots (OS, AoT vs JIT, etc).

What we'd really like to see is performance numbers for candidates written in C# so that we can be reasonably confident their characteristics will hold up for .NET. If you write a candidate and we don't pick it for this, that will still be useful work whenever I actually get the API proposal together for the non-cryptographic hash API.

Here are some candidates I think are worth evaluating (but feel free to propose others):

benaadams commented 7 years ago

Shame the Add methods can't have a return type of ref HashCode and return ref this so they can be used in a fluent manner.

Would readonly ref returns allow this? /cc @jaredpar @VSadov

karelz commented 7 years ago

WARNING: If anyone picks hash implementation from existing code base somewhere on the internet, please keep the link to the source and check the license (we will have to do it as well).

If the license is not compatible, we may need to write the algorithm from scratch.

CyrusNajmabadi commented 7 years ago

IMO, using the Add methods should be exceedingly uncommon. It will be for very advanced scenarios, and the need to be able to be 'fluent' will not really be there.

For the common use cases for 99% of all user code cases, one should be able to just use => HashCode.Combine(...) and be fine.

terrajobst commented 7 years ago

@morganbr

we might also want public static int Combine<T1>(T1 value);. I know it looks a little funny, but it would provide a way of diffusing bits from something with a limited input hash space

Makes sense. I've added it.

@justinvp

Nit: The StringComparison parameter should be named comparisonType to match the naming used everywhere else StringComparison is used as a parameter.

Fixed.

terrajobst commented 7 years ago

@CyrusNajmabadi

IMO, using the Add methods should be exceedingly uncommon. It will be for very advanced scenarios, and the need to be able to be 'fluent' will not really be there.

Agreed.

VSadov commented 7 years ago

@benaadams - re: ref returning this from Add - no, this cannot be returned by ref in struct methods since it can be an rValue or a temp.

ref var r = ref (new T()).ReturnsRefThis();

// r refers to some variable here. Which one? What is the scope/lifetime?
r = SomethingElse();

bgrainger commented 7 years ago

In case it's useful for comparison purposes, some years ago I ported the Jenkins lookup3 hash function (C source) to C# here.

svick commented 7 years ago

I'm wondering about collections:

@terrajobst

public void Add<T>(T[] value);

Why is there an overload for arrays, but not one for general collections (i.e. IEnumerable<T>)?

Also, isn't it going to be confusing that HashCode.Combine(array) and hashCode.Add((object)array) behave one way (use reference equality) and hashCode.Add(array) behaves another way (combines hash codes of the values in the array)?

@CyrusNajmabadi

For the common use cases for 99% of all user code cases, one should be able to just use => HashCode.Combine(...) and be fine.

If the aim is really to be able to use Combine in 99 % of use cases (and not, say, 80 %), then shouldn't Combine somehow support hashing collections based on the values in the collection? Maybe there should be a separate method that does that (either an extension method or static method on HashCode)?
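Such a helper is easy to express on top of the proposed builder shape. A hypothetical extension (the name CombineElements is my invention, not part of the proposal):

```csharp
using System;
using System.Collections.Generic;

static class HashCodeExtensions
{
    // Combines the hash codes of a sequence's elements, so two
    // collections with equal contents in equal order produce the
    // same hash within a process.
    public static int CombineElements<T>(IEnumerable<T> items)
    {
        var hc = new HashCode();
        foreach (var item in items)
            hc.Add(item);
        return hc.ToHashCode();
    }
}
```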

morganbr commented 7 years ago

If Add is a power scenario, should we assume the user should choose between Object.GetHashCode and combining individual elements of collections? If it would help, we could consider renaming the array (and potential IEnumerable) versions. Something like:

public void AddEnumerableHashes<T>(IEnumerable<T> enumerable);
public void AddEnumerableHashes<T>(T[] array);
public void AddEnumerableHashes<T>(T[] array, int index, int length);

I wonder whether we'd also need overloads with IEqualityComparers.

SLaks commented 7 years ago

Proposal: Make the builder struct implement IEnumerable to support collection initializer syntax:

return new HashCode {
    SomeField,
    OtherField,
    { SomeString, StringComparer.UTF8 },
    { SomeHashSet, HashSet<int>.CreateSetComparer() }
}.GetHashCode();

This is much more elegant than calling Add() by hand (in particular, you don't need a temporary variable), and still has no allocations.


svick commented 7 years ago

@SLaks Maybe that nicer syntax could wait for https://github.com/dotnet/csharplang/issues/455 (assuming that proposal had support), so that HashCode wouldn't have to implement bogus IEnumerable?

justinvp commented 7 years ago

We decided to not override GetHashCode() to produce the hash code as this would be weird, both naming-wise as well as from a behavioral standpoint (GetHashCode() should return the object's hash code, not the one being computed).

I find it strange that GetHashCode isn't going to return the computed hashcode. I think this is going to confuse developers. For example, @SLaks already used it in his proposal instead of using ToHashCode.

svick commented 7 years ago

@justinvp If GetHashCode() is not going to return the computed hash code, it should probably be marked [Obsolete] and [EditorBrowsable(Never)].

On the other hand, I don't see the harm in it returning the computed hash code.

@terrajobst

We decided to not override GetHashCode() to produce the hash code as this would be weird, both naming-wise as well as from a behavioral standpoint (GetHashCode() should return the object's hash code, not the one being computed).

Yes, GetHashCode() should return the object's hash code, but is there any reason why the two hash codes should be different? It's still correct, since two instances of HashCode with the same internal state will return the same value from GetHashCode().

jamesqo commented 7 years ago

@terrajobst I just saw your comment. Forgive me for the delayed reply, I was slow to look into the notification because I thought it would only be more back-and-forth that was going nowhere. Glad to see that's not the case! :tada:

I'd be delighted to pick this up and do the throughput/distribution measuring (I assume that's what you meant by "interested in working on this area"). Give me a second to finish reading through all of the comments here, though.

jamesqo commented 7 years ago

@terrajobst

Can we change

public void Add<T>(T[] value);
public void Add<T>(T[] value, int index, int length);
public void Add(byte[] value);
public void Add(byte[] value, int index, int length);

to

public void AddRange<T>(T[] values);
public void AddRange<T>(T[] values, int index, int count);
public void AddRange<T>(T[] values, int index, int count, IEqualityComparer<T> comparer);

? I renamed Add -> AddRange to avoid the behavior @svick mentioned. I removed the byte overloads since we can specialize using typeof(T) == typeof(byte) inside the method if we need to do anything byte-specific. Also, I changed value -> values and length -> count. It also makes sense to have a comparer overload.
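The typeof(T) specialization mentioned here relies on the JIT treating typeof(T) == typeof(byte) as a compile-time constant for value-type instantiations. A minimal sketch of the pattern; HashBuilder and its multiply-by-31 mixing are illustrative stand-ins, not the proposed implementation:

```csharp
using System;

struct HashBuilder
{
    private int _state;

    public void Add<T>(T value) =>
        _state = unchecked(_state * 31 + (value?.GetHashCode() ?? 0));

    public void AddRange<T>(T[] values)
    {
        // For value types the JIT folds this check to a constant per
        // instantiation, so the non-byte path pays no runtime test.
        if (typeof(T) == typeof(byte))
        {
            foreach (var b in (byte[])(object)values) // no-op cast on the fast path
                _state = unchecked(_state * 31 + b);  // byte-specialized mixing
            return;
        }
        foreach (var value in values)
            Add(value);
    }

    public int ToHashCode() => _state;
}
```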

jamesqo commented 7 years ago

@terrajobst Can you remind me why

        public void Add(string value);
        public void Add(string value, StringComparison comparisonType);

is necessary when we have

        public void Add<T>(T value);
        public void Add<T>(T value, IEqualityComparer<T> comparer);

?

jamesqo commented 7 years ago

@svick

@justinvp If GetHashCode() is not going to return the computed hash code, it should probably be marked [Obsolete] and [EditorBrowsable(Never)].

:+1:

@terrajobst Can we go back to having an implicit conversion from HashCode -> int, so no ToHashCode method? edit: ToHashCode is fine. See @CyrusNajmabadi's response below.

SLaks commented 7 years ago

@jamesqo StringComparison is an enum. However, people could use the equivalent StringComparer instead.

CyrusNajmabadi commented 7 years ago

Can we go back to having an implicit conversion from HashCode -> int, so no ToHashCode method?

We discussed this and decided against it in the meeting. The issue is that when the user gets the final 'int' that extra work is often done. i.e. the internals of the hashcode will often do a finalization step, and may reset itself to a fresh state. Having that happen with an implicit conversion would be strange. If you did this:

HashCode hc = ...

int i1 = hc;
int i2 = hc;

Then you could get different results.

For that reason, we also don't like the explicit conversion (as people don't think of conversions as changing internal state).

With a method we can document explicitly that this is happening. We can even potentially name it to convey as much, i.e. "ToHashCodeAndReset" (though we decided against that). But at least the method can have clear documentation on it that the user can see in things like IntelliSense. That's not really the case with conversions.

CyrusNajmabadi commented 7 years ago

I removed the byte overloads since we can specialize using typeof(T) == typeof(byte)

IIRC there was some concern about this not being ok from the JIT perspective. But that may have only been for the non-value-type "typeof()" cases. As long as the jit will effectively do the right thing for the value-type typeof() cases, then that should be good.

jamesqo commented 7 years ago

@CyrusNajmabadi I was unaware that conversion to an int might involve mutating state. ToHashCode it is then.

blowdart commented 7 years ago

For those thinking about the crypto perspective - http://tuprints.ulb.tu-darmstadt.de/2094/1/thesis.lehmann.pdf

jamesqo commented 7 years ago

@terrajobst, have you had time to read my comments (starting from here) and decide whether you approve of the tweaked API shape? If so, then I think this can be marked api-approved/up for grabs and we can start deciding on a hash algorithm.

morganbr commented 7 years ago

@blowdart , any particular part of that you'd like to highlight?

I might not have been too explicit about it above, but the only non-cryptographic hashes I don't know of HashDoS breaks in are Marvin and SipHash. That is, even seeding (say) Murmur with a random value can still be broken and used for a DoS.

blowdart commented 7 years ago

None, I just found it interesting, and I'm thinking the docs for this ought to say "Not for use on hash codes which are generated via cryptographic algorithms."

terrajobst commented 7 years ago

Decisions

This leaves us with:

// Will live in the core assembly
// .NET Framework : mscorlib
// .NET Core      : System.Runtime / System.Private.CoreLib
namespace System
{
    public struct HashCode
    {
        public static int Combine<T1>(T1 value1);
        public static int Combine<T1, T2>(T1 value1, T2 value2);
        public static int Combine<T1, T2, T3>(T1 value1, T2 value2, T3 value3);
        public static int Combine<T1, T2, T3, T4>(T1 value1, T2 value2, T3 value3, T4 value4);
        public static int Combine<T1, T2, T3, T4, T5>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5);
        public static int Combine<T1, T2, T3, T4, T5, T6>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7, T8>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7, T8 value8);

        public void Add<T>(T value);
        public void Add<T>(T value, IEqualityComparer<T> comparer);

        [Obsolete("Use ToHashCode to retrieve the computed hash code.", error: true)]
        [EditorBrowsable(EditorBrowsableState.Never)]
        public override int GetHashCode();

        public int ToHashCode();
    }
}

karelz commented 7 years ago

Next steps: The issue is up for grabs -- to implement the API, we need several candidate algorithms as experiments (see https://github.com/dotnet/corefx/issues/14354#issuecomment-305028686 for the list), so that we can decide which algorithm to take based on throughput and distribution measurements (likely a different answer per CPU architecture).

Complexity: Large

If anyone is interested in picking it up, please ping us. There might be even room for several people working on it together. (@jamesqo you have priority choice as you invested most & longest into the issue)

jamesqo commented 7 years ago

@karelz Despite my comment above, I have changed my mind because I do not think I have the qualifications to choose the best hash algorithm. I looked into some of the libraries @morganbr listed and realized the implementation is quite complex, so I can't easily translate it to C# to test for myself. I have little background in C++, so I would also have a hard time just installing the library and writing a test app.

I do not want this to remain on the up-for-grabs list forever, though. If no one takes it up a week from today, I will consider posting a question on Programmers SE or Reddit.

tannergooding commented 7 years ago

I've not benched it (or otherwise optimized it), but here is a basic implementation of the Murmur3 hash algorithm that I use in several of my personal projects: https://gist.github.com/tannergooding/0a12559d1a912068b9aeb4b9586aad7f

tannergooding commented 7 years ago

I feel like the most optimal solution here is going to be to dynamically change the hashing algorithm based on the size of the input data.

Ex: Murmur3 (and others) are very fast for large sets of data and provide great distribution, but they can perform 'poorly' (speed-wise, not distribution-wise) for smaller sets of data.

I imagine we should do something like: If overall byte count is less than X, do algorithm A; otherwise, do algorithm B. This will still be deterministic (per run), but will allow us to provide speed and distribution based on the actual size of the input data.
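The dispatch described above might look like the following sketch. The threshold, the stand-in algorithms (byte-at-a-time FNV-1a for short inputs, a word-at-a-time Murmur3-style mix for longer ones), and all names are illustrative assumptions, not a proposed implementation:

```csharp
using System;

static class AdaptiveHash
{
    const int Threshold = 32; // illustrative cutoff; a real value would come from measurement

    public static int Compute(ReadOnlySpan<byte> data) =>
        data.Length < Threshold ? SmallHash(data) : LargeHash(data);

    // Cheap-to-start FNV-1a: minimal setup cost for short inputs.
    static int SmallHash(ReadOnlySpan<byte> data)
    {
        uint h = 2166136261;
        foreach (byte b in data)
            h = (h ^ b) * 16777619;
        return (int)h;
    }

    // Word-at-a-time mixing (a Murmur3-style block step) as a stand-in
    // for a high-throughput algorithm on longer inputs.
    static int LargeHash(ReadOnlySpan<byte> data)
    {
        uint h = 0;
        while (data.Length >= 4)
        {
            uint k = BitConverter.ToUInt32(data);
            k *= 0xcc9e2d51; k = (k << 15) | (k >> 17); k *= 0x1b873593; // block mix
            h ^= k; h = (h << 13) | (h >> 19); h = h * 5 + 0xe6546b64;
            data = data.Slice(4);
        }
        foreach (byte b in data) h = (h ^ b) * 16777619; // tail bytes
        h ^= h >> 16; h *= 0x85ebca6b; h ^= h >> 13; h *= 0xc2b2ae35; h ^= h >> 16; // finalize
        return (int)h;
    }
}
```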

tannergooding commented 7 years ago

It's probably also worth noting that several of the algorithms mentioned have implementations specifically designed for SIMD instructions, so the most performant solution would likely involve an FCALL at some level (like is done with some of the BufferCopy implementations) or may involve taking a dependency on System.Numerics.Vector.

morganbr commented 7 years ago

@jamesqo, we're happy to help out with making choices; what we need the most help with is performance data for candidate implementations (ideally C#, though as @tannergooding points out, some algorithms need special compiler support). As I mentioned above, if you build a candidate that isn't chosen, we'll probably use it later, so don't worry about work being wasted.

I know there are benchmarks out there for various implementations, but I think it's important to have a comparison using this API and a likely range of inputs (e.g. structs with 1-10 fields).

morganbr commented 7 years ago

@tannergooding, that kind of adaptability might be most performant, but I don't see how it would work with the Add method since it doesn't know how many times it'll get called. While we could do it with Combine, that would mean a series of Add calls could produce a different result than the corresponding Combine call.

Also, given that the most likely range of inputs is 4-32 bytes (Combine`1-Combine`8), hopefully there aren't big performance changes over that range.

tannergooding commented 7 years ago

that kind of adaptability might be most performant, but I don't see how it would work with the Add method since it doesn't know how many times it'll get called.

I'm not personally convinced the API shape is quite right for general purpose hashing (it is close however)...

Currently we are exposing Combine methods for static construction. If these are meant to combine all inputs and produce a finalized hash code, then the name is 'poor' and something like Compute might be more appropriate.

If we are exposing Combine methods, they should just mix all inputs, and users should be required to call a Finalize method which takes the output from the last combine as well as the total number of bytes that were combined to produce a finalized hash code (finalizing a hash code is important as it is what causes the bits to avalanche).

For the builder pattern, we are exposing an Add and ToHashCode method. It is not clear whether the Add method is meant to store the bytes and only combine/finalize on the call to ToHashCode (in which case we can pick the correct algorithm dynamically) or whether they are meant to be combined on the fly; either way, it should be made clear which is the case (and if the latter, the implementation should be internally tracking the total size of the bytes combined).

morganbr commented 7 years ago

For anyone looking for a less complicated starting point, try xxHash32. That's likely to translate pretty easily to C# (people have done it).
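For a sense of what that translation involves, here is a sketch of xxHash32's core applied to a sequence of int hash codes. It is simplified: it always uses xxHash32's small-input path with seed 0, whereas the real algorithm switches to four parallel accumulators once input reaches 16 bytes, so this is an illustration of the structure, not a conformant implementation:

```csharp
using System;

static class XxHash32Combiner
{
    // The five xxHash32 prime constants.
    const uint Prime1 = 2654435761, Prime2 = 2246822519,
               Prime3 = 3266489917, Prime4 = 668265263, Prime5 = 374761393;

    static uint Rotl(uint v, int r) => (v << r) | (v >> (32 - r));

    public static int Combine(params int[] values)
    {
        uint h = Prime5 + (uint)(values.Length * 4); // seed 0 + input length in bytes

        // Mix each 4-byte lane into the running state.
        foreach (int value in values)
        {
            h += (uint)value * Prime3;
            h = Rotl(h, 17) * Prime4;
        }

        // Final avalanche so every input bit can affect every output bit.
        h ^= h >> 15; h *= Prime2;
        h ^= h >> 13; h *= Prime3;
        h ^= h >> 16;
        return (int)h;
    }
}
```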

tannergooding commented 7 years ago

Still testing locally, but I am seeing the following throughput rates for my C# implementation of Murmur3.

These are for the static Combine methods for 1-8 inputs:

1070.18 MB/s
1511.49 MB/s
1674.89 MB/s
1957.65 MB/s
2083.24 MB/s
2140.94 MB/s
2190.27 MB/s
2245.53 MB/s

My implementation assumes that GetHashCode should be called for each input and that the computed value should be finalized before being returned.

I combined int values, as they are the simplest to test.

To compute the throughput, I ran 10,001 iterations, throwing away the first iteration as the 'warmup' run.

In each iteration, I run 10,000 sub-iterations where I call HashCode.Combine, passing the result of the previous sub-iteration in as the first input value in the next iteration.

I then average all iterations to get the average elapsed time, further divide that by the number of sub-iterations run per loop to get the average time per call. I then compute the number of calls that can be made per-second and multiply that by the number of bytes combined to compute the actual throughput.

Will cleanup the code and share it in a bit.
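The measurement loop described above can be sketched as follows; the names and parameterization are my paraphrase of the description, not the actual benchmark code:

```csharp
using System;
using System.Diagnostics;

static class Benchmarks
{
    // Defaults mirror the description: 10,000 measured iterations plus one
    // discarded warmup, 10,000 sub-iterations each, 8 bytes (two ints) per call.
    // Feeding each result back in as the next input keeps the dependency
    // chain honest and defeats dead-code elimination.
    public static double MeasureThroughputMbPerSec(
        Func<int, int, int> combine,
        int iterations = 10_000,
        int subIterations = 10_000,
        int bytesPerCall = 8)
    {
        double totalMs = 0;
        int result = 0;
        for (int i = 0; i <= iterations; i++) // one extra pass: the warmup
        {
            var sw = Stopwatch.StartNew();
            for (int j = 0; j < subIterations; j++)
                result = combine(result, j); // previous result feeds the next call
            sw.Stop();
            if (i > 0) totalMs += sw.Elapsed.TotalMilliseconds; // discard warmup
        }

        double avgMsPerCall = totalMs / iterations / subIterations;
        double callsPerSecond = 1000.0 / avgMsPerCall;
        return callsPerSecond * bytesPerCall / (1024 * 1024);
    }
}
```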

morganbr commented 7 years ago

@tannergooding, that sounds like great progress. To make sure you're getting the right measurements, the intent of the API is that a call to HashCode.Combine(a, b) is equivalent to calling

HashCode hc = new HashCode();
hc.Add(a); // Initializes the hash state, calls a.GetHashCode() and feeds the result into the hash state
hc.Add(b); // Calls b.GetHashCode() and feeds the result into the hash state
return hc.ToHashCode(); // Finalizes the hash state, truncates it to an int, resets the internal state and returns the int

In both cases, the data should get fed into the same internal hash state and the hash should be finalized once at the end.

tannergooding commented 7 years ago

👍

That is effectively what the code I wrote is doing. The only difference is that I effectively inline all the code (there is no need to allocate new HashCode() and track the number of bytes combined since it is constant).

tannergooding commented 7 years ago

@morganbr. Implementation + Throughput test for Murmur3: https://gist.github.com/tannergooding/89bd72f05ab772bfe5ad3a03d6493650

tannergooding commented 7 years ago

MurmurHash3 is based off the algorithm described here: https://github.com/aappleby/smhasher/wiki/MurmurHash3, repo says it is MIT

Working on xxHash32 (BSD-2 Clause -- https://github.com/Cyan4973/xxHash/blob/dev/xxhash.c) and SpookyHash (Public Domain -- http://www.burtleburtle.net/bob/hash/spooky.html) variants

jamesqo commented 7 years ago

@tannergooding Again, not a hash expert, but I remembered reading an article that said Murmur was not DoS-resistant, so just pointing that out before we choose that.

tannergooding commented 7 years ago

@jamesqo, I might be wrong, but I'm fairly certain that vulnerability applied to Murmur2 and not Murmur3.

In either case, I am implementing several algorithms so that we can get throughput results for C#. The distribution and other properties of these algorithms are fairly well known so we can pick and choose which is the best later 😄

jamesqo commented 7 years ago

Whoops, forgot to link to the article: http://emboss.github.io/blog/2012/12/14/breaking-murmur-hash-flooding-dos-reloaded/.

@tannergooding OK. Sounds fair :+1:

morganbr commented 7 years ago

@tannergooding, I took a look at your Murmur3 implementation and it generally looks right and probably pretty well optimized. To make sure I understand correctly, are you using the fact that combinedValue and the internal state of Murmur are both 32 bits? That's probably a pretty good optimization for this case and explains some of my earlier confusion.

If we were to adopt it, it might need a couple tweaks (they probably won't make a big difference to perf measurements though):

jnm2 commented 7 years ago

In the meantime while I'm longing for this API, how bad is it for me to be implementing GetHashCode via (field1, field2, field3).GetHashCode()?

morganbr commented 7 years ago

@jnm2 , the ValueTuple hash code combiner tends to put your inputs in order in the hash code (and discard the least recent ones). For a couple of fields and a hash table that divides by a prime number, you might not notice. For a lot of fields or a hash table that divides by a power of two, the entropy of the last field you insert will have the most influence over whether you have collisions (e.g. if your last field is a bool or a small int, you'll probably have a lot of collisions, if it's a guid, you probably won't).

tannergooding commented 7 years ago

ValueTuple also doesn't work well with fields that are all 0.

tannergooding commented 7 years ago

On a side note, I had to stop working on other implementations (have higher priority work). Not sure when I'll be able to pick it back up.

jnm2 commented 7 years ago

So if that's not good enough for a structured type, why is it good enough for a tuple?