The criteria that would help us choose algorithms would be:
What we'd really like to see is performance numbers for candidates written in C# so that we can be reasonably confident their characteristics will hold up for .NET. If you write a candidate and we don't pick it for this, that will still be useful work whenever I actually get the API proposal together for the non-cryptographic hash API.
Here are some candidates I think are worth evaluating (but feel free to propose others):
Shame the `Add` methods can't have a return type of `ref HashCode` and return `ref this` so they can be used in a fluent manner.
Would `readonly ref` returns allow this? /cc @jaredpar @VSadov
WARNING: If anyone picks a hash implementation from an existing code base somewhere on the internet, please keep the link to the source and check the license (we will have to do it as well).
If the license is not compatible, we may need to write the algorithm from scratch.
IMO, using the Add methods should be exceedingly uncommon. It will be for very advanced scenarios, and the need to be able to be 'fluent' will not really be there.
For the common use cases (99% of all user code cases), one should be able to just use `=> HashCode.Combine(...)` and be fine.
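To illustrate, the common case would look something like this (a sketch assuming the proposed `HashCode.Combine` overloads; the `Person` type is made up):

```csharp
// Sketch: the 99% case under the proposed API. Person is an example type.
public sealed class Person
{
    public string FirstName { get; }
    public string LastName { get; }
    public int Age { get; }

    public Person(string firstName, string lastName, int age)
    {
        FirstName = firstName;
        LastName = lastName;
        Age = age;
    }

    // One expression, no magic constants or manual bit twiddling.
    public override int GetHashCode() => HashCode.Combine(FirstName, LastName, Age);
}
```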
@morganbr
we might also want `public static int Combine<T1>(T1 value);`. I know it looks a little funny, but it would provide a way of diffusing bits from something with a limited input hash space.
Makes sense. I've added it.
@justinvp
Nit: The `StringComparison` parameter should be named `comparisonType` to match the naming used everywhere else `StringComparison` is used as a parameter.
Fixed.
@CyrusNajmabadi
IMO, using the `Add` methods should be exceedingly uncommon. It will be for very advanced scenarios, and the need to be able to be 'fluent' will not really be there.
Agreed.
@benaadams - re: ref returning `this` from `Add` - no, `this` cannot be returned by ref in struct methods since it can be an rValue or a temp.
ref var r = (new T()).ReturnsRefThis();
// r refers to some variable here. Which one? What is the scope/lifetime?
r = SomethingElse();
In case it's useful for comparison purposes, some years ago I ported the Jenkins lookup3 hash function (C source) to C# here.
I'm wondering about collections:
@terrajobst
public void Add<T>(T[] value);
Why is there an overload for arrays, but not one for general collections (i.e. `IEnumerable<T>`)?

Also, isn't it going to be confusing that `HashCode.Combine(array)` and `hashCode.Add((object)array)` behave one way (use reference equality) and `hashCode.Add(array)` behaves another way (combines hash codes of the values in the array)?
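To make the asymmetry concrete (hypothetical, since it depends on the proposed overloads existing):

```csharp
// Hypothetical: how the proposed overloads would resolve.
int[] array = { 1, 2, 3 };
var hc = new HashCode();

hc.Add((object)array); // binds to Add<T>(T): hashes the array *reference*
hc.Add(array);         // binds to Add<T>(T[]): hashes the *elements* 1, 2, 3
```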
@CyrusNajmabadi
For the common use cases (99% of all user code cases), one should be able to just use `=> HashCode.Combine(...)` and be fine.
If the aim is really to be able to use `Combine` in 99% of use cases (and not, say, 80%), then shouldn't `Combine` somehow support hashing collections based on the values in the collection? Maybe there should be a separate method that does that (either an extension method or a static method on `HashCode`)?
If `Add` is a power scenario, should we assume the user should choose between `Object.GetHashCode` and combining individual elements of collections? If it would help, we could consider renaming the array (and potential `IEnumerable`) versions. Something like:
public void AddEnumerableHashes<T>(IEnumerable<T> enumerable);
public void AddEnumerableHashes<T>(T[] array);
public void AddEnumerableHashes<T>(T[] array, int index, int length);
I wonder whether we'd also need overloads with IEqualityComparers.
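A sketch of what the comparer overload might look like (the method name comes from the suggestion above; the body is illustrative):

```csharp
// Illustrative sketch of the suggested overload with an IEqualityComparer.
public void AddEnumerableHashes<T>(IEnumerable<T> enumerable, IEqualityComparer<T> comparer)
{
    foreach (T item in enumerable)
        Add(item == null ? 0 : comparer.GetHashCode(item));
}
```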
Proposal: Make the builder struct implement `IEnumerable` to support collection initializer syntax:
return new HashCode {
SomeField,
OtherField,
{ SomeString, StringComparer.UTF8 },
{ SomeHashSet, HashSet<int>.CreateSetComparer() }
}.GetHashCode()
This is much more elegant than calling `Add()` by hand (in particular, you don't need a temporary variable), and still has no allocations.
@SLaks Maybe that nicer syntax could wait for https://github.com/dotnet/csharplang/issues/455 (assuming that proposal had support), so that `HashCode` wouldn't have to implement bogus `IEnumerable`?
We decided to not override GetHashCode() to produce the hash code as this would be weird, both naming-wise as well as from a behavioral standpoint (GetHashCode() should return the object's hash code, not the one being computed).
I find it strange that `GetHashCode` isn't going to return the computed hash code. I think this is going to confuse developers. For example, @SLaks already used it in his proposal instead of using `ToHashCode`.
@justinvp If `GetHashCode()` is not going to return the computed hash code, it should probably be marked `[Obsolete]` and `[EditorBrowsable(Never)]`.

On the other hand, I don't see the harm in it returning the computed hash code.
@terrajobst
We decided to not override `GetHashCode()` to produce the hash code as this would be weird, both naming-wise as well as from a behavioral standpoint (`GetHashCode()` should return the object's hash code, not the one being computed).
Yes, `GetHashCode()` should return the object's hash code, but is there any reason why the two hash codes should be different? It's still correct, since two instances of `HashCode` with the same internal state will return the same value from `GetHashCode()`.
@terrajobst I just saw your comment. Forgive me for the delayed reply, I was slow to look into the notification because I thought it would only be more back-and-forth that was going nowhere. Glad to see that's not the case! :tada:
I'd be delighted to pick this up and do the throughput/distribution measuring (I assume that's what you meant by "interested in working on this area"). Give me a second to finish reading through all of the comments here, though.
@terrajobst
Can we change
public void Add<T>(T[] value);
public void Add<T>(T[] value, int index, int length);
public void Add(byte[] value);
public void Add(byte[] value, int index, int length);
to
public void AddRange<T>(T[] values);
public void AddRange<T>(T[] values, int index, int count);
public void AddRange<T>(T[] values, int index, int count, IEqualityComparer<T> comparer);
? I renamed `Add` -> `AddRange` to avoid the behavior @svick mentioned. I removed the `byte` overloads since we can specialize using `typeof(T) == typeof(byte)` inside the method if we need to do anything byte-specific. Also, I changed `value` -> `values` and `length` -> `count`. It also makes sense to have a comparer overload.
@terrajobst Can you remind me why
public void Add(string value);
public void Add(string value, StringComparison comparisonType);
is necessary when we have
public void Add<T>(T value);
public void Add<T>(T value, IEqualityComparer<T> comparer);
?
@svick
@justinvp If GetHashCode() is not going to return the computed hash code, it should probably be marked [Obsolete] and [EditorBrowsable(Never)].
:+1:
@terrajobst ~~Can we go back to having an implicit conversion from `HashCode` -> `int`, so no `ToHashCode` method?~~ edit: `ToHashCode` is fine. See @CyrusNajmabadi's response below.
@jamesqo `StringComparison` is an enum. However, people could use the equivalent `StringComparer` instead.
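That is, the generic overload already covers the string scenario, since `StringComparer` implements `IEqualityComparer<string>`:

```csharp
var hc = new HashCode();
string name = "example";

// Instead of a dedicated Add(string, StringComparison) overload:
hc.Add(name, StringComparer.OrdinalIgnoreCase);
```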
Can we go back to having an implicit conversion from HashCode -> int, so no ToHashCode method?
We discussed this and decided against it in the meeting. The issue is that when the user gets the final 'int' that extra work is often done. i.e. the internals of the hashcode will often do a finalization step, and may reset itself to a fresh state. Having that happen with an implicit conversion would be strange. If you did this:
HashCode hc = ...
int i1 = hc;
int i2 = hc;
Then you could get different results.
For that reason, we also don't like the explicit conversion (as people don't think of conversions as changing internal state).
With a method we can document explicitly that this is happening. We can even potentially name it to convey as much, i.e. "ToHashCodeAndReset" (though we decided against that). But at least the method can have clear documentation on it that the user can see in things like IntelliSense. That's not really the case with conversions.
I removed the byte overloads since we can specialize using typeof(T) == typeof(byte)
IIRC there was some concern about this not being OK from the JIT perspective. But that may have only been for the non-value-type `typeof()` cases. As long as the JIT will effectively do the right thing for the value-type `typeof()` cases, then that should be good.
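For reference, the value-type `typeof()` pattern under discussion looks roughly like this (a sketch; `AddBytes` is a hypothetical byte-specific path):

```csharp
// Sketch: for value types, typeof(T) == typeof(byte) is a JIT-time constant,
// so the untaken branch should be eliminated from the specialized code.
public void AddRange<T>(T[] values)
{
    if (typeof(T) == typeof(byte))
    {
        AddBytes((byte[])(object)values); // hypothetical byte-specific fast path
    }
    else
    {
        foreach (T value in values)
            Add(value);
    }
}
```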
@CyrusNajmabadi I was unaware that conversion to an `int` might involve mutating state. `ToHashCode` it is then.
For those thinking about the crypto perspective - http://tuprints.ulb.tu-darmstadt.de/2094/1/thesis.lehmann.pdf
@terrajobst, have you had time to read my comments (starting from here) and decide whether you approve of the tweaked API shape? If so, then I think this can be marked api-approved/up for grabs and we can start deciding on a hash algorithm.
@blowdart , any particular part of that you'd like to highlight?
I might not have been too explicit about it above, but the only non-cryptographic hashes I don't know of HashDoS breaks in are Marvin and SipHash. That is, even seeding (say) Murmur with a random value can still be broken and used for a DoS.
None, I just found it interesting, and I'm thinking the docs for this ought to say "Not for use on hash codes which are generated via cryptographic algorithms."
Decisions
- We decided to remove the `AddRange` methods because the scenario is unclear. Arrays are somewhat unlikely to show up very often. And once larger arrays are involved, the question is whether the computation should be cached. Seeing the for loop on the calling side makes it clear that you need to think about that.
- We won't add `IEnumerable` overloads to `AddRange` because they would allocate.
- We don't think we need the `Add` that takes `string` and `StringComparison`. Yes, those are likely more efficient than calling through the `IEqualityComparer`, but we can fix this later.
- We think marking `GetHashCode` as obsolete with error is a good idea, but we'd go a step further and also hide it from IntelliSense.

This leaves us with:
// Will live in the core assembly
// .NET Framework : mscorlib
// .NET Core : System.Runtime / System.Private.CoreLib
namespace System
{
public struct HashCode
{
public static int Combine<T1>(T1 value1);
public static int Combine<T1, T2>(T1 value1, T2 value2);
public static int Combine<T1, T2, T3>(T1 value1, T2 value2, T3 value3);
public static int Combine<T1, T2, T3, T4>(T1 value1, T2 value2, T3 value3, T4 value4);
public static int Combine<T1, T2, T3, T4, T5>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5);
public static int Combine<T1, T2, T3, T4, T5, T6>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6);
public static int Combine<T1, T2, T3, T4, T5, T6, T7>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7);
public static int Combine<T1, T2, T3, T4, T5, T6, T7, T8>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7, T8 value8);
public void Add<T>(T value);
public void Add<T>(T value, IEqualityComparer<T> comparer);
[Obsolete("Use ToHashCode to retrieve the computed hash code.", error: true)]
[EditorBrowsable(Never)]
public override int GetHashCode();
public int ToHashCode();
}
}
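For illustration, here's how the approved surface would be used (the example type is made up):

```csharp
// Illustrative usage of the approved API. Point is an example type.
public struct Point
{
    private readonly int _x, _y;
    private readonly int[] _tags;

    public Point(int x, int y, int[] tags) { _x = x; _y = y; _tags = tags; }

    // Common case: fixed number of inputs.
    public override int GetHashCode() => HashCode.Combine(_x, _y);

    // Power case: data-dependent number of inputs via Add/ToHashCode.
    public int GetDeepHashCode()
    {
        var hc = new HashCode();
        hc.Add(_x);
        hc.Add(_y);
        foreach (int tag in _tags)
            hc.Add(tag);
        return hc.ToHashCode();
    }
}
```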
Next steps: The issue is up-for-grabs -- to implement the API, we need several candidate algorithms as experiments -- see https://github.com/dotnet/corefx/issues/14354#issuecomment-305028686 for the list, so that we can decide which algorithm to take (based on throughput and distribution measurements, likely a different answer per CPU architecture).
Complexity: Large
If anyone is interested in picking it up, please ping us. There might be even room for several people working on it together. (@jamesqo you have priority choice as you invested most & longest into the issue)
@karelz Despite my comment above, I have changed my mind because I do not think I have the qualifications to choose the best hash algorithm. I looked into some of the libraries @morganbr listed and realized the implementation is quite complex, so I can't easily translate it to C# to test for myself. I have little background in C++, so I would also have a hard time just installing the library and writing a test app.
I do not want this to remain on the up-for-grabs list forever, though. If no one takes it up a week from today, I will consider posting a question on Programmers SE or Reddit.
I've not benched it (or otherwise optimized it), but here is a basic implementation of the Murmur3 hash algorithm that I use in several of my personal projects: https://gist.github.com/tannergooding/0a12559d1a912068b9aeb4b9586aad7f
I feel like the most optimal solution here is going to be to dynamically change the hashing algorithm based on the size of the input data.
Ex: Mumur3 (and others) are very fast for large sets of data and provide great distribution, but they can perform 'poorly' (speed wise, not distribution wise) for smaller sets of data.
I imagine we should do something like: If overall byte count is less than X, do algorithm A; otherwise, do algorithm B. This will still be deterministic (per run), but will allow us to provide speed and distribution based on the actual size of the input data.
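Something along these lines (a sketch; the cutoff and both algorithm stubs are placeholders):

```csharp
// Sketch: pick the algorithm based on total input size.
public static int ComputeHash(ReadOnlySpan<byte> data)
{
    const int Cutoff = 32; // hypothetical threshold "X"

    return data.Length < Cutoff
        ? ComputeSmallHash(data)  // algorithm A: minimal fixed overhead
        : ComputeLargeHash(data); // algorithm B: e.g. Murmur3, high throughput
}
```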
It's probably also worth noting that several of the algorithms mentioned have implementations specifically designed for SIMD instructions, so the most performant solution would likely involve an FCALL at some level (like is done with some of the BufferCopy implementations) or may involve taking a dependency on `System.Numerics.Vector`.
@jamesqo, we're happy to help out with making choices; what we need the most help with is performance data for candidate implementations (ideally C#, though as @tannergooding points out, some algorithms need special compiler support). As I mentioned above, if you build a candidate that isn't chosen, we'll probably use it later, so don't worry about work being wasted.
I know there are benchmarks out there for various implementations, but I think it's important to have a comparison using this API and a likely range of inputs (e.g. structs with 1-10 fields).
@tannergooding, that kind of adaptability might be most performant, but I don't see how it would work with the Add method since it doesn't know how many times it'll get called. While we could do it with Combine, that would mean a series of Add calls could produce a different result than the corresponding Combine call.
Also, given that the most likely range of inputs is 4-32 bytes (Combine`1-Combine`8), hopefully there aren't big performance changes over that range.
that kind of adaptability might be most performant, but I don't see how it would work with the Add method since it doesn't know how many times it'll get called.
I'm not personally convinced the API shape is quite right for general purpose hashing (it is close however)...
Currently we are exposing `Combine` methods for static construction. If these are meant to combine all inputs and produce a finalized hash code, then the name is 'poor' and something like `Compute` might be more appropriate.
If we are exposing `Combine` methods, they should just mix all inputs, and users should be required to call a `Finalize` method which takes the output from the last combine as well as the total number of bytes that were combined to produce a finalized hash code (finalizing a hash code is important as it is what causes the bits to avalanche).
For the builder pattern, we are exposing an `Add` and a `ToHashCode` method. It is not clear whether the `Add` method is meant to store the bytes and only combine/finalize on the call to `ToHashCode` (in which case we can pick the correct algorithm dynamically), or whether they are meant to be combined on the fly; if the latter, it should be clear that this is the case (and that the implementation should be internally tracking the total size of the bytes combined).
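For context on 'avalanche': finalization is typically a short mixing routine run once over the accumulated state. Murmur3's 32-bit finalizer, for example, looks like this:

```csharp
// MurmurHash3's 32-bit finalization mix ("fmix32"): after folding in the
// total length, it avalanches the state so every input bit can affect
// every output bit.
private static uint FinalizeHash(uint state, uint totalByteCount)
{
    uint h = state ^ totalByteCount;
    h ^= h >> 16;
    h *= 0x85EBCA6B;
    h ^= h >> 13;
    h *= 0xC2B2AE35;
    h ^= h >> 16;
    return h;
}
```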
For anyone looking for a less complicated starting point, try xxHash32. That's likely to translate pretty easily to C# (people have done it).
Still testing locally, but I am seeing the following throughput rates for my C# implementation of Murmur3.
These are for the static Combine methods for 1-8 inputs:
- 1 input: 1070.18 MB/s
- 2 inputs: 1511.49 MB/s
- 3 inputs: 1674.89 MB/s
- 4 inputs: 1957.65 MB/s
- 5 inputs: 2083.24 MB/s
- 6 inputs: 2140.94 MB/s
- 7 inputs: 2190.27 MB/s
- 8 inputs: 2245.53 MB/s
My implementation assumes that `GetHashCode` should be called for each input and that the computed value should be finalized before being returned. I combined `int` values, as they are the simplest to test.
To compute the throughput, I ran 10,001 iterations, throwing away the first iteration as the 'warmup' run.
In each iteration, I run 10,000 sub-iterations where I call `HashCode.Combine`, passing the result of the previous sub-iteration in as the first input value in the next iteration.
I then average all iterations to get the average elapsed time, further divide that by the number of sub-iterations run per loop to get the average time per call. I then compute the number of calls that can be made per second and multiply that by the number of bytes combined to compute the actual throughput.
Will clean up the code and share it in a bit.
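A sketch of that measurement loop as described (structure only, not a rigorous harness; `Combine` is the proposed API):

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Sketch of the methodology above: 10,001 iterations (first discarded as
// warmup), each doing 10,000 chained Combine calls.
const int Iterations = 10_000, SubIterations = 10_000;
var seconds = new double[Iterations];
int result = 0;

for (int i = -1; i < Iterations; i++) // i == -1 is the warmup pass
{
    var sw = Stopwatch.StartNew();
    for (int j = 0; j < SubIterations; j++)
        result = HashCode.Combine(result, 2, 3, 4); // feed result back in
    sw.Stop();
    if (i >= 0) seconds[i] = sw.Elapsed.TotalSeconds;
}

double secondsPerCall = seconds.Average() / SubIterations;
double mbPerSecond = (4 * sizeof(int)) / secondsPerCall / (1024 * 1024);
Console.WriteLine($"{mbPerSecond:F2} MB/s");
```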
@tannergooding, that sounds like great progress. To make sure you're getting the right measurements, the intent of the API is that a call to `HashCode.Combine(a, b)` is equivalent to calling
HashCode hc = new HashCode();
hc.Add(a); // Initializes the hash state, calls a.GetHashCode() and feeds the result into the hash state
hc.Add(b); // Calls b.GetHashCode() and feeds the result into the hash state
return hc.ToHashCode(); // Finalizes the hash state, truncates it to an int, resets the internal state and returns the int
In both cases, the data should get fed into the same internal hash state and the hash should be finalized once at the end.
👍
That is effectively what the code I wrote is doing. The only difference is that I effectively inline all the code (there is no need to allocate `new HashCode()` and track the number of bytes combined, since it is constant).
@morganbr. Implementation + Throughput test for Murmur3: https://gist.github.com/tannergooding/89bd72f05ab772bfe5ad3a03d6493650
MurmurHash3 is based off the algorithm described here: https://github.com/aappleby/smhasher/wiki/MurmurHash3, repo says it is MIT
Working on xxHash32 (BSD-2 Clause -- https://github.com/Cyan4973/xxHash/blob/dev/xxhash.c) and SpookyHash (Public Domain -- http://www.burtleburtle.net/bob/hash/spooky.html) variants
@tannergooding Again, not a hash expert but I remembered [reading an article][1] that said Murmur was not DoS-resistant, so just pointing that out before we choose that.
@jamesqo, I might be wrong, but I'm fairly certain that vulnerability applied to Murmur2 and not Murmur3.
In either case, I am implementing several algorithms so that we can get throughput results for C#. The distribution and other properties of these algorithms are fairly well known so we can pick and choose which is the best later 😄
Whoops, forgot to link to the article: http://emboss.github.io/blog/2012/12/14/breaking-murmur-hash-flooding-dos-reloaded/.
@tannergooding OK. Sounds fair :+1:
@tannergooding, I took a look at your Murmur3 implementation and it generally looks right and probably pretty well optimized. To make sure I understand correctly, are you using the fact that combinedValue and the internal state of Murmur are both 32 bits? That's probably a pretty good optimization for this case and explains some of my earlier confusion.
If we were to adopt it, it might need a couple tweaks (they probably won't make a big difference to perf measurements though):
In the meantime, while I'm longing for this API, how bad is it for me to be implementing GetHashCode via `(field1, field2, field3).GetHashCode()`?
@jnm2 , the ValueTuple hash code combiner tends to put your inputs in order in the hash code (and discard the least recent ones). For a couple of fields and a hash table that divides by a prime number, you might not notice. For a lot of fields or a hash table that divides by a power of two, the entropy of the last field you insert will have the most influence over whether you have collisions (e.g. if your last field is a bool or a small int, you'll probably have a lot of collisions, if it's a guid, you probably won't).
ValueTuple also doesn't work well with fields that are all 0.
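One quick, unscientific way to observe this (counting occupied buckets in a power-of-two-sized table when the *last* tuple field is low-entropy):

```csharp
using System;
using System.Linq;

// Counts how many of 1024 buckets get used by 1000 tuples whose last
// field is a bool. Compare against putting the Guid last instead.
int used = Enumerable.Range(0, 1000)
    .Select(i => (Guid.NewGuid(), i % 2 == 0).GetHashCode() & 1023)
    .Distinct()
    .Count();
Console.WriteLine($"{used} buckets used out of 1024");
```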
On a side note, I had to stop working on other implementations (have higher priority work). Not sure when I'll be able to pick it back up.
So if that's not good enough for a structured type, why is it good enough for a tuple?
Update 6/16/17: Looking for volunteers
The API shape has been finalized. However, we're still deciding on the best hash algorithm out of a list of candidates to use for the implementation, and we need someone to help us measure the throughput/distribution of each algorithm. If you'd like to take that role up, please leave a comment below and @karelz will assign this issue to you.
Update 6/13/17: Proposal accepted!
Here's the API that was approved by @terrajobst at https://github.com/dotnet/corefx/issues/14354#issuecomment-308190321:
The original text of this proposal follows.
Rationale
Generating a good hash code should not require use of ugly magic constants and bit twiddling in our code. It should be less tempting to write a bad-but-concise `GetHashCode` implementation such as the snippet below.
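(Illustrative; a stand-in for the kind of bad-but-concise code being discouraged.)

```csharp
// Bad but concise: XOR is symmetric, so swapping _x and _y gives the same
// hash, and equal fields cancel out to zero.
public override int GetHashCode() => _x.GetHashCode() ^ _y.GetHashCode();
```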
Proposal
We should add a `HashCode` type to encapsulate hash code creation and avoid forcing devs to get mixed up in the messy details. Here is my proposal, which is based off of https://github.com/dotnet/corefx/issues/14354#issuecomment-305019329, with a few minor revisions.
Remarks
See @terrajobst's comment at https://github.com/dotnet/corefx/issues/14354#issuecomment-305019329 for the goals of this API; all of his remarks are valid. I would like to point out these ones in particular, however: