dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Proposal: Add System.HashCode to make it easier to generate good hash codes. #19621

Closed jamesqo closed 4 years ago

jamesqo commented 7 years ago

Update 6/16/17: Looking for volunteers

The API shape has been finalized. However, we're still deciding on the best hash algorithm out of a list of candidates to use for the implementation, and we need someone to help us measure the throughput/distribution of each algorithm. If you'd like to take that role up, please leave a comment below and @karelz will assign this issue to you.

Update 6/13/17: Proposal accepted!

Here's the API that was approved by @terrajobst at https://github.com/dotnet/corefx/issues/14354#issuecomment-308190321:

// Will live in the core assembly
// .NET Framework : mscorlib
// .NET Core      : System.Runtime / System.Private.CoreLib
namespace System
{
    public struct HashCode
    {
        public static int Combine<T1>(T1 value1);
        public static int Combine<T1, T2>(T1 value1, T2 value2);
        public static int Combine<T1, T2, T3>(T1 value1, T2 value2, T3 value3);
        public static int Combine<T1, T2, T3, T4>(T1 value1, T2 value2, T3 value3, T4 value4);
        public static int Combine<T1, T2, T3, T4, T5>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5);
        public static int Combine<T1, T2, T3, T4, T5, T6>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7, T8>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7, T8 value8);

        public void Add<T>(T value);
        public void Add<T>(T value, IEqualityComparer<T> comparer);

        [Obsolete("Use ToHashCode to retrieve the computed hash code.", error: true)]
        [EditorBrowsable(EditorBrowsableState.Never)]
        public override int GetHashCode();

        public int ToHashCode();
    }
}

The original text of this proposal follows.

Rationale

Generating a good hash code should not require the use of ugly magic constants and bit twiddling in our code. It should be less tempting to write a bad-but-concise GetHashCode implementation such as

class Person
{
    public override int GetHashCode() => FirstName.GetHashCode() + LastName.GetHashCode();
}
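
With the proposed type, the same example could instead look something like this (a minimal sketch using the Combine overload proposed below; Person's properties are assumed to be strings):

class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }

    // The mixing is done by the framework; no magic constants in user code.
    public override int GetHashCode() => HashCode.Combine(FirstName, LastName);
}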

Proposal

We should add a HashCode type to encapsulate hash code creation and avoid forcing devs to get mixed up in the messy details. Here is my proposal, which is based on https://github.com/dotnet/corefx/issues/14354#issuecomment-305019329, with a few minor revisions.

// Will live in the core assembly
// .NET Framework : mscorlib
// .NET Core      : System.Runtime / System.Private.CoreLib
namespace System
{
    public struct HashCode
    {
        public static int Combine<T1>(T1 value1);
        public static int Combine<T1, T2>(T1 value1, T2 value2);
        public static int Combine<T1, T2, T3>(T1 value1, T2 value2, T3 value3);
        public static int Combine<T1, T2, T3, T4>(T1 value1, T2 value2, T3 value3, T4 value4);
        public static int Combine<T1, T2, T3, T4, T5>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5);
        public static int Combine<T1, T2, T3, T4, T5, T6>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7, T8>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7, T8 value8);

        public void Add<T>(T value);
        public void Add<T>(T value, IEqualityComparer<T> comparer);
        public void AddRange<T>(T[] values);
        public void AddRange<T>(T[] values, int index, int count);
        public void AddRange<T>(T[] values, int index, int count, IEqualityComparer<T> comparer);

        [Obsolete("Use ToHashCode to retrieve the computed hash code.", error: true)]
        public override int GetHashCode();

        public int ToHashCode();
    }
}
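
As a usage sketch of the builder-style members above (the Order type and its fields are made up for illustration):

public class Order
{
    public int Id { get; set; }
    public string[] Tags { get; set; }

    public override int GetHashCode()
    {
        var hash = new HashCode();
        hash.Add(Id);
        hash.AddRange(Tags);       // hashes each element in order
        return hash.ToHashCode();
    }
}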

Remarks

See @terrajobst's comment at https://github.com/dotnet/corefx/issues/14354#issuecomment-305019329 for the goals of this API; all of his remarks are valid. I would like to highlight these in particular, however:

AlexRadch commented 7 years ago

Proposal: add hash randomization support

public static HashCode Randomized<T> { get; } // or CreateRandomized<T>
or 
public static HashCode Randomized(Type type); // or CreateRandomized(Type type)

The T or Type type parameter is needed so that the same type gets the same randomized hash.

AlexRadch commented 7 years ago

Proposal: add support for collections

public HashCode Combine<T>(T[] values);
public HashCode Combine<T>(T[] values, IEqualityComparer<T> comparer);
public HashCode Combine<T>(Span<T> values);
public HashCode Combine<T>(Span<T> values, IEqualityComparer<T> comparer);
public HashCode Combine<T>(IEnumerable<T> values);
public HashCode Combine<T>(IEnumerable<T> values, IEqualityComparer<T> comparer);

AlexRadch commented 7 years ago

I think there is no need for overloads like Combine(_field1, _field2, _field3, _field4, _field5), because the following code HashCode.Empty.Combine(_field1).Combine(_field2).Combine(_field3).Combine(_field4).Combine(_field5); should be inlined and optimized so that the Combine calls disappear.
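
A rough sketch of the immutable shape this chaining assumes (Empty and an instance Combine returning a new value are hypothetical here, and the mixing function is just the rotate-add-xor that appears later in this thread):

public readonly struct HashCode
{
    private readonly int _value;
    private HashCode(int value) => _value = value;

    public static HashCode Empty => default(HashCode);

    // Returns a new HashCode with the next value mixed in; small enough
    // that the JIT could inline an entire chain of calls.
    public HashCode Combine<T>(T item)
    {
        int h = item?.GetHashCode() ?? 0;
        uint rol5 = ((uint)_value << 5) | ((uint)_value >> 27);
        return new HashCode(((int)rol5 + _value) ^ h);
    }

    public int ToHashCode() => _value;
}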

jamesqo commented 7 years ago

@AlexRadch

Proposal: add support for collections

Yes, that was part of my eventual plan for this proposal. I think it's important to focus on what we want the API to look like before we go about adding those methods, though.

CyrusNajmabadi commented 7 years ago

He wanted to use a different algorithm, like the Marvin32 hash which is used for strings in coreclr. This would require expanding the size of HashCode to 8 bytes.

What about having Hash32 and Hash64 types that would internally store 4 or 8 bytes worth of data? Document the pros/cons of each. Hash64 being good for X, but being potentially slower. Hash32 being faster, but potentially not as distributed (or whatever the tradeoff actually is).

He wanted to randomize the hash seed, so hashes would not be deterministic.

This seems like useful behavior. But I could see people wanting to control this. So perhaps there should be two ways to create the Hash: one that takes no seed (and uses a random seed) and one that allows the seed to be provided.

CyrusNajmabadi commented 7 years ago

Note: Roslyn would love if this could be provided in the Fx. We're adding a feature to spit out a GetHashCode for the user. Currently, it generates code like:

        public override int GetHashCode()
        {
            var hashCode = -1923861349;
            hashCode = hashCode * -1521134295 + this.b.GetHashCode();
            hashCode = hashCode * -1521134295 + this.i.GetHashCode();
            hashCode = hashCode * -1521134295 + EqualityComparer<string>.Default.GetHashCode(this.s);
            return hashCode;
        }

This is not a great experience, and it exposes many ugly concepts. We would be thrilled to have a Hash.Whatever API that we could call through instead.

Thanks!

tannergooding commented 7 years ago

What about MurmurHash? It is reasonably fast and has very good hashing properties. There are also two different implementations, one that spits out 32-bit hashes and another that spits out 128-bit hashes.

tannergooding commented 7 years ago

There are also vectorized implementations for both the 32-bit and 128-bit formats.

jamesqo commented 7 years ago

@tannergooding MurmurHash is fast, but not secure, from the sounds of this blog post.

jamesqo commented 7 years ago

@jkotas, has there been any work in the JIT around generating better code for >4-byte structs on 32-bit since our discussions last year? Also, what do you think of @CyrusNajmabadi's proposal:

What about having Hash32 and Hash64 types that would internally store 4 or 8 bytes worth of data? Document the pros/cons of each. Hash64 being good for X, but being potentially slower. Hash32 being faster, but potentially not as distributed (or whatever the tradeoff actually is).

I still think this type would be very valuable to offer to developers and it would be great to have it in 2.0.

tannergooding commented 7 years ago

@jamesqo, I don't think this implementation needs to be cryptographically secure (that is the purpose of the explicit cryptographically hashing functions).

Also, that article applies to Murmur2. The issue has been resolved in the Murmur3 algorithm.

jkotas commented 7 years ago

the JIT around generating better code for >4-byte structs on 32-bit since our discussions last year

I am not aware of any.

what do you think of @CyrusNajmabadi's proposal

The framework types should be simple choices that work well for 95%+ of cases. They may not be the fastest ones, but that's fine. Having to choose between Hash32 and Hash64 is not a simple choice.

CyrusNajmabadi commented 7 years ago

That's fine with me. But can we at least have a good-enough solution for those 95% cases? Right now there's nothing... :-/

jkotas commented 7 years ago

hashCode = hashCode * -1521134295 + EqualityComparer.Default.GetHashCode(this.s);

@CyrusNajmabadi Why are you calling EqualityComparer here, and not just this.s.GetHashCode()?

CyrusNajmabadi commented 7 years ago

For non-structs: so that we don't need to check for null.

This is close to what we generate for anonymous types behind the scenes as well. I optimize the case of known non-null values to generate code that would be more pleasing to users. But it would be nice to just have a built in API for this.
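
For illustration, the two forms being compared look roughly like this (a sketch; the seed constant is taken from the generated code above):

using System.Collections.Generic;

class Sample
{
    private string s = "example";
    private readonly int seed = -1923861349;

    // Comparer form: null-safe without an explicit check, but every call goes
    // through EqualityComparer<string>.Default.GetHashCode.
    public int HashWithComparer() =>
        seed * -1521134295 + EqualityComparer<string>.Default.GetHashCode(s);

    // Null-check form: a direct call guarded by a null test, which is roughly
    // an order of magnitude cheaper per field.
    public int HashWithNullCheck() =>
        seed * -1521134295 + (s?.GetHashCode() ?? 0);
}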

jkotas commented 7 years ago

The call to EqualityComparer.Default.GetHashCode is like 10x+ more expensive than a check for null...

CyrusNajmabadi commented 7 years ago

The call to EqualityComparer.Default.GetHashCode is like 10x+ more expensive than a check for null...

Sounds like a problem. If only there were a good hash code API in the Fx that I could defer to :)

(Also, we then have that problem in our anonymous types, as that's what we generate there as well.)

Not sure what we do for tuples, but I'm guessing it's similar.

jkotas commented 7 years ago

Not sure what we do for tuples, but i'm guessing it's similar.

System.Tuple goes through EqualityComparer<Object>.Default for historic reasons. System.ValueTuple calls Object.GetHashCode with null check - https://github.com/dotnet/coreclr/blob/master/src/mscorlib/shared/System/ValueTuple.cs#L809.

CyrusNajmabadi commented 7 years ago

Oh no. Looks like tuple can just use "HashHelpers". Could that be exposed so that users can get the same benefit?

CyrusNajmabadi commented 7 years ago

Great. I'm happy to do something similar. I started from our anonymous types because I figured they were reasonable best practices. If not, that's fine. :)

But that's not why I'm here. I'm here to get some system that actually combines the hashes effectively. If/when that can be provided, we'll gladly move to calling into that instead of hardcoding in random numbers and combining hash values ourselves.

jkotas commented 7 years ago

What would be the API shape that you think would work best for the compiler generated code?

CyrusNajmabadi commented 7 years ago

Literally any of the 32-bit solutions that were presented earlier would be fine with me. Heck, 64-bit solutions are fine with me. Just some sort of API that you can get that says "I can combine hashes in some sort of reasonable fashion and produce a reasonably distributed result".

CyrusNajmabadi commented 7 years ago

I can't reconcile these statements:

We had an immutable HashCode struct that was 4 bytes in size. It had a Combine(int) method, which mixed in the provided hash code with its own hash code via a DJBX33X-like algorithm, and returned a new HashCode.

@jkotas did not think the DJBX33X-like algorithm was robust enough.

And

The framework types should be simple choices that work well for 95%+ of cases.

Can we not come up with a simple 32bit accumulating hash that works well enough for 95% of cases? What are the cases that aren't handled well here, and why do we think they're in the 95% case?

jamesqo commented 7 years ago

@jkotas, is performance really that critical for this type? I think on average things like hashtable lookups and this would take up way more time than a few struct copies. If it does turn out to be a bottleneck, would it be reasonable to ask the JIT team to optimize 32-bit struct copies after the API is released so they have some incentive, rather than blocking this API on that when nobody is working on optimizing copies?

jkotas commented 7 years ago

Can we not come up with a simple 32bit accumulating hash that works well enough for 95% of cases?

We have been burnt really badly by the default 32-bit accumulating hash for strings, and that's why we use the Marvin hash for strings in .NET Core - https://github.com/dotnet/corert/blob/87e58839d6629b5f90777f886a2f52d7a99c076f/src/System.Private.CoreLib/src/System/Marvin.cs#L25. I do not think we want to repeat the same mistake here.

@jkotas, is performance really that critical for this type?

I do not think the performance is critical. Since it looks like this API is going to be used by auto-generated compiler code, I think we should prefer smaller generated code over how it looks. The non-fluent pattern is smaller code.

CyrusNajmabadi commented 7 years ago

We have been burnt really badly by the default 32-bit accumulating hash for strings

That doesn't seem like the 95% case. We're talking about normal developers just wanting a "good enough" hash for all those types where they manually do things today.

Since it looks like that this API is going to be used by auto-generated compiler code, I think we should be preferring smaller generated code over how it looks. The non-fluent pattern is smaller code.

This is not for use by the Roslyn compiler. This is for use by the Roslyn IDE when we help users generate GetHashCodes for their types. This is code that the user will see and have to maintain, and having something sensible like:

   return Hash.Combine(this.A?.GetHashCode() ?? 0,
                       this.B?.GetHashCode() ?? 0,
                       this.C?.GetHashCode() ?? 0);

is a lot nicer than a user seeing and having to maintain:

            var hashCode = -1923861349;
            hashCode = hashCode * -1521134295 + this.b.GetHashCode();
            hashCode = hashCode * -1521134295 + this.i.GetHashCode();
            hashCode = hashCode * -1521134295 + EqualityComparer<string>.Default.GetHashCode(this.s);
            return hashCode;

CyrusNajmabadi commented 7 years ago

I mean, we already have this code in the Fx:

https://github.com/dotnet/roslyn/blob/master/src/Compilers/Test/Resources/Core/NetFX/ValueTuple/ValueTuple.cs#L5

We think it's good enough for tuples. It's unclear to me why it would be such a problem to make it available for users who want it for their own types.

Note: we've even considered doing this in roslyn:

return (this.A, this.B, this.C).GetHashCode();

But now you're forcing people to generate a (potentially large) struct just to get some sort of reasonable default hashing behavior.

jkotas commented 7 years ago

We're talking about normal developers just wanting a "good enough" hash for all those types where they manually do things today.

The original string hash was a "good enough" hash that worked well for normal developers. But then it was discovered that ASP.NET webservers were vulnerable to DoS attacks because they tend to store received stuff in hashtables. So the "good enough" hash basically turned into a bad security issue.

We think it's good enough for tuples

Not necessarily. We made a backstop measure for tuples to make the hash code randomized, which gives us the option to modify the algorithm later.

jkotas commented 7 years ago

     return Hash.Combine(this.A?.GetHashCode() ?? 0,
                         this.B?.GetHashCode() ?? 0,
                         this.C?.GetHashCode() ?? 0);

This looks reasonable to me.

CyrusNajmabadi commented 7 years ago

I don't get your position. You seem to be saying two things:

The original string hash was a "good enough" hash that worked well for normal developers. But then it was discovered that ASP.NET webservers were vulnerable to DoS attacks because they tend to store received stuff in hashtables. So the "good enough" hash basically turned into a bad security issue.

Ok, if that's the case, then let's provide a hash code that's good for people who have security/DoS concerns.

The framework types should be simple choices that work well for 95%+ of cases.

Ok, if that's the case, then let's provide a hash code that's good enough for the 95% of cases. People who have security/DoS concerns can use the specialized forms that are documented for that purpose.

Not necessarily. We made a backstop measure for tuples to make the hash code randomized, which gives us the option to modify the algorithm later.

Ok. Can we expose that so that users can use that same mechanism?

-- I'm really struggling here because it sounds like we're saying "because we can't make a universal solution, everyone has to roll their own". That seems like one of the worst places to be in. Because certainly most of our customers aren't thinking about rolling their own 'marvin hash' for DoS concerns. They're just adding, xoring, or otherwise poorly combining field hashes into one final hash.

If we care about the 95% case, then we should just make a generally good enough hash. If we care about the 5% case, we can supply a specialized solution for that.

CyrusNajmabadi commented 7 years ago

This looks reasonable to me.

Great :) Can we then expose:

namespace System.Numerics.Hashing
{
    internal static class HashHelpers
    {
        public static readonly int RandomSeed = new Random().Next(Int32.MinValue, Int32.MaxValue);

        public static int Combine(int h1, int h2)
        {
            // RyuJIT optimizes this to use the ROL instruction
            // Related GitHub pull request: dotnet/coreclr#1830
            uint rol5 = ((uint)h1 << 5) | ((uint)h1 >> 27);
            return ((int)rol5 + h1) ^ h2;
        }
    }
}

Roslyn could then generate:

     return Hash.Combine(Hash.RandomSeed,
                         this.A?.GetHashCode() ?? 0,
                         this.B?.GetHashCode() ?? 0,
                         this.C?.GetHashCode() ?? 0);

This would have the benefit of really being "good enough" for the vast majority of cases, while also leading people down the good path of initializing with random values so they don't take dependencies on non-random hashes.
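
Since the helper only exposes a two-argument Combine, a multi-value call like the one above would presumably expand into nested combines, roughly (using the HashHelpers names from the snippet):

     return HashHelpers.Combine(
                HashHelpers.Combine(
                    HashHelpers.Combine(HashHelpers.RandomSeed, this.A?.GetHashCode() ?? 0),
                    this.B?.GetHashCode() ?? 0),
                this.C?.GetHashCode() ?? 0);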

jkotas commented 7 years ago

People who have security/DoS concerns can use the specialized forms that are documented for that purpose.

Every ASP.NET app has security/DoS concern.

jkotas commented 7 years ago

Great :) Can we then expose:

This is different from what I have said is reasonable.

What do you think about https://github.com/aspnet/Common/blob/dev/shared/Microsoft.Extensions.HashCodeCombiner.Sources/HashCodeCombiner.cs? It is what is used internally in ASP.NET in a number of places today, and it is what I would be pretty happy with (except that the combining function needs to be stronger - an implementation detail that we can keep tweaking).

blowdart commented 7 years ago

@jkotas I heard that :p

So the problem here is developers don't know when they're susceptible to DoS attacks, because it's not something they think about, which is why we switched strings to use Marvin32.

We should not head down the route of saying "95% of the cases don't matter", because we have no way to prove that, and we must err on the side of caution even when it has a performance cost. If you're going to move away from that, then the hash code implementation needs Crypto Board review, not just us deciding "This looks good enough".

CyrusNajmabadi commented 7 years ago

Every ASP.NET app has security/DoS concern.

Ok. So how are you dealing with the issue today that no one has any help with hash codes, and thus is likely doing things poorly? Clearly it's been acceptable to have that state of the world. So what is harmed by providing a reasonable hashing system that likely performs better than what people are hand rolling today?

because we have no way to prove that, and we must err on the side of caution even when it has a performance cost

If you don't provide something, people will continue to just do things badly. The rejection of the "good enough" because there's nothing perfect just means the poor status quo we have today.

Every ASP.NET app has security/DoS concern.

Can you explain this? As I understand it, you have a DoS concern if you're accepting arbitrary input and then storing it in some data structure that performs poorly if the inputs can be specially crafted. Ok, I get how that's a concern with the strings one gets in web scenarios that have come from the user.

So how does that apply to the remainder of types out there that are not being used in this scenario?

We have these sets of types:

  1. User types that need to be DoS safe. Right now we don't supply anything to help out, so we're already in a bad place as people are likely not doing the right thing.
  2. User types that don't need to be DoS safe. Right now we don't supply anything to help out, so we're already in a bad place as people are likely not doing the right thing.
  3. Framework types that need to be DoS safe. Right now we've made them DoS safe, but through APIs we don't expose.
  4. Framework types that don't need to be DoS safe. Right now we've given them hashes, but through APIs we don't expose.

Basically, we think these cases are important, but not important enough to actually provide a solution to users to handle '1' or '2'. Because we're worried a solution for '2' won't be good for '1', we won't even provide it in the first place. And if we're not willing to even provide a solution for '1', it feels like we're in an incredibly strange position. We're worried about DoSing and ASP, but not worried enough to actually help people. And because we won't help people with that, we're not even willing to help them with the non-DoS cases.

--

If these two cases are important (which i'm willing to accept) then why not just give two APIs? Document them. Make them clear what they're for. If people use them properly, great. If people don't use them properly that's still fine. After all, they're likely not doing things properly today anyways, so how are things any worse?

CyrusNajmabadi commented 7 years ago

What do you think about

I have no opinion one way or the other. If it's an API that customers can use which performs acceptably and which provides a simple API with clear code on their end, then i think that's fine.

I think it would be nice to have a simple static form that handles the 99% case of wanting to combine a set of fields/properties in an ordered fashion. It seems like such a thing could be added to this type fairly simply.

jkotas commented 7 years ago

I think it would be nice to have a simple static form

Agree.

jamesqo commented 7 years ago

I think it would be nice to have a simple static form that handles the 99% case of wanting to combine a set of fields/properties in an ordered fashion. It seems like such a thing could be added to this type fairly simply.

Agree.

I am willing to meet you both halfway on this one because I really want to see some sort of API come through. @jkotas I still do not understand why you're opposed to adding an immutable instance-based API; first you said it was because 32-bit copies would be slow, then because the mutable API would be more terse (which is not true; h.Combine(a).Combine(b) (immutable version) is shorter than h.Combine(a); h.Combine(b); (mutable version)).

That said, I'm willing to go back to:

public static class HashCode
{
    public static int Combine<T>(T value1, T value2);
    public static int Combine<T>(T value1, T value2, IEqualityComparer<T> comparer);
    public static int Combine<T>(T value1, T value2, T value3);
    public static int Combine<T>(T value1, T value2, T value3, IEqualityComparer<T> comparer);
    public static int Combine<T>(T value1, T value2, T value3, T value4);
    public static int Combine<T>(T value1, T value2, T value3, T value4, IEqualityComparer<T> comparer);
    // ... All the way until value8
}

Does this seem reasonable?

jamesqo commented 7 years ago

I can't edit my post right now, but I just realized not all the values may be of the same type T. In that case, we can just have 8 overloads accepting all ints and force the user to call GetHashCode.
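
A sketch of that int-only fallback shape (hypothetical; not the API that was ultimately approved):

public static class HashCode
{
    public static int Combine(int hash1, int hash2);
    public static int Combine(int hash1, int hash2, int hash3);
    public static int Combine(int hash1, int hash2, int hash3, int hash4);
    // ... up to eight int parameters. Callers invoke GetHashCode themselves:
    // HashCode.Combine(x.GetHashCode(), y?.GetHashCode() ?? 0)
}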

blowdart commented 7 years ago

If these two cases are important (which i'm willing to accept) then why not just give two APIs? Document them. Make them clear what they're for. If people use them properly, great. If people don't use them properly that's still fine. After all, they're likely not doing things properly today anyways, so how are things any worse?

Because people don't use things properly even when they're there. Let's take a simple example: XSS. From the beginning, even Web Forms had the ability to HTML-encode output. However, developers didn't know the risk, didn't know how to do it properly, and only found out when it was too late: their app was published, and oops, now their auth cookie has been lifted.

Giving people a security choice assumes they

  1. Know about the problem.
  2. Understand what the risks are.
  3. Can evaluate those risks.
  4. Can easily discover the right thing to do.

Those assumptions don't generally hold for the majority of developers; they only find out about the problem when it's too late. Developers don't go to security conferences, they don't read white papers, and they don't understand the solutions. So in the ASP.NET HashDoS scenario we made the choice for them; we protected them by default, because that was the right thing to do and had the greatest impact. However, we only applied it to strings, and that left people who were constructing custom classes from user input in a bad place. We should do the right thing and help protect those customers now, and make it the default, creating a pit of success, not failure. API design for security is sometimes not about choice, but about helping the user whether they know it or not.

benaadams commented 7 years ago

A user can always create a non-security focused hash; so given the two options

  1. Default hash utility is non-security aware; user can create a security aware hash function
  2. Default hash utility is security aware; user can create a custom non-security aware hash function

Then the second is probably better; and what's suggested wouldn't have the perf impact of a full-on crypto hash, so it makes a good compromise?

morganbr commented 7 years ago

One of the running questions in these threads has been which algorithm is perfect for everybody. I think it's safe to say there isn't a single perfect algorithm. However, I don't think that should stop us from providing something better than code like what @CyrusNajmabadi has shown, which tends to have poor entropy for common .NET inputs as well as other common hasher bugs (like losing input data or being easily resettable).

I'd like to propose a couple of options to get around the "best algorithm" problem:

  1. Explicit Choices: I'm planning to send out an API proposal soonish for a suite of non-cryptographic hashes (perhaps xxHash, Marvin32, and SpookyHash for example). Such an API has slightly different usage than a HashCode or HashCodeHelper type, but for the sake of discussion, assume we can work out those differences. If we use that API for GetHashCode:

    • The generated code is explicit about what it's doing -- if Roslyn generates Marvin32.Create();, it lets power users know what it decided to do and they can easily change it to another algorithm in the suite if they like.
    • It means we don't have to worry about breaking changes. If we start with a non-randomizing/poor entropy/slow algorithm, we can simply update Roslyn to start generating something else in new code. Old code will keep using the old hash and new code will use the new hash. Developers (or a Roslyn code fix) can change the old code if they want to.
    • The biggest downside I can think of is that some of the optimizations we might want for GetHashCode could be detrimental for other algorithms. For example, while a 32-bit internal state works nicely with immutable structs, a 256-bit internal state in (say) CityHash might waste a bunch of time copying.
  2. Randomization: Start with a properly randomized algorithm (the code @CyrusNajmabadi showed with a random initial value doesn't count since it's likely possible to wash out the randomness). This ensures that we can change the implementation with no compatibility issues. We would still need to be very sensitive about performance changes if we change the algorithm. However that would also be a potential upside as we could make per-architecture (or even per-device) choices. For example, this site shows that xxHash is fastest on an x64 Mac while SpookyHash is fastest on Xbox and iPhone. If we do go down this route with an intent to change algorithms at some point, we may need to think about designing an API that still has reasonable performance if there is 64+ bit internal state.

CC @bartonjs, @terrajobst

svick commented 7 years ago

@morganbr There isn't a single perfect algorithm, but I think that having some algorithm, which works fairly well most of the time, exposed using a simple, easy to understand API is the most useful thing that can be done. Having a suite of algorithms in addition to that, for advanced uses, is fine. But it shouldn't be the only option; I shouldn't have to learn who Marvin is just so that I can put my objects into a Dictionary.

CyrusNajmabadi commented 7 years ago

I shouldn't have to learn who Marvin is just so that I can put my objects into a Dictionary.

I like the way you put that. I also like that you mentioned Dictionary itself. IDictionary is something that can have tons of different impls with all sorts of differing qualities (see the collections APIs in many platforms). However, we still just provide a base 'Dictionary' that does a decent job overall, even though it may not excel in every category.

I think that's what a ton of people are looking for in a hashing library. Something that gets the job done, even if it is not perfect for every purpose.

CyrusNajmabadi commented 7 years ago

@morganbr I think people simply want a way to write GetHashCode that is better than what they're doing today (usually some grab-bag combination of math operations they copied from something on the web). If you can just provide a basic impl of that that runs well, then people will be happy. You can then have a behind-the-scenes API for advanced users if they have a strong need for specific hashing functions.

In other words, people writing hash codes today aren't going to know or care why they would want Spooky vs Marvin vs Murmur. Only someone who has a particular need for one of those specific hash codes would go looking. But lots of people have a need to say "here's the state of my object; provide me a way to produce a well-distributed hash that is fast, that I can then use with dictionaries, and which I guess prevents me from being DoSed if I happen to take untrusted input and hash it and store it".

bartonjs commented 7 years ago

@CyrusNajmabadi The problem is that if we extend our current notions of compatibility into the future we find that once this type ships it can't ever change (unless we find that the algorithm is horribly broken in an "it makes all applications attackable" manner).

One can argue that if it starts off in a stable-randomized manner it becomes easy to change the implementation, since you couldn't depend on the value from run to run anyway. But if a couple of years later we find that there's an algorithm that provides as-good-if-not-better balancing of hash buckets with better-in-the-general-case performance, but makes a structure involving a List<string> of 1000 or more members where each member is over 900 characters long get significantly worse, we probably won't make the change... even though it would on net (across all programs ever run) reduce the number of CPU-hours spent hashing.

Under Morgan's suggestion, the code that you write today will have effectively the same performance characteristics forever. For the applications which could have gotten better, this is unfortunate. For the applications which would have gotten worse, this is fantastic. But when we find the new algorithm we get it checked in, and we change Roslyn (and suggest a change to ReSharper/etc.) to start generating things with NewAwesomeThing2019 instead of SomeThingThatWasConsideredAwesomeIn2018.

Anything super black box like this only ever gets to be done once. And then we're stuck with it forever. Then someone writes the next one, which has better average performance, so there are two black box implementations that you don't know why you'd choose between them. And then... and then....

So, sure, you may not know why Roslyn/ReSharper/etc auto-wrote GetHashCode for you using Marvin32, or Murmur, or FastHash, or a combination/conditional based on IntPtr.Size. But you have the power to look into it. And you have the power to change it on your types later, as new information is revealed... but we've also given you the power to keep it the same. (It'd be sad if we write this, and in 3 years Roslyn/ReSharper/etc are explicitly avoiding calling it, because the new algorithm is So Much Better... Usually).

svick commented 7 years ago

@bartonjs What makes hashing different from all the places where .NET provides you with a black-box algorithm or data structure? For example, sorting (introsort), Dictionary (array-based separate chaining), StringBuilder (linked list of 8k chunks), most of LINQ.

terrajobst commented 7 years ago

We've taken a deeper look at this today. Apologies for the delay and the back and forth on this issue.

Requirements

API Shape

// Will live in the core assembly
// .NET Framework : mscorlib
// .NET Core      : System.Runtime / System.Private.CoreLib
namespace System
{
    public struct HashCode
    {
        public static int Combine<T1>(T1 value1);
        public static int Combine<T1, T2>(T1 value1, T2 value2);
        public static int Combine<T1, T2, T3>(T1 value1, T2 value2, T3 value3);
        public static int Combine<T1, T2, T3, T4>(T1 value1, T2 value2, T3 value3, T4 value4);
        public static int Combine<T1, T2, T3, T4, T5>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5);
        public static int Combine<T1, T2, T3, T4, T5, T6>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7, T8>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7, T8 value8);

        public void Add<T>(T value);
        public void Add<T>(T value, IEqualityComparer<T> comparer);
        public void Add<T>(T[] value);
        public void Add<T>(T[] value, int index, int length);
        public void Add(byte[] value);
        public void Add(byte[] value, int index, int length);
        public void Add(string value);
        public void Add(string value, StringComparison comparisonType);

        public int ToHashCode();
    }
}

Notes:

Usage

The simple case is when someone just wants to produce a good hash code for a given type, like so:

public class Customer
{
    public int Id { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }

    public override int GetHashCode() => HashCode.Combine(Id, FirstName, LastName);
}

The more complicated case is when the developer needs to tweak how the hash is being computed. The idea is that the call site passes the desired hash rather than the object/value, like so:

public partial class Customer
{
    public override int GetHashCode() =>
        HashCode.Combine(
            Id,
            StringComparer.OrdinalIgnoreCase.GetHashCode(FirstName),
            StringComparer.OrdinalIgnoreCase.GetHashCode(LastName));
}

And lastly, if the developer needs more flexibility, such as producing a hash code for more than eight values, we also provide a builder-style approach:

public partial class Customer
{
    public override int GetHashCode()
    {
        var hashCode = new HashCode();
        hashCode.Add(Id);
        hashCode.Add(FirstName, StringComparison.OrdinalIgnoreCase);
        hashCode.Add(LastName, StringComparison.OrdinalIgnoreCase);
        return hashCode.ToHashCode();
    }
}
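
For more than eight values, the same builder can be driven from a loop; a minimal sketch (the TaggedItem type and its Tags collection are made up for illustration):

public class TaggedItem
{
    public int Id { get; set; }
    public List<string> Tags { get; } = new List<string>();

    public override int GetHashCode()
    {
        var hashCode = new HashCode();
        hashCode.Add(Id);
        foreach (var tag in Tags)
            hashCode.Add(tag, StringComparison.OrdinalIgnoreCase);
        return hashCode.ToHashCode();
    }
}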

Next Steps

This issue will remain up for grabs. In order to implement the API we need to decide which algorithm to use.

@morganbr will make a proposal for good candidates. Generally speaking, we don't want to write a hashing algorithm from scratch -- we want to use a well-known one whose properties are well-understood.

However, we should measure the implementation for typical .NET workloads and see which algorithm produces good results (throughput and distribution). It's likely that the answers will differ by CPU architecture, so we should consider this when measuring.

@jamesqo, are you still interested in working in this area? If so, please update the proposal accordingly.

morganbr commented 7 years ago

@terrajobst, we might also want public static int Combine<T1>(T1 value);. I know it looks a little funny, but it would provide a way of diffusing bits from something with a limited input hash space. For example, many enums only have a few possible hashes, only using the bottom few bits of the code. Some collections are built on the assumption that hashes are spread over a larger space, so diffusing the bits may help the collection work more efficiently.
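
For example, the single-argument overload would let an enum's few distinct hash values be spread across more bits before they reach a collection; a sketch (the enum and wrapper type are only illustrative):

enum Status { Active, Inactive, Pending }

class StatusKey
{
    public Status Value { get; set; }

    // Status.GetHashCode() only varies in the low bits; routing it through
    // Combine is intended to diffuse those bits for bucket-sensitive collections.
    public override int GetHashCode() => HashCode.Combine(Value);
}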

justinvp commented 7 years ago

public void Add(string value, StringComparison comparison);

Nit: The StringComparison parameter should be named comparisonType to match the naming used everywhere else StringComparison is used as a parameter.