dotnet / runtime


Proposal: Add System.HashCode to make it easier to generate good hash codes. #19621

Closed (jamesqo closed this issue 4 years ago)

jamesqo commented 7 years ago

Update 6/16/17: Looking for volunteers

The API shape has been finalized. However, we're still deciding on the best hash algorithm out of a list of candidates to use for the implementation, and we need someone to help us measure the throughput/distribution of each algorithm. If you'd like to take that role up, please leave a comment below and @karelz will assign this issue to you.

Update 6/13/17: Proposal accepted!

Here's the API that was approved by @terrajobst at https://github.com/dotnet/corefx/issues/14354#issuecomment-308190321:

// Will live in the core assembly
// .NET Framework : mscorlib
// .NET Core      : System.Runtime / System.Private.CoreLib
namespace System
{
    public struct HashCode
    {
        public static int Combine<T1>(T1 value1);
        public static int Combine<T1, T2>(T1 value1, T2 value2);
        public static int Combine<T1, T2, T3>(T1 value1, T2 value2, T3 value3);
        public static int Combine<T1, T2, T3, T4>(T1 value1, T2 value2, T3 value3, T4 value4);
        public static int Combine<T1, T2, T3, T4, T5>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5);
        public static int Combine<T1, T2, T3, T4, T5, T6>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7, T8>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7, T8 value8);

        public void Add<T>(T value);
        public void Add<T>(T value, IEqualityComparer<T> comparer);

        [Obsolete("Use ToHashCode to retrieve the computed hash code.", error: true)]
        [EditorBrowsable(Never)]
        public override int GetHashCode();

        public int ToHashCode();
    }
}
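
As a rough illustration of how the approved shape might be used: Combine covers a fixed, small number of fields, while Add plus ToHashCode covers a variable number of values or a custom comparer. The Person type and its members below are hypothetical and are not part of the proposal.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type, used only to illustrate the approved API.
public sealed class Person
{
    public string FirstName { get; }
    public string LastName { get; }
    public List<string> Nicknames { get; } = new List<string>();

    public Person(string firstName, string lastName)
    {
        FirstName = firstName;
        LastName = lastName;
    }

    // Fixed number of fields: one of the static Combine overloads.
    public int HashOfNames() => HashCode.Combine(FirstName, LastName);

    // Variable number of values: accumulate with Add, then call ToHashCode.
    public int ComputeHash()
    {
        var hash = new HashCode();
        hash.Add(FirstName);
        // Add<T>(T, IEqualityComparer<T>): hash the last name case-insensitively.
        hash.Add(LastName, StringComparer.OrdinalIgnoreCase);
        foreach (string nickname in Nicknames)
        {
            hash.Add(nickname);
        }
        return hash.ToHashCode();
    }
}
```

The resulting values should only be compared within a single process; nothing in the approved shape promises stable values across runs.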

The original text of this proposal follows.

Rationale

Generating a good hash code should not require use of ugly magic constants and bit twiddling in our code. It should be less tempting to write a bad-but-concise GetHashCode implementation such as

class Person
{
    public override int GetHashCode() => FirstName.GetHashCode() + LastName.GetHashCode();
}
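
One concrete weakness of the additive pattern is that addition is commutative, so swapping the two names always produces the same hash. The sketch below shows this; the class and method names are made up for illustration, and the unchecked wrapper is added only so the sum cannot throw in a checked build.

```csharp
using System;

// Hypothetical helper illustrating why summing component hashes is weak.
public static class AdditiveHashDemo
{
    // The pattern from the Person example: sum of the component hash codes.
    public static int Additive(string first, string last) =>
        unchecked(first.GetHashCode() + last.GetHashCode());

    // Order-sensitive alternative using the approved API.
    public static int Combined(string first, string last) =>
        HashCode.Combine(first, last);

    // Always true: a + b == b + a, so ("John", "Smith") and
    // ("Smith", "John") land in the same bucket.
    public static bool SwappedNamesCollide(string a, string b) =>
        Additive(a, b) == Additive(b, a);
}
```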

Proposal

We should add a HashCode type to encapsulate hash code creation and avoid forcing devs to get mixed up in the messy details. Here is my proposal, which is based on https://github.com/dotnet/corefx/issues/14354#issuecomment-305019329, with a few minor revisions.

// Will live in the core assembly
// .NET Framework : mscorlib
// .NET Core      : System.Runtime / System.Private.CoreLib
namespace System
{
    public struct HashCode
    {
        public static int Combine<T1>(T1 value1);
        public static int Combine<T1, T2>(T1 value1, T2 value2);
        public static int Combine<T1, T2, T3>(T1 value1, T2 value2, T3 value3);
        public static int Combine<T1, T2, T3, T4>(T1 value1, T2 value2, T3 value3, T4 value4);
        public static int Combine<T1, T2, T3, T4, T5>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5);
        public static int Combine<T1, T2, T3, T4, T5, T6>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7, T8>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7, T8 value8);

        public void Add<T>(T value);
        public void Add<T>(T value, IEqualityComparer<T> comparer);
        public void AddRange<T>(T[] values);
        public void AddRange<T>(T[] values, int index, int count);
        public void AddRange<T>(T[] values, int index, int count, IEqualityComparer<T> comparer);

        [Obsolete("Use ToHashCode to retrieve the computed hash code.", error: true)]
        public override int GetHashCode();

        public int ToHashCode();
    }
}

Remarks

See @terrajobst's comment at https://github.com/dotnet/corefx/issues/14354#issuecomment-305019329 for the goals of this API; all of his remarks are valid. I would like to point out these ones in particular, however:

CyrusNajmabadi commented 6 years ago

Question for the participants here. The Roslyn IDE allows users to generate a GetHashCode impl based on a set of fields/properties in their class/struct. Ideally, people could use the new HashCode.Combine that was added in https://github.com/dotnet/corefx/pull/25013. However, some users will not have access to that code. So, we'd like to still be able to generate a GetHashCode that will work for them.

Recently, it came to our attention that the form we generate is problematic: VB compiles with overflow checks on by default, and our impl will cause overflows. Also, VB has no way to disable overflow checks for a region of code. It's either on or off for the entire assembly.

Because of this, i'd love to be able to replace the impl we provide with a form that doesn't suffer from these problems. Ideally, the form generated would have the following properties:

  1. One/two lines in GetHashCode per field/property used.
  2. No overflowing.
  3. Reasonably good hashing. We're not expecting amazing results. But something that has hopefully already been vetted to be decent, and to not have the problems you usually get with a + b + c + d or a ^ b ^ c ^ d.
  4. No additional dependencies/requirements on the code.

For example, one option for VB would be to generate something like:

return (a, b, c, d).GetHashCode()

But this then depends on having a reference to System.ValueTuple. Ideally, we could have an impl that works even in the absence of that.

Does anyone know about a decent hashing algorithm that can work with these constraints? Thanks!

--

Note: our existing emitted code is:

        Dim hashCode = -252780983
        hashCode = hashCode * -1521134295 + i.GetHashCode()
        hashCode = hashCode * -1521134295 + j.GetHashCode()
        Return hashCode

This clearly can overflow.

This is also not a problem for C# as we can just add unchecked { } around that code. That fine-grained control is not possible in VB.
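
To make the difference concrete, here is a small C# sketch of the same unrolled hash in both contexts. The class and method names are hypothetical. The first multiplication alone exceeds the range of a 32-bit int, so the checked variant throws for any inputs.

```csharp
using System;

// Hypothetical demo of the emitted hash under unchecked and checked arithmetic.
public static class OverflowDemo
{
    // What the C# generator can emit: wrap-around arithmetic, never throws.
    public static int UncheckedHash(int i, int j)
    {
        unchecked
        {
            int hashCode = -252780983;
            hashCode = hashCode * -1521134295 + i.GetHashCode();
            hashCode = hashCode * -1521134295 + j.GetHashCode();
            return hashCode;
        }
    }

    // The same statements with overflow checks on, which is the VB default.
    // Returns true when an OverflowException is raised.
    public static bool CheckedHashThrows(int i, int j)
    {
        try
        {
            checked
            {
                int hashCode = -252780983;
                hashCode = hashCode * -1521134295 + i.GetHashCode();
                hashCode = hashCode * -1521134295 + j.GetHashCode();
            }
            return false;
        }
        catch (OverflowException)
        {
            return true;
        }
    }
}
```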

jamesqo commented 6 years ago

> Does anyone know about a decent hashing algorithm that can work with these constraints? Thanks!

Well, you could do Tuple.Create(...).GetHashCode(). Obviously that incurs allocations, but it seems better than throwing an exception.

Is there any reason you can't just tell the user to install System.ValueTuple? Since it's a builtin language feature, I'm sure the System.ValueTuple package is very compatible with basically all platforms right?

CyrusNajmabadi commented 6 years ago

> Obviously that incurs allocations, but it seems better than throwing an exception.

Yes. it would be nice to not have it cause allocations.

> Is there any reason you can't just tell the user to install System.ValueTuple?

That would be the behavior if we generate the ValueTuple approach. However, again, it would be nice if we could just generate something good that fits with the way the user has currently structured their code, without making them change their structure in a heavyweight way.

It really does seem like VB users should have a way to address this problem in a reasonable manner :) But such an approach is eluding me :)

morganbr commented 6 years ago

@CyrusNajmabadi, If you really need to do your own hash calculation in the user's code, CRC32 might work since it's a combination of table lookups and XORs (but not arithmetic that can overflow). There are some drawbacks though:

  1. CRC32 doesn't have great entropy (but it's likely still better than what Roslyn emits now).
  2. You'd need to put a 256 entry lookup table somewhere in the code or emit code to generate the lookup table.

If you're not doing it already, I'd hope you can detect the HashCode type and use that when possible since XXHash should be much better.

CyrusNajmabadi commented 6 years ago

@morganbr See https://github.com/dotnet/roslyn/pull/24161

We do the following:

  1. Use System.HashCode if it is available. Done.
  2. Otherwise, if in C#:
     2a. If not in checked-mode: Generate unrolled hash.
     2b. If in checked-mode: Generate unrolled hash, wrapped in 'unchecked{}'.
  3. Otherwise, if in VB:
     3b. If not in checked-mode: Generate unrolled hash.
     3c. If in checked-mode, but has access to System.ValueTuple: Generate Return (a, b, c, ...).GetHashCode()
     3d. If in checked-mode without access to System.ValueTuple: Generate unrolled hash, but add a comment in VB that overflows are very likely.

It's '3d' that's really unfortunate. Basically, someone using VB but not using ValueTuple or a recent System, will not be able to use us to get a reasonable hash algorithm generated for them.

> You'd need to put a 256 entry lookup table somewhere in the code

This would be completely unpalatable :)

morganbr commented 6 years ago

Is table-generation code also unpalatable? At least going by Wikipedia's example, it's not much code (but it still has to go somewhere in the user's source).
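
For a sense of scale, here is a minimal sketch of table generation plus lookup, assuming the common reflected CRC-32 with polynomial 0xEDB88320 (the variant shown in Wikipedia's example). The class name is hypothetical, and this is not code that Roslyn emits. The hashing itself uses only shifts, XORs, masks, and table lookups; the only additions are loop counters that stay below 256.

```csharp
using System;

// Hypothetical sketch of a table-driven CRC-32 (reflected, polynomial 0xEDB88320).
public static class Crc32Sketch
{
    private static readonly uint[] Table = BuildTable();

    // Generates the 256-entry lookup table instead of embedding it as a literal.
    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (int i = 0; i < 256; i++)
        {
            uint entry = (uint)i;
            for (int bit = 0; bit < 8; bit++)
            {
                entry = (entry & 1) != 0 ? 0xEDB88320u ^ (entry >> 1) : entry >> 1;
            }
            table[i] = entry;
        }
        return table;
    }

    // Folds each byte into the running CRC with one lookup, one shift, two XORs.
    public static uint Compute(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
        {
            crc = Table[(crc ^ b) & 0xFF] ^ (crc >> 8);
        }
        return crc ^ 0xFFFFFFFFu;
    }
}
```

The result is a uint, so returning it from GetHashCode would still need a conversion to int that does not trip overflow checks.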

jnm2 commented 6 years ago

How awful would it be to add the HashCode source to the project like Roslyn does (with IL) with (the much simpler) compiler attribute class definitions when they aren't available through any referenced assembly?

CyrusNajmabadi commented 6 years ago

> How awful would it be to add the HashCode source to the project like Roslyn does with (the much simpler) compiler attribute class definitions when they aren't available through any referenced assembly?

  1. Does the HashCode source not need overflow behavior?
  2. I've skimmed the HashCode source. It's non trivial. Generating all that goop into the user's project would be pretty heavyweight.

I'm just surprised there are no good ways to get overflow math to work in VB at all :(

CyrusNajmabadi commented 6 years ago

So, at a minimum, even if we were hashing two values together, it seems like we would have to create:

            var hc1 = (uint)(value1?.GetHashCode() ?? 0); // can overflow
            var hc2 = (uint)(value2?.GetHashCode() ?? 0); // can overflow

            uint hash = MixEmptyState();
            hash += 8; // can overflow

            hash = QueueRound(hash, hc1);
            hash = QueueRound(hash, hc2);

            hash = MixFinal(hash);
            return (int)hash; // can overflow

Note that this code already has 4 lines that can overflow. It also has two helper functions you need to call (i'm ignoring MixEmptyState as that seems more like a constant). MixFinal can definitely overflow:

        private static uint MixFinal(uint hash)
        {
            hash ^= hash >> 15;
            hash *= Prime2;
            hash ^= hash >> 13;
            hash *= Prime3;
            hash ^= hash >> 16;
            return hash;
        }

as can QueueRound:

        private static uint QueueRound(uint hash, uint queuedValue)
        {
            hash += queuedValue * Prime3;
            return Rol(hash, 17) * Prime4;
        }

So i don't honestly see how this would work :(

CyrusNajmabadi commented 6 years ago

> How awful would it be to add the HashCode source to the project like Roslyn does (with IL) with (the much

How do you envision this working? What would customers write, and what would the compilers then do in response?

CyrusNajmabadi commented 6 years ago

Also, something that would address all of this is if .NET already has public helpers exposed on the surface API that convert from uint to int32 (and vice versa) without overflow.

Do those exist? If so, i can easily write the VB versions, just using these for the situations where we need to go between the types without overflowing.
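
One candidate, offered here only as a possibility and not something the thread confirms, is a round trip through BitConverter, which copies the raw 32 bits instead of performing a numeric conversion. The wrapper class and method names below are hypothetical; BitConverter.GetBytes and BitConverter.ToInt32/ToUInt32 are existing framework methods. Each call allocates a 4-byte array.

```csharp
using System;

// Hypothetical wrappers that reinterpret 32 bits without a checked conversion.
public static class BitReinterpret
{
    // uint -> int with the same bit pattern, e.g. 0xFFFFFFFF becomes -1.
    public static int ToInt32(uint value) =>
        BitConverter.ToInt32(BitConverter.GetBytes(value), 0);

    // int -> uint with the same bit pattern, e.g. -1 becomes 0xFFFFFFFF.
    public static uint ToUInt32(int value) =>
        BitConverter.ToUInt32(BitConverter.GetBytes(value), 0);
}
```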

CyrusNajmabadi commented 6 years ago

> Is table-generation code also unpalatable?

I would think so. I mean, think about this from a customer perspective. They just want a decent GetHashCode method that is nicely self contained and gives reasonable results. Having that feature go and bloat up their code with auxiliary crap is going to be pretty unpleasant. It's also pretty bad given that the C# experience will be just fine.

morganbr commented 6 years ago

You might be able to get roughly the right overflow behavior by casting to and from some combination of signed and unsigned 64-bit types. Something like this (untested and I don't know VB casting syntax):

Dim hashCode = -252780983
hashCode = (Int32)((Int32)((UInt64)hashCode * -1521134295) + (UInt64)i.GetHashCode())

CyrusNajmabadi commented 6 years ago

How do you know the following doesn't overflow?

(Int32)((UInt64)hashCode * -1521134295)

Or the final (int32) cast for that matter?

morganbr commented 6 years ago

I didn't realize it would use overflow-checked conv operations. I guess you could mask it down to 32 bits before casting:

(Int32)(((UInt64)hashCode * -1521134295) & 0xFFFFFFFF)

CyrusNajmabadi commented 6 years ago

presumably 31 bits, as a value of uint32.Max would also overflow on conversion to Int32 :)

That's def possible. Ugly... but possible :) There's gunna be a lot of casts in this code.
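
The 32-bit versus 31-bit point can be checked with a small C# sketch using checked conversions, which behave like VB's default here. The class and method names are hypothetical.

```csharp
using System;

// Hypothetical demo: which mask makes a checked long -> int conversion safe?
public static class MaskDemo
{
    // Masking to 32 bits still leaves values up to 4294967295,
    // which do not fit in a signed Int32 whenever bit 31 is set.
    public static bool Mask32Throws(long value)
    {
        try
        {
            int unused = checked((int)(value & 0xFFFFFFFFL));
            return unused != unused; // always false; keeps the local in use
        }
        catch (OverflowException)
        {
            return true;
        }
    }

    // Masking to 31 bits yields 0..Int32.MaxValue, so the conversion cannot throw.
    public static int Mask31(long value)
    {
        return checked((int)(value & 0x7FFFFFFFL));
    }
}
```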

CyrusNajmabadi commented 6 years ago

Ok. I think i have a workable solution. The core of the algorithm we generate today is:

        hashCode = hashCode * -1521134295 + j.GetHashCode();

Let's say that we're doing 64-bit math, but "hashCode" has been capped to 32 bits. Then <largest_32_bit> * -1521134295 + <largest_32_bit> will not overflow 64 bits. So we can always do the math in 64 bits, then clamp down to 32 (or 31) bits to ensure that the next round won't overflow.

CyrusNajmabadi commented 6 years ago

Thanks!

CyrusNajmabadi commented 6 years ago

@MaStr11 @morganbr @sharwell and everyone here. I've updated my code to generate the following for VB:

        Dim hashCode As Long = 2118541809
        hashCode = (hashCode * -1521134295 + a.GetHashCode()) And Integer.MaxValue
        hashCode = (hashCode * -1521134295 + b.GetHashCode()) And Integer.MaxValue
        Return CType(hashCode And Integer.MaxValue, Integer)

Can someone sanity check me to make sure that this makes sense and should not overflow even with checked mode on?

morganbr commented 6 years ago

@CyrusNajmabadi, that won't overflow (because Int32.Max * Int32.Max is still less than Int64.Max, and your constants are smaller than Int32.Max) but you're masking the high bit to zero, so it's only a 31-bit hash. Is leaving the high bit on considered an overflow?

jnm2 commented 6 years ago

@CyrusNajmabadi hashCode is a Long that can be anywhere from 0 to Integer.MaxValue. Why am I getting this?

[image not available]

But no, it can't actually overflow.

jnm2 commented 6 years ago

Btw- I'd rather have Roslyn add a NuGet package than add a suboptimal hash.

CyrusNajmabadi commented 6 years ago

> but you're masking the high bit to zero, so it's only a 31-bit hash. Is leaving the high bit on considered an overflow?

That's a good point. I think i was thinking about another algorithm that was using uints. So in order to safely convert from the long to a uint, i needed to not include the sign bit. However, as this is all signed math, i think it would be fine to just mask against 0xffffffff ensuring we only keep the bottom 32bit after adding each entry.

CyrusNajmabadi commented 6 years ago

> I'd rather have Roslyn add a NuGet package than add a suboptimal hash.

Users can already do that if they want. This is about what to do when users do not, or cannot, add those dependencies. This is also about providing a reasonably 'good enough' hash for users, i.e. something better than the common "x + y + z" approach that people often take. It's not intended to be 'optimal' because there's no good definition of what 'optimal' is when it comes to hashing for all users. Note that the approach we're taking here is the one already emitted by the compiler for anonymous types. It exhibits reasonably good behavior while not adding a ton of complexity to the user's code. Over time, as more and more users are able to move forward, such code can slowly disappear and be replaced with HashCode.Combine for most people.

CyrusNajmabadi commented 6 years ago

So i worked at it a bit and came up with the following that i think addresses all concerns:

        Dim hashCode As Long = 2118541809
        hashCode = (hashCode * -1521134295 + a.GetHashCode()).GetHashCode()
        hashCode = (hashCode * -1521134295 + b.GetHashCode()).GetHashCode()
        Return CType(hashCode, Integer)

The piece that's interesting is specifically calling .GetHashCode() on the int64 value produced by (hashCode * -1521134295 + a.GetHashCode()). Calling .GetHashCode on this 64 bit value has two good properties for our needs. First, it ensures that hashCode only ever stores a legal int32 value in it (which makes the final returning cast always safe to perform). Second, it ensures that we don't lose any valuable information in the upper 32bits of the int64 temp value we're working with.
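
The same idea can be transcribed to C# inside a checked block to sanity check the reasoning. This is a sketch with hypothetical names, not the code Roslyn generates. After each round the value fits in an Int32, so the next multiplication stays below roughly 2^31 * 1521134295 (about 3.3e18), which is within the Int64 range. How Int64.GetHashCode folds the upper and lower halves together is an implementation detail of the runtime, so the exact values should not be relied on.

```csharp
using System;

// Hypothetical C# transcription of the VB pattern, with overflow checks on.
public static class FoldedHash
{
    public static int Compute(int a, int b)
    {
        checked
        {
            long hashCode = 2118541809;
            // 64-bit multiply-add, then fold back to 32 bits via Int64.GetHashCode().
            hashCode = (hashCode * -1521134295 + a.GetHashCode()).GetHashCode();
            hashCode = (hashCode * -1521134295 + b.GetHashCode()).GetHashCode();
            // Safe: hashCode holds a value that came from an int.
            return (int)hashCode;
        }
    }
}
```

Calling FoldedHash.Compute(int.MaxValue, int.MinValue) exercises the extreme inputs without raising an OverflowException.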

jnm2 commented 6 years ago

@CyrusNajmabadi Actually offering to install the package is what I was asking about. Saves me from having to do it.

CyrusNajmabadi commented 6 years ago

If you type HashCode, then if System.HashCode is provided in an MS nuget package, then Roslyn will offer it.

jnm2 commented 6 years ago

I want it to generate the nonexistent GetHashCode override and install the package in the same operation.

CyrusNajmabadi commented 6 years ago

I don't think that's an appropriate choice for most users. Adding dependencies is a very heavyweight operation that users should not be forced into. Users can decide the right time to make those choices, and the IDE will respect it. That's been the approach we've taken with all our features up to now, and it's been a healthy one that people seem to like.

CyrusNajmabadi commented 6 years ago

Note: what nuget package is this api even being included in for us to add a reference to?

morganbr commented 6 years ago

The implementation is in System.Private.CoreLib.dll, so it would come as part of the runtime package. The contract is System.Runtime.dll.

CyrusNajmabadi commented 6 years ago

Ok. If that's the case, then it sounds like a user would get this if/when they move to a more recent Target Framework. That sort of thing is not at all a step i would have the "generate equals+hashcode" do to a user's project.