Closed jamesqo closed 4 years ago
Question for the participants here. The Roslyn IDE allows users to generate a GetHashCode impl based on a set of fields/properties in their class/struct. Ideally, people could use the new HashCode.Combine that was added in https://github.com/dotnet/corefx/pull/25013. However, some users will not have access to that code. So, we'd like to still be able to generate a GetHashCode that will work for them.
Recently, it came to our attention that the form we generate is problematic. Namely, VB compiles with overflow checks on by default, and our impl will cause overflows. Also, VB has no way to disable overflow checks for a region of code. It's either on or off for the entire assembly.
Because of this, i'd love to be able to replace the impl we provide with a form that doesn't suffer from these problems. Ideally, the form generated would have the following properties:
a + b + c + d or a ^ b ^ c ^ d.
For example, one option for VB would be to generate something like:
return (a, b, c, d).GetHashCode()
But this then depends on having a reference to System.ValueTuple. Ideally, we could have an impl that works even in the absence of that.
Does anyone know about a decent hashing algorithm that can work with these constraints? Thanks!
--
Note: our existing emitted code is:
Dim hashCode = -252780983
hashCode = hashCode * -1521134295 + i.GetHashCode()
hashCode = hashCode * -1521134295 + j.GetHashCode()
Return hashCode
This clearly can overflow.
This is also not a problem for C#, as we can just add unchecked { } around that code. That fine-grained control is not possible in VB.
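To make the checked/unchecked difference concrete, here is a minimal C# sketch of the same arithmetic in both contexts. The class and method names are hypothetical, and plain ints stand in for i.GetHashCode() and j.GetHashCode(); this is an illustration, not the code Roslyn emits.

```csharp
using System;

public static class OverflowDemo
{
    // Same arithmetic as the emitted VB, with overflow checks on (VB's default).
    public static int CombineChecked(int i, int j)
    {
        checked
        {
            int hashCode = -252780983;
            // The first multiplication already exceeds the Int32 range,
            // so this line throws OverflowException.
            hashCode = hashCode * -1521134295 + i;
            hashCode = hashCode * -1521134295 + j;
            return hashCode;
        }
    }

    // Same arithmetic with overflow checks off: results wrap silently.
    public static int CombineUnchecked(int i, int j)
    {
        unchecked
        {
            int hashCode = -252780983;
            hashCode = hashCode * -1521134295 + i;
            hashCode = hashCode * -1521134295 + j;
            return hashCode;
        }
    }
}
```

The checked variant fails on the very first round regardless of the inputs, because the product of the two constants alone is far outside the 32-bit range.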
Does anyone know about a decent hashing algorithm that can work with these constraints? Thanks!
Well, you could do Tuple.Create(...).GetHashCode(). Obviously that incurs allocations, but it seems better than throwing an exception.
Is there any reason you can't just tell the user to install System.ValueTuple? Since it's a built-in language feature, I'm sure the System.ValueTuple package is very compatible with basically all platforms, right?
Obviously that incurs allocations, but it seems better than throwing an exception.
Yes, it would be nice to not have it cause allocations.
Is there any reason you can't just tell the user to install System.ValueTuple?
That would be the behavior if we generate the ValueTuple approach. However, again, it would be nice if we could just generate something good that fits with the way the user has currently structured their code, without making them change their structure in a heavyweight way.
It really does seem like VB users should have a way to address this problem in a reasonable manner :) But such an approach is eluding me :)
@CyrusNajmabadi, If you really need to do your own hash calculation in the user's code, CRC32 might work since it's a combination of table lookups and XORs (but not arithmetic that can overflow). There are some drawbacks though:
If you're not doing it already, I'd hope you can detect the HashCode type and use that when possible since XXHash should be much better.
@morganbr See https://github.com/dotnet/roslyn/pull/24161
We do the following:
Return (a, b, c, ...).GetHashCode()
3d. If in checked-mode without access to System.ValueTuple. Generate unrolled hash, but add a comment in VB that overflows are very likely.
It's '3d' that's really unfortunate. Basically, someone using VB but not using ValueTuple or a recent System, will not be able to use us to get a reasonable hash algorithm generated for them.
You'd need to put a 256 entry lookup table somewhere in the code
This would be completely unpalatable :)
Is table-generation code also unpalatable? At least going by Wikipedia's example, it's not much code (but it still has to go somewhere in the user's source).
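For a sense of how much code the table-generation route involves, here is a rough C# sketch of a table-driven CRC-32 using the common reflected polynomial 0xEDB88320. The class and method names are hypothetical. It uses only shifts, XORs and masks on uint, so nothing in it depends on overflow checking being off.

```csharp
using System;

public static class Crc32Sketch
{
    // 256-entry lookup table, built once when the type is first used.
    private static readonly uint[] Table = BuildTable();

    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint crc = i;
            for (int bit = 0; bit < 8; bit++)
            {
                // Shift right; XOR in the polynomial when the low bit was set.
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
            }
            table[i] = crc;
        }
        return table;
    }

    public static uint Compute(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
        {
            // Mask instead of casting to byte, so no checked conversion is involved.
            crc = Table[(crc ^ b) & 0xFF] ^ (crc >> 8);
        }
        return crc ^ 0xFFFFFFFFu;
    }
}
```

Note that the result is a uint, so turning it into the Integer that GetHashCode must return still needs an overflow-free conversion.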
How awful would it be to add the HashCode source to the project like Roslyn does (with IL) with (the much simpler) compiler attribute class definitions when they aren't available through any referenced assembly?
How awful would it be to add the HashCode source to the project like Roslyn does with (the much simpler) compiler attribute class definitions when they aren't available through any referenced assembly?
I'm just surprised there are no good ways to get overflow math to work in VB at all :(
So, at a minimum, even if we were hashing two values together, it seems like we would have to create:
var hc1 = (uint)(value1?.GetHashCode() ?? 0); // can overflow
var hc2 = (uint)(value2?.GetHashCode() ?? 0); // can overflow
uint hash = MixEmptyState();
hash += 8; // can overflow
hash = QueueRound(hash, hc1);
hash = QueueRound(hash, hc2);
hash = MixFinal(hash);
return (int)hash; // can overflow
Note that this code already has 4 lines that can overflow. It also has two helper functions you need to call (i'm ignoring MixEmptyState as that seems more like a constant). MixFinal can definitely overflow:
private static uint MixFinal(uint hash)
{
    hash ^= hash >> 15;
    hash *= Prime2;
    hash ^= hash >> 13;
    hash *= Prime3;
    hash ^= hash >> 16;
    return hash;
}
as can QueueRound:
private static uint QueueRound(uint hash, uint queuedValue)
{
    hash += queuedValue * Prime3;
    return Rol(hash, 17) * Prime4;
}
So i don't honestly see how this would work :(
How awful would it be to add the HashCode source to the project like Roslyn does (with IL) with (the much
How do you envision this working? What would customers write, and what would the compilers then do in response?
Also, something that would address all of this is if .Net already has public helpers exposed on the surface API that convert from uint to int32 (and vice versa) without overflow.
Do those exist? If so, i can easily write the VB versions, just using these for the situations where we need to go between the types without overflowing.
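One existing option, as far as I know, is to round-trip through System.BitConverter, which copies the raw bytes rather than performing a numeric conversion. The sketch below uses hypothetical wrapper names; note that BitConverter.GetBytes allocates a 4-byte array on every call, so this does not avoid allocations.

```csharp
using System;

public static class BitCastSketch
{
    // Reinterprets the 32 bits of a uint as an int; no range check is involved.
    public static int ToInt32Bits(uint value)
    {
        return BitConverter.ToInt32(BitConverter.GetBytes(value), 0);
    }

    // Reinterprets the 32 bits of an int as a uint.
    public static uint ToUInt32Bits(int value)
    {
        return BitConverter.ToUInt32(BitConverter.GetBytes(value), 0);
    }
}
```

For example, ToInt32Bits(uint.MaxValue) gives -1 instead of throwing, because all 32 bits are carried over unchanged.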
Is table-generation code also unpalatable?
I would think so. I mean, think about this from a customer perspective. They just want a decent GetHashCode method that is nicely self contained and gives reasonable results. Having that feature go and bloat up their code with auxiliary crap is going to be pretty unpleasant. It's also pretty bad given that the C# experience will be just fine.
You might be able to get roughly the right overflow behavior by casting to and from some combination of signed and unsigned 64-bit types. Something like this (untested and I don't know VB casting syntax):
Dim hashCode = -252780983
hashCode = (Int32)((Int32)((UInt64)hashCode * -1521134295) + (UInt64)i.GetHashCode())
How do you know the following doesn't overflow?
(Int32)((UInt64)hashCode * -1521134295)
Or the final (int32) cast for that matter?
I didn't realize it would use overflow-checked conv operations. I guess you could mask it down to 32 bits before casting:
(Int32)(((UInt64)hashCode * -1521134295) & 0xFFFFFFFF)
Presumably 31 bits, as a value of UInt32.Max would also overflow on conversion to Int32 :)
That's def possible. Ugly... but possible :) There's gunna be a lot of casts in this code.
Ok. I think i have a workable solution. The core of the algorithm we generate today is:
hashCode = hashCode * -1521134295 + j.GetHashCode();
Let's say that we're doing 64-bit math, but "hashCode" has been capped to 32 bits. Then <largest_32_bit> * -1521134295 + <largest_32_bit> will not overflow 64 bits. So we can always do the math in 64 bits, then clamp down to 32 (or 31) bits to ensure that the next round won't overflow.
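A rough bound behind that claim, assuming the running value h and the member hash x each fit in a signed 32-bit integer, and writing c for the multiplier -1521134295 (the symbols h, x and c are introduced here only for illustration):

```latex
|h| \le 2^{31}, \qquad |c| = 1521134295 < 2^{31}, \qquad |x| \le 2^{31}
\;\Longrightarrow\;
|h \cdot c + x| \;\le\; 2^{31} \cdot 2^{31} + 2^{31} \;=\; 2^{62} + 2^{31} \;<\; 2^{63}
```

So every intermediate result stays inside the signed 64-bit range, as long as the value is brought back down to 32 bits before the next round.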
Thanks!
@MaStr11 @morganbr @sharwell and everyone here. I've updated my code to generate the following for VB:
Dim hashCode As Long = 2118541809
hashCode = (hashCode * -1521134295 + a.GetHashCode()) And Integer.MaxValue
hashCode = (hashCode * -1521134295 + b.GetHashCode()) And Integer.MaxValue
Return CType(hashCode And Integer.MaxValue, Integer)
Can someone sanity check me to make sure that this makes sense and should not overflow even with checked mode on?
@CyrusNajmabadi, that won't overflow (because Int32.Max * Int32.Max is still below Int64.Max, and none of your operands exceed 32 bits), but you're masking the high bit to zero, so it's only a 31-bit hash. Is leaving the high bit on considered an overflow?
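A C# rendering of the masked VB snippet above, wrapped in a checked block to mimic VB's default, can help with the sanity check. This is a sketch under assumptions: the class and method names are hypothetical, and plain ints stand in for a.GetHashCode() and b.GetHashCode().

```csharp
using System;

public static class MaskedHashSketch
{
    public static int Combine(int a, int b)
    {
        checked
        {
            long hashCode = 2118541809;
            // Each round runs in 64 bits, then keeps only the low 31 bits,
            // so the running value is always in [0, Int32.MaxValue].
            hashCode = (hashCode * -1521134295 + a) & int.MaxValue;
            hashCode = (hashCode * -1521134295 + b) & int.MaxValue;
            // The checked conversion cannot fail: the value fits in an int.
            return (int)(hashCode & int.MaxValue);
        }
    }
}
```

Because the sign bit is always masked off, the result is never negative, which is the 31-bit limitation pointed out in the reply.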
@CyrusNajmabadi hashCode is a Long that can be anywhere from 0 to Integer.MaxValue. Why am I getting this?
But no, it can't actually overflow.
Btw- I'd rather have Roslyn add a NuGet package than add a suboptimal hash.
but you're masking the high bit to zero, so it's only a 31-bit hash. Is leaving the high bit on considered an overflow?
That's a good point. I think i was thinking about another algorithm that was using uints. So in order to safely convert from the long to a uint, i needed to not include the sign bit. However, as this is all signed math, i think it would be fine to just mask against 0xffffffff ensuring we only keep the bottom 32bit after adding each entry.
I'd rather have Roslyn add a NuGet package than add a suboptimal hash.
Users can already do that if they want. This is about what to do when users do not, or cannot, add those dependencies. This is also about providing a reasonably 'good enough' hash for users, i.e. something better than the common "x + y + z" approach that people often take. It's not intended to be 'optimal' because there's no good definition of what 'optimal' is when it comes to hashing for all users. Note that the approach we're taking here is the one already emitted by the compiler for anonymous types. It exhibits reasonably good behavior while not adding a ton of complexity to the user's code. Over time, as more and more users are able to move forward, such code can slowly disappear and be replaced with HashCode.Combine for most people.
So i worked at it a bit and came up with the following that i think addresses all concerns:
Dim hashCode As Long = 2118541809
hashCode = (hashCode * -1521134295 + a.GetHashCode()).GetHashCode()
hashCode = (hashCode * -1521134295 + b.GetHashCode()).GetHashCode()
Return CType(hashCode, Integer)
The piece that's interesting is specifically calling .GetHashCode() on the int64 value produced by (hashCode * -1521134295 + a.GetHashCode()). Calling .GetHashCode on this 64 bit value has two good properties for our needs. First, it ensures that hashCode only ever stores a legal int32 value in it (which makes the final returning cast always safe to perform). Second, it ensures that we don't lose any valuable information in the upper 32bits of the int64 temp value we're working with.
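The same idea can be sketched in C# inside a checked block, to mimic VB's default overflow checking. The class and method names are hypothetical, and plain ints stand in for a.GetHashCode() and b.GetHashCode(); how Int64.GetHashCode folds the upper and lower halves is left to the runtime and not assumed here.

```csharp
using System;

public static class FoldedHashSketch
{
    public static int Combine(int a, int b)
    {
        checked
        {
            long hashCode = 2118541809;
            // The 64-bit product plus a 32-bit value stays within the Int64 range;
            // GetHashCode() on the long then hands back an Int32, so the running
            // value is always a legal int before the next round.
            hashCode = (hashCode * -1521134295 + a).GetHashCode();
            hashCode = (hashCode * -1521134295 + b).GetHashCode();
            // Safe even when checked: the stored value came from an int.
            return (int)hashCode;
        }
    }
}
```

Unlike the masked variant, nothing here forces the sign bit to zero, so the full 32-bit range remains available.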
@CyrusNajmabadi Actually offering to install the package is what I was asking about. Saves me from having to do it.
If you type HashCode, then if System.HashCode is provided in an MS nuget package, then Roslyn will offer it.
I want it to generate the nonexistent GetHashCode overload and install the package in the same operation.
I don't think that's an appropriate choice for most users. Adding dependencies is a very heavyweight operation that users should not be forced into. Users can decide the right time to make those choices, and the IDE will respect it. That's been the approach we've taken with all our features up to now, and it's been a healthy one that people seem to like.
Note: what nuget package is this api even being included in for us to add a reference to?
The implementation is in System.Private.CoreLib.dll, so it would come as part of the runtime package. The contract is System.Runtime.dll.
Ok. If that's the case, then it sounds like a user would get this if/when they move to a more recent Target Framework. That sort of thing is not at all a step i would have the "generate equals+hashcode" do to a user's project.
Update 6/16/17: Looking for volunteers
The API shape has been finalized. However, we're still deciding on the best hash algorithm out of a list of candidates to use for the implementation, and we need someone to help us measure the throughput/distribution of each algorithm. If you'd like to take that role up, please leave a comment below and @karelz will assign this issue to you.
Update 6/13/17: Proposal accepted!
Here's the API that was approved by @terrajobst at https://github.com/dotnet/corefx/issues/14354#issuecomment-308190321:
The original text of this proposal follows.
Rationale
Generating a good hash code should not require use of ugly magic constants and bit twiddling on our code. It should be less tempting to write a bad-but-concise GetHashCode implementation such as
Proposal
We should add a HashCode type to encapsulate hash code creation and avoid forcing devs to get mixed up in the messy details. Here is my proposal, which is based off of https://github.com/dotnet/corefx/issues/14354#issuecomment-305019329, with a few minor revisions.
Remarks
See @terrajobst's comment at https://github.com/dotnet/corefx/issues/14354#issuecomment-305019329 for the goals of this API; all of his remarks are valid. I would like to point out these ones in particular, however: