Closed NHDaly closed 1 year ago
(The actual use-case where this was coming up for us at RelationalAI is that we have two different AST representations for the different phases of our compiler (a Front IR and a Back IR), and some of the nodes in the two ASTs share the same names and representation, but are defined in different modules. Using AutoHashEquals, they'd end up comparing `!=` but hashing the same.)
This violates the hashing rules
It does not; hash equality cannot imply `==` or `isequal` equality without perfect hashing. Collisions will always exist, and will be handled with `isequal` in `Set`s and `Dict`s.
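To make this concrete, here's a minimal sketch (the module and struct names are hypothetical, and the hand-written `hash` methods only mimic what AutoHashEquals would generate from the struct name and fields) showing that two keys whose hashes collide, but which are not `==`, still occupy separate entries in a `Dict`:

```julia
module Front
    struct Decl; name::Symbol; end
end

module Back
    struct Decl; name::Symbol; end
end

# Colliding hash methods: both hash only the symbol :Decl and the fields.
Base.hash(d::Front.Decl, h::UInt) = hash(:Decl, hash(d.name, h))
Base.hash(d::Back.Decl, h::UInt)  = hash(:Decl, hash(d.name, h))

a = Front.Decl(:foo)
b = Back.Decl(:foo)

hash(a) == hash(b)   # true: the hashes collide
a == b               # false: values of different types are not ==

d = Dict{Any,Int}(a => 1, b => 2)
length(d)            # 2: the Dict falls back to isequal, so both keys survive
d[a], d[b]           # (1, 2)
```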
Oh! Good point. I forgot about that.
Oh, thanks, you're right, I just checked and indeed the two same-name, same-fields AST nodes still work just fine in `Dict`s. Makes sense, thanks @iamed2! :)
Is it still worth making a change like this to decrease the rate of collisions? I assume collisions make insertions/deletions/accesses more expensive?
(Also, hashing by the Type object itself somehow would also address #13)
Collisions do make access more expensive, but not by much, and how much depends on the addressing scheme.
Is there ever a situation where you'd have a lot of collisions in one dictionary? My intuition is that these types of colliding items would be unlikely to be stored in the same Set/Dict.
Hmm. I have to think about it.
We're moving basically everything in our system to be implemented on top of https://github.com/RelationalAI-oss/Salsa.jl, to get fully incremental performance everywhere. One of the consequences of this is that we're memoizing a bunch of functions in our system, and storing them all in big, versioned dictionaries that map from the function's arguments to its result.
So it is indeed possible that we'll see some of these things show up in the same spot in the same dictionary; e.g. imagine memoizing a simple function like `magnitude_squared(point) = point.x^2 + point.y^2`, which one might call on either type of `Point`.
Or for a more real example, getting the name of a declaration node in our AST:
```julia
@derived name(decl #= Front.Decl or Back.Decl =#) = decl.name
```
So then the cache dictionary would indeed have either one of these types as the key.
Still, you're probably right that this kind of situation wouldn't occur very often. I am definitely less concerned than I was when I first opened the issue. Thanks for thinking this through with me.
I'm not sure how to trade-off the extra time spent hashing vs the potential saved time in handling collisions.
@NHDaly Is this issue actionable? If not, can you please close it?
Recent versions of this package provide a way to address this. Use the option `typearg=true` after the name of the macro `@auto_hash_equals`, and the type itself (including its containing package) will be included in the hash code. It is computed in a way that is stable between runs and between Julia versions.
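For example, a sketch of the option in use (the struct and module names are illustrative):

```julia
using AutoHashEquals

module Front
    using AutoHashEquals
    # typearg=true folds the fully qualified type into the hash
    @auto_hash_equals typearg=true struct Decl
        name::Symbol
    end
end

module Back
    using AutoHashEquals
    @auto_hash_equals typearg=true struct Decl
        name::Symbol
    end
end

# With the type participating in the hash, the two same-named structs
# should no longer collide (with overwhelming probability):
hash(Front.Decl(:foo)) == hash(Back.Decl(:foo))
```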
If you define two structs, in two different modules, with the same name and layout (I know, I know... I'm sorry), they'll end up hashing the same, but `==` will still return false. This violates the hashing rules. (This is actually true even if the structs look slightly different.)
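A minimal reproduction (module and struct names are hypothetical):

```julia
using AutoHashEquals

module A
    using AutoHashEquals
    @auto_hash_equals struct Point
        x::Int
        y::Int
    end
end

module B
    using AutoHashEquals
    @auto_hash_equals struct Point
        x::Int
        y::Int
    end
end

# Only the symbol :Point and the field values enter the hash,
# so the two types collide:
hash(A.Point(1, 2)) == hash(B.Point(1, 2))  # true
A.Point(1, 2) == B.Point(1, 2)              # false
```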
Would it make sense to use a richer value for the "type" part of the hash? Currently we just use the symbol name of the type, e.g.:
https://github.com/andrewcooke/AutoHashEquals.jl/blob/b5dd75bf2e99194b99d217f78d92147f6ed852e4/src/AutoHashEquals.jl#L69
Could we instead consider writing the `fullname` of the module plus the type's name, or actually using the type object itself or its `.uid` field or something?
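A sketch of the qualified-name idea (not the package's actual implementation; `type_seed` is a hypothetical helper): hash a tuple of the module path plus the type name, which distinguishes same-named types from different modules while staying stable across sessions:

```julia
# Fold the fully qualified name of a type into a hash.
# fullname(parentmodule(T)) returns the module path, e.g. (:Main, :Front);
# appending nameof(T) distinguishes same-named types in different modules.
function type_seed(::Type{T}, h::UInt) where {T}
    for part in fullname(parentmodule(T))
        h = hash(part, h)
    end
    return hash(nameof(T), h)
end

module Front; struct Decl; name::Symbol; end; end
module Back;  struct Decl; name::Symbol; end; end

# The seeds differ because the module paths differ:
type_seed(Front.Decl, zero(UInt)) == type_seed(Back.Decl, zero(UInt))  # false
```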