JuliaServices / AutoHashEquals.jl

A Julia macro to add == and hash() to composite types.

Structs with same name and layout in different modules will hash the same but not compare equal. #16

Closed: NHDaly closed this issue 1 year ago

NHDaly commented 4 years ago

If you define two structs in two different modules with the same name and layout (I know, I know... I'm sorry), they'll end up hashing the same, but == will still return false. This violates the hashing rules:

julia> using AutoHashEquals

julia> module A
       Main.@auto_hash_equals struct Point
          x::Int
          y::Int
       end
       end
Main.A

julia> module B
       Main.@auto_hash_equals struct Point
          x::Int
          y::Int
       end
       end
Main.B

julia> A.Point(1,2) == B.Point(1,2)
false

julia> hash(A.Point(1,2)) == hash(B.Point(1,2))
true

(This is true even if the two struct definitions differ slightly:)

julia> module A
       Main.@auto_hash_equals struct Point
          x::Int
          y::Int
       end
       end
Main.A

julia> module B
       Main.@auto_hash_equals struct Point{T}
          x::T
          y::T
       end
       end
Main.B

julia> A.Point(1,2) == B.Point(1,2)
false

julia> hash(A.Point(1,2)) == hash(B.Point(1,2))
true

Would it make sense to use a richer value for the "type" part of the hash? Currently we just use the symbol name of the type, e.g.:

            hash(a.y, hash(a.x, hash(:Point, h)))

https://github.com/andrewcooke/AutoHashEquals.jl/blob/b5dd75bf2e99194b99d217f78d92147f6ed852e4/src/AutoHashEquals.jl#L69

Could we instead hash either the fullname of the module plus the type name, i.e.:

            hash(a.y, hash(a.x, hash((fullname(@__MODULE__)..., :Point), h)))

or actually use the type object itself, or its .uid field, or something along those lines?

            hash(a.y, hash(a.x, hash(Point, h)))       # or
            hash(a.y, hash(a.x, hash(Point.uid, h)))
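
For concreteness, here is a sketch (not the macro's actual output) of roughly what the fullname-based version might generate for A.Point, with the module path spliced in at expansion time:

    # Sketch only: a hypothetical expansion for A.Point under the
    # fullname(@__MODULE__) proposal, with the module path baked in as literals.
    function Base.hash(a::A.Point, h::UInt)
        hash(a.y, hash(a.x, hash((:Main, :A, :Point), h)))
    end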
NHDaly commented 4 years ago

(The actual use-case where this was coming up for us at RelationalAI is that we have two different AST representations for the different phases of our compiler (a Front IR and a Back IR), and some of the nodes in each of the ASTs share the same names and representation, but are defined in different modules. Using AutoHashEquals, they'd end up comparing != but hash the same.)

iamed2 commented 4 years ago

This violates the hashing rules

It does not; hash equality cannot imply == or isequal equality without perfect hashing. Collisions will always exist, and will be handled with isequal in Sets and Dicts.

NHDaly commented 4 years ago

Oh! Good point. I forgot about that.

Oh, thanks, you're right. I just checked, and the two same-name, same-fields AST nodes do indeed still work fine in Dictionaries. Makes sense, thanks @iamed2! :)
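
For illustration, a minimal check along those lines (reusing the A.Point/B.Point definitions from the first example; the output shown is the expected result, not a verbatim transcript):

    julia> d = Dict(A.Point(1, 2) => "from A", B.Point(1, 2) => "from B");

    julia> length(d)   # both keys are kept despite the hash collision
    2

    julia> d[A.Point(1, 2)], d[B.Point(1, 2)]
    ("from A", "from B")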

Is it still worth making a change like this to decrease the rate of collisions, though? I assume collisions make insertions/deletions/accesses more expensive?

NHDaly commented 4 years ago

(Also, hashing by the Type object itself somehow would also address #13)

iamed2 commented 4 years ago

Collisions do make access more expensive, but not by much, and how much depends on the addressing scheme.

Is there ever a situation where you'd have a lot of collisions in one dictionary? My intuition is that these types of colliding items would be unlikely to be stored in the same Set/Dict.

NHDaly commented 4 years ago

Hmm. I have to think about it.

We're moving basically everything in our system to be implemented on top of https://github.com/RelationalAI-oss/Salsa.jl, to get fully incremental performance everywhere. One consequence is that we're memoizing a bunch of functions in our system and storing the results in big, versioned dictionaries that map from a function's arguments to its result.

So it is indeed possible that we'll see some of these things show up in the same spot in the same dictionary, e.g. imagine memoizing a simple function like magnitude_squared(point) = point.x^2 + point.y^2, which one might call on either type of Point.

Or for a more real example, getting the name of a declaration node in our AST:

@derived name(decl #= Front.Decl or Back.Decl =#) = decl.name

So then the cache dictionary would indeed have either one of these types as the key.
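
A simplified sketch of that pattern (hypothetical names, not Salsa's actual API), just to show how keys of both Point types could end up in the same cache dictionary:

    # Hypothetical memoization cache keyed on the argument value; A.Point and
    # B.Point keys hash the same but remain distinct entries.
    const MAG_CACHE = Dict{Any,Int}()

    magnitude_squared_memo(point) = get!(MAG_CACHE, point) do
        point.x^2 + point.y^2
    end

    magnitude_squared_memo(A.Point(1, 2))  # computed and cached under the A.Point key
    magnitude_squared_memo(B.Point(1, 2))  # same hash bucket, distinct key, computed separately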

STILL, you're probably right that this kind of situation wouldn't occur very often. I am definitely less concerned than I was when I first opened the issue. Thanks for thinking this through with me.

I'm not sure how to trade-off the extra time spent hashing vs the potential saved time in handling collisions.

gafter commented 1 year ago

@NHDaly Is this issue actionable? If not, can you please close it?

gafter commented 1 year ago

Recent versions of this package provide a way to address this. Use the option typearg=true after the name of the macro @auto_hash_equals, and the type itself (including its containing package) will be included in the hash code. It is computed in a way that is stable between runs and between Julia versions.
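
For example, a usage sketch based on the option described above (field names chosen to match the earlier example):

    using AutoHashEquals

    @auto_hash_equals typearg=true struct Point
        x::Int
        y::Int
    end

With the type itself folded into the hash, A.Point(1, 2) and B.Point(1, 2) defined this way in separate modules would be expected to hash differently as well as compare unequal.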