JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.44k stars 5.46k forks source link

Equality and hash for self-referential recursive data-structures: Go Hopcroft? #26020

Open chethega opened 6 years ago

chethega commented 6 years ago

hash, isequal and == are currently unable to handle self-referential recursive structures:

A=Any[]; push!(A,A); B=deepcopy(A);
isequal(A,B)
A==A
hash(A)

all give StackOverflowError.

What should we do?

We already have perfectly fine non-recursive isequal for everything but Array and Tuple; here I'll just discuss Array. Two arrays L and R are different under isequal if and only if one of there exists an index sequence i_1,..., i_n such that isequal(L[i_1][i_2]...[i_n], R[i_1][i_2]...[i_n])=false.

In other words: L and R are finite state machines over the alphabet Int and they are equality means "they accept the same language" (ok, strictly speaking they don't "accept", they are maps from integer sequences to Union{non_recursive, internal, out-of-bounds}).

There are various O(N) algorithms for testing equality of DFAs. We could define the hash for recursive structures by e.g. replacing each x with "recursive -k", where k>0 is minimal with x === y[i_1][i_2]...[i_k] and isequal(x,y). Hash computation should be O(N log(N)) using something like Hopcroft.

In principle, overhead for typical non-self-referential objects should be pretty small: If the types tell us that we are non-self-referential, then there is no overhead at all; otherwise, one would probably start and check self-referentiality on the way, and only switch to fancy union-find once we found the first loop.

So, what do you think?

(1) Is this the right notion of equality? (2) If reasonably fast code existed, would it be worth the complexity to maintain/merge it? (3) Is this worth the effort to implement? (4) Could this be done post 1.0? Would moving from (maybe unreliable?) StackOverflowError to well-defined hash and equality be a breaking change? (5) If this is a good plan for self-referentials, then there should be a good way for user-defined equality and hash to tie into the DFA equality/minimization algorithm; hence related to https://github.com/JuliaLang/julia/issues/4648. This could be easily done with an essence.

I don't propose to change the default isequal for user-defined types; that's a separate discussion.

JeffBezanson commented 6 years ago

Yes I think it would be nice to implement this at some point. It could lead to hash values changing, but otherwise I believe is non-breaking and appropriate for a 1.x release.

Somewhat amusingly, femtolisp implements the union-find algorithm for this already (https://github.com/JeffBezanson/femtolisp/blob/master/equal.c).

StefanKarpinski commented 6 years ago

It could lead to hash values changing,

To clarify, this means that hash values would change from Julia 1.x to 1.(x+1) right? Not that hash values would change during a running process.

chethega commented 6 years ago

Somewhat amusingly, femtolisp implements the union-find algorithm for this already. Nice!

To clarify, this means that hash values would change from Julia 1.x to 1.(x+1) right? Not that hash values would change during a running process.

The way I thought about it, the only change would be some hash values changing from StackOverflowError to some UInt64, in the update 1.x to 1.(x+1), and some objects that currently compare as StackOverflowError becoming either equal or non-equal in the update 1.x to 1.(x+1). [I think Jeff has the same maybe-plan for the maybe-eventual change, because there is no real choice what to return]

The price is that comparison/hashing becomes more complex and slightly slower for trees (DAGs would become maybe faster / maybe slower depending on details, and cyclic structures would be log-linear instead of StackOverflowError)

Even if this is considered non-breaking and postponed into the far future, then one should still probably keep this plan in mind for the API for user-defined types with non-default hash and equality: There needs to be some kind of hook to go from the user code into the future Base code that handles all the union-find fun (otherwise the cyclic user-type won't profit from the eventual enhancement).

StefanKarpinski commented 6 years ago

We'll document that hash values may change in minor releases – people should not depend on hash values remaining the same or on dictionaries being iterated in the same order even if the same keys are inserted in the same order. We probably won't want to change this in patch version for the sake of being conservative but specific hash values are definitely not part of the stable API.

JeffBezanson commented 6 years ago

this means that hash values would change from Julia 1.x to 1.(x+1) right?

Yes.