JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.14k stars 5.43k forks source link

Base.PersistentDict/HAMT: rehashing is ineffective #54740

Open xitology opened 1 month ago

xitology commented 1 month ago

I'm looking at the implementation of PersistentDict and HAMT in https://github.com/JuliaLang/julia/blob/master/base/hamt.jl. This implementation attempts to avoid hash collisions by using rehashing, but I believe it does not provide the intended effect.

A comment in the code claims Perfect hash collisions should not occur in practice since we perform rehashing after using 55 bits (MAX_SHIFT) of hash. Here's how rehashing is explained in Bagwell's paper: The algorithm requires that the hash can be extended to an arbitrary number of bits. This was accomplished by rehashing the key combined with an integer representing the trie level, zero being the root. Hence if two keys do give the same initial hash then the rehash has a probability of 1 in 232 of a further collision. However the Julia implementation rehashes not the original key, but the previous hash value. If two hashes collide, so will the rehashed hashes.

Here's a test case:

julia> versioninfo()
Julia Version 1.12.0-DEV.689
Commit 94a0ee8637b (2024-06-08 14:18 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 PRO 6850U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-17.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

julia> const KEY1 = 0;

julia> const KEY2 = 0xeb8449dc1393171b;

julia> @assert objectid(KEY1) == objectid(KEY2)

julia> Base.PersistentDict{Any,Any}(KEY1 => true, KEY2 => false)
ERROR: Perfect hash collision
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:44
 [2] insert!
   @ ./hamt.jl:207 [inlined]
 [3] _keyvalueset
   @ ./dict.jl:879 [inlined]
 [4] set(dict::Base.PersistentDict{Any, Any}, key::UInt64, val::Bool)
   @ Base ./dict.jl:874
 [5] PersistentDict
   @ ./dict.jl:951 [inlined]
 [6] Base.PersistentDict{Any, Any}(KV::Pair{Int64, Bool}, rest::Pair{UInt64, Bool})
   @ Base ./dict.jl:957
 [7] top-level scope
   @ REPL[8]:1

Note that IdDict can handle such keys:

julia> IdDict{Any,Any}(KEY1 => true, KEY2 => false)
IdDict{Any, Any} with 2 entries:
  0                  => true
  0xeb8449dc1393171b => false
vchuravy commented 1 month ago

I will need to think about what the right answer is here. Before #52193 we indeed used the key itself here.

But yes an objectid collision does cause a perfect hash collision. I am kinda curious how IdDict survives that.

xitology commented 1 month ago

I will need to think about what the right answer is here. Before #52193 we indeed used the key itself here.

Even with pre-#52193 implementation rehashing was ineffective for some key types, e.g. Symbols. That's because hash(s::Symbol) is defined as objectid(s) and hash(s, h) depends only on objectid(s): https://github.com/JuliaLang/julia/blob/master/base/hashing.jl#L38-L40

vchuravy commented 1 month ago

So looking at IDdict it also "just" uses objectid and then uses the typical probe + egal check and grow the table on conflict.

For HAMT that statregy wouldn't work. We would probably need to introduce a "PerfectConflict" node with a linear probe, but as you noted with Symbols this is a general property of using objectid as source of the hash.

oscardssmith commented 1 month ago

I'm pretty sure it is a bug if objectid has collisions. It seems like something that could be provably made to cause miscompiles.