beacon-biosignals / StableHashTraits.jl

Compute hashes over any Julia object simply and reproducibly
MIT License
9 stars 3 forks source link

Type information lost for Pairs and Dicts with value type Any. #75

Open rasmushenningsson opened 2 weeks ago

rasmushenningsson commented 2 weeks ago

I'm not sure whether this is to be considered a bug or just an edge case that we have to live with. It's easy enough to work around with custom hash contexts when needed. But I thought it might be worth an issue.

Setup:

using StableHashTraits
struct Foo x::Int end
struct Bar x::Int end

The following work as expected:

julia> stable_hash(Foo(1); version=4) == stable_hash(Bar(1); version=4)
false

julia> stable_hash(Any[Foo(1)]; version=4) == stable_hash(Any[Bar(1)]; version=4)
false

julia> stable_hash(Tuple{Any}[(Foo(1),)]; version=4) == stable_hash(Tuple{Any}[(Bar(1),)]; version=4)
false

But for Pairs and Dicts with eltype Any, the distinction between Foo and Bar is lost:

julia> stable_hash(Dict{Symbol,Any}(:a=>Foo(1)); version=4) == stable_hash(Dict{Symbol,Any}(:a=>Bar(1)); version=4)
true

julia> stable_hash(Pair{Symbol,Any}(:a,Foo(1)); version=4) == stable_hash(Pair{Symbol,Any}(:a,Bar(1)); version=4)
true

julia> stable_hash(Pair{Symbol,Any}[:a=>Foo(1)]; version=4) == stable_hash(Pair{Symbol,Any}[:a=>Bar(1)]; version=4)
true

NB: The last two require #74 to run.

It boils down to hash_elements where isconcretetype(Pair{Symbol,Any}) returns true. Since hoist_type is also true, it skips type information for the elements.

rasmushenningsson commented 2 weeks ago

Here's a similar case with Vectors only. I personally think this is less of a problem, but I wanted to show it for completeness.

julia> stable_hash([Int[1]]; version=4) == stable_hash([Any[1]]; version=4)
false

julia> stable_hash(Vector{Any}[Int[1]]; version=4) == stable_hash(Vector{Any}[Any[1]]; version=4)
true

Again the problem is that isconcretetype(Vector{Any}) returns true.

ericphanson commented 2 weeks ago

hm, to me it sounds like hoist_type is auto-enabled too often when it is not "safe" to do so

rasmushenningsson commented 2 weeks ago

Yes, that sounds right. (I realize now that my Pair examples above do not hit hash_elements.)

This is how it looks like for Pair:

function transformer(::Type{<:Pair}, ::HashVersion{4})
    return Transformer(((a, b),) -> (a, b); hoist_type=true)
end

What would be a good way to set hoist_type here? Can/should we defer the decision to K and V given Pair{K,V}?

haberdashPI commented 2 weeks ago

Oooh... yeah this is a little tricky. I want to think through these examples a little bit more (but have some other pressing stuff to focus on at the moment).

I think the solution is to do something like this:

function transformer(::Type{<:Pair{K,Y}}, ::HashVersion{4}) where {K,Y}
    return Transformer(((a, b),) -> Tuple{K,Y}(a, b); hoist_type=true)
end

To maintain the abstract nature of the contained types (to avoid excess type-hoisting). I haven't thought through the second, array example yet, but I'm guessing it's a similar problem.

An added complication is it should probably only be updated in a new hash version...

haberdashPI commented 1 week ago

Having sat on this a bit and reviewed the different pieces of code, I actually think the right thing to target is not hoist_type but isconcretetype, as you originally posited @rasmushenningsson. This can happen even for types with a identity transform (such as the Vector{Any} example above).

The type hoisting is ultimately assuming that the contained type of a concrete type are themselves concrete, so the logic can't use isconcretetype, it needs a new method (is_fully_concrete??) that recursively verifies the concrete nature of both the current type and its contained type (where contained is defined by what StructType it is).

ericphanson commented 1 week ago

how slow is it if we just drop all the hoisting stuff? (maybe with a cache?) It seems to add a ton of complexity

rasmushenningsson commented 1 week ago

Something like this?

is_fully_concrete(x) = true # handles type parameters that are not types, e.g. dimensions
function is_fully_concrete(::Type{T}) where {T}
    if T isa Union
        return is_fully_concrete(T.a) && is_fully_concrete(T.b)
    end
    isconcretetype(T) || return false
    for S in T.parameters
        is_fully_concrete(S) || return false
    end
    return true
end

It might also slow things down for complicated types with layers of nested type parameters.

haberdashPI commented 1 week ago

In essence, yes! Though there are a number of details I expect to be different. We have been avoiding T.parameters since it is not documented as public, and there are a few places I have seen it recommended against. The goal is to avoid anything which could change in future julia versions. Instead I would use StructType and the public facing operations of eltype and fieldtypes to extract relevant type parameters. (We don't care about type parameters that aren't concrete if they don't correspond to fields or the eltype ).

But yes, this could slow down things for complicated types. At some point it might be good to add some benchmarks for deeply nested structures.

haberdashPI commented 1 week ago

If it was worth the lift, caching would easily turn that into a O(N) operation (but it's not clear yet to me if that's the case).

rasmushenningsson commented 1 week ago

Benchmarking seems like the way to go. There are a number of possible tweaks, e.g. is_fully_concrete could give up if the nesting is too deep.

haberdashPI commented 1 week ago

Oh... having thought through your vector example some more, I have realized something very silly:

ulia> x = Vector{Any}[Int[1]]
1-element Vector{Vector{Any}}:
 [1]

julia> x[1]
1-element Vector{Any}:
 1

So the reason the example leads to an equivalent hash is that the two values are the same.

haberdashPI commented 1 week ago

Oh wow... 🀯

I had thoroughly confused myself. This stuff is pretty tricky to reason about. It turns out that another way to fix this problem is to revert the change in #74 and replace it with:

hash_trait(::Pair) = StructTypes.OrderedStruct()

Then the examples above work as expected.

The issue here is that, as @ericphanson correctly observed, it is not actually safe to set hoist_type=true when transforming Pair into a Tuple. The problem being that it makes existing values that were in a field of type Any fall into a field (of the Tuple) of their concrete type of the value in that field. Which is exactly the reason for writing e.g. pick_fields to avoid this in the case of generating NamedTuple values of structs.

As I was reasoning through my proposed fix and the use of is_fully_concrete I realized that the same logic occurs (via the use of isconcretetype) inside of nested calls to hash_fields, so then I was confused about why we needed is_fully_concrete.

Problems only show up when hash_fields/elements doesn't get to see the types as they originally exist in the container (e.g. when calling transform).

So at the end of the day, the problem here is that the original method I wrote for transformer(::Type{<:Pair}) was wrong, fixing that problem and providing Pair an appropriate StructType means we no longer see the problems above.

Please feel free to check my reasoning here, but I'm fairly certain at the moment that it isn't possible to construct a concrete type containing e.g. Any that will produce a hash collision. #77 has a working version with the above fix.

My plan is to sit on this a little bit longer, and try to think of some counter examples.

haberdashPI commented 1 week ago

how slow is it if we just drop all the hoisting stuff? (maybe with a cache?) It seems to add a ton of complexity

The original motivation for adding type hoisting was that caching didn't improve speeds all that much, and things were quite slow compared to the baseline.

Here's a run of the benchmarks, where I've commented out the branches that type hoist in hash_fields and hash_elements:

32Γ—6 DataFrame
 Row β”‚ benchmark   hash       version    base        trait       ratio
     β”‚ SubStrin…   SubStrin…  SubStrin…  String      String      Float64
─────┼───────────────────────────────────────────────────────────────────────
   1 β”‚ structs     crc        3          59.333 ΞΌs   2.084 ms      35.1175
   2 β”‚ tuples      crc        3          59.042 ΞΌs   1.632 ms      27.6392
   3 β”‚ numbers     crc        3          29.500 ΞΌs   139.958 ΞΌs     4.74434
   4 β”‚ dicts       crc        3          1.788 ms    8.296 ms       4.64083
   5 β”‚ dataframes  crc        3          72.000 ΞΌs   289.917 ΞΌs     4.02663
   6 β”‚ strings     crc        3          1.078 ms    1.049 ms       0.973519
   7 β”‚ missings    crc        3          322.708 ΞΌs  201.333 ΞΌs     0.623886
   8 β”‚ symbols     crc        3          1.356 ms    721.000 ΞΌs     0.531694
   9 β”‚ structs     sha256     3          515.166 ΞΌs  3.117 ms       6.04983
  10 β”‚ tuples      sha256     3          506.250 ΞΌs  2.693 ms       5.31918
  11 β”‚ dicts       sha256     3          2.342 ms    10.036 ms      4.28503
  12 β”‚ numbers     sha256     3          257.666 ΞΌs  371.291 ΞΌs     1.44098
  13 β”‚ dataframes  sha256     3          519.834 ΞΌs  745.292 ΞΌs     1.43371
  14 β”‚ strings     sha256     3          2.045 ms    2.227 ms       1.08899
  15 β”‚ symbols     sha256     3          2.297 ms    2.268 ms       0.987556
  16 β”‚ missings    sha256     3          594.167 ΞΌs  529.917 ΞΌs     0.891865
  17 β”‚ structs     crc        4          58.959 ΞΌs   179.828 ms  3050.05
  18 β”‚ tuples      crc        4          59.083 ΞΌs   164.344 ms  2781.58
  19 β”‚ numbers     crc        4          30.125 ΞΌs   45.617 ms   1514.24
  20 β”‚ dataframes  crc        4          72.958 ΞΌs   94.657 ms   1297.42
  21 β”‚ missings    crc        4          325.458 ΞΌs  82.838 ms    254.528
  22 β”‚ dicts       crc        4          1.798 ms    216.961 ms   120.64
  23 β”‚ strings     crc        4          1.083 ms    44.720 ms     41.2972
  24 β”‚ symbols     crc        4          1.327 ms    22.998 ms     17.3263
  25 β”‚ structs     sha256     4          505.625 ΞΌs  233.061 ms   460.936
  26 β”‚ tuples      sha256     4          511.083 ΞΌs  204.909 ms   400.931
  27 β”‚ dataframes  sha256     4          519.417 ΞΌs  117.985 ms   227.149
  28 β”‚ numbers     sha256     4          257.667 ΞΌs  57.100 ms    221.603
  29 β”‚ missings    sha256     4          582.833 ΞΌs  114.964 ms   197.251
  30 β”‚ dicts       sha256     4          2.269 ms    265.459 ms   116.968
  31 β”‚ strings     sha256     4          2.046 ms    60.122 ms     29.3845
  32 β”‚ symbols     sha256     4          2.248 ms    47.526 ms     21.1403
haberdashPI commented 1 week ago

@ericphanson If I recall correctly, adding caching to the mix (without type hoisting) was about 3-6x faster than without it, so still roughly 100-1000x slower than with type hoisting.

rasmushenningsson commented 1 week ago

I agree that it's tricky to reason about, and easy to go down the wrong rabbit hole. :smile:

The fix to treat Pair as an OrderedStruct seems like the right one.

In the current version of PR #77 I still get a failure here:

julia> stable_hash(Vector{Any}[Int[1]]; version=4) == stable_hash(Vector{Any}[Any[1]]; version=4)
true

What's your take on that one?

haberdashPI commented 1 week ago

What's your take on that one?

See my comment above πŸ™‚

Oh... having thought through your vector example some more, I have realized something very silly:

ulia> x = Vector{Any}[Int[1]]
1-element Vector{Vector{Any}}:
 [1]

julia> x[1]
1-element Vector{Any}:
 1

So the reason the example leads to an equivalent hash is that the two values are the same.