Open rasmushenningsson opened 2 weeks ago
Here's a similar case with Vector
s only.
I personally think this is less of a problem, but I wanted to show it for completeness.
julia> stable_hash([Int[1]]; version=4) == stable_hash([Any[1]]; version=4)
false
julia> stable_hash(Vector{Any}[Int[1]]; version=4) == stable_hash(Vector{Any}[Any[1]]; version=4)
true
Again the problem is that isconcretetype(Vector{Any})
returns true.
hm, to me it sounds like hoist_type
is auto-enabled too often when it is not "safe" to do so
Yes, that sounds right. (I realize now that my Pair
examples above do not hit hash_elements
.)
This is how it looks like for Pair
:
function transformer(::Type{<:Pair}, ::HashVersion{4})
return Transformer(((a, b),) -> (a, b); hoist_type=true)
end
What would be a good way to set hoist_type
here? Can/should we defer the decision to K
and V
given Pair{K,V}
?
Oooh... yeah this is a little tricky. I want to think through these examples a little bit more (but have some other pressing stuff to focus on at the moment).
I think the solution is to do something like this:
function transformer(::Type{<:Pair{K,Y}}, ::HashVersion{4}) where {K,Y}
return Transformer(((a, b),) -> Tuple{K,Y}(a, b); hoist_type=true)
end
To maintain the abstract nature of the contained types (to avoid excess type-hoisting). I haven't thought through the second, array example yet, but I'm guessing it's a similar problem.
An added complication is it should probably only be updated in a new hash version...
Having sat on this a bit and reviewed the different pieces of code, I actually think the right thing to target is not hoist_type
but isconcretetype
, as you originally posited @rasmushenningsson. This can happen even for types with a identity
transform (such as the Vector{Any}
example above).
The type hoisting is ultimately assuming that the contained type of a concrete type are themselves concrete, so the logic can't use isconcretetype
, it needs a new method (is_fully_concrete
??) that recursively verifies the concrete nature of both the current type and its contained type (where contained is defined by what StructType
it is).
how slow is it if we just drop all the hoisting stuff? (maybe with a cache?) It seems to add a ton of complexity
Something like this?
is_fully_concrete(x) = true # handles type parameters that are not types, e.g. dimensions
function is_fully_concrete(::Type{T}) where {T}
if T isa Union
return is_fully_concrete(T.a) && is_fully_concrete(T.b)
end
isconcretetype(T) || return false
for S in T.parameters
is_fully_concrete(S) || return false
end
return true
end
It might also slow things down for complicated types with layers of nested type parameters.
In essence, yes! Though there are a number of details I expect to be different. We have been avoiding T.parameters
since it is not documented as public, and there are a few places I have seen it recommended against. The goal is to avoid anything which could change in future julia versions. Instead I would use StructType
and the public facing operations of eltype
and fieldtypes
to extract relevant type parameters. (We don't care about type parameters that aren't concrete if they don't correspond to fields or the eltype ).
But yes, this could slow down things for complicated types. At some point it might be good to add some benchmarks for deeply nested structures.
If it was worth the lift, caching would easily turn that into a O(N) operation (but it's not clear yet to me if that's the case).
Benchmarking seems like the way to go.
There are a number of possible tweaks, e.g. is_fully_concrete
could give up if the nesting is too deep.
Oh... having thought through your vector example some more, I have realized something very silly:
ulia> x = Vector{Any}[Int[1]]
1-element Vector{Vector{Any}}:
[1]
julia> x[1]
1-element Vector{Any}:
1
So the reason the example leads to an equivalent hash is that the two values are the same.
Oh wow... π€―
I had thoroughly confused myself. This stuff is pretty tricky to reason about. It turns out that another way to fix this problem is to revert the change in #74 and replace it with:
hash_trait(::Pair) = StructTypes.OrderedStruct()
Then the examples above work as expected.
The issue here is that, as @ericphanson correctly observed, it is not actually safe to set hoist_type=true
when transforming Pair
into a Tuple. The problem being that it makes existing values that were in a field of type Any
fall into a field (of the Tuple
) of their concrete type of the value in that field. Which is exactly the reason for writing e.g. pick_fields
to avoid this in the case of generating NamedTuple
values of structs.
As I was reasoning through my proposed fix and the use of is_fully_concrete
I realized that the same logic occurs (via the use of isconcretetype
) inside of nested calls to hash_fields
, so then I was confused about why we needed is_fully_concrete
.
Problems only show up when hash_fields/elements
doesn't get to see the types as they originally exist in the container (e.g. when calling transform
).
So at the end of the day, the problem here is that the original method I wrote for transformer(::Type{<:Pair})
was wrong, fixing that problem and providing Pair
an appropriate StructType
means we no longer see the problems above.
Please feel free to check my reasoning here, but I'm fairly certain at the moment that it isn't possible to construct a concrete type containing e.g. Any
that will produce a hash collision. #77 has a working version with the above fix.
My plan is to sit on this a little bit longer, and try to think of some counter examples.
how slow is it if we just drop all the hoisting stuff? (maybe with a cache?) It seems to add a ton of complexity
The original motivation for adding type hoisting was that caching didn't improve speeds all that much, and things were quite slow compared to the baseline.
Here's a run of the benchmarks, where I've commented out the branches that type hoist in hash_fields
and hash_elements
:
32Γ6 DataFrame
Row β benchmark hash version base trait ratio
β SubStrinβ¦ SubStrinβ¦ SubStrinβ¦ String String Float64
ββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1 β structs crc 3 59.333 ΞΌs 2.084 ms 35.1175
2 β tuples crc 3 59.042 ΞΌs 1.632 ms 27.6392
3 β numbers crc 3 29.500 ΞΌs 139.958 ΞΌs 4.74434
4 β dicts crc 3 1.788 ms 8.296 ms 4.64083
5 β dataframes crc 3 72.000 ΞΌs 289.917 ΞΌs 4.02663
6 β strings crc 3 1.078 ms 1.049 ms 0.973519
7 β missings crc 3 322.708 ΞΌs 201.333 ΞΌs 0.623886
8 β symbols crc 3 1.356 ms 721.000 ΞΌs 0.531694
9 β structs sha256 3 515.166 ΞΌs 3.117 ms 6.04983
10 β tuples sha256 3 506.250 ΞΌs 2.693 ms 5.31918
11 β dicts sha256 3 2.342 ms 10.036 ms 4.28503
12 β numbers sha256 3 257.666 ΞΌs 371.291 ΞΌs 1.44098
13 β dataframes sha256 3 519.834 ΞΌs 745.292 ΞΌs 1.43371
14 β strings sha256 3 2.045 ms 2.227 ms 1.08899
15 β symbols sha256 3 2.297 ms 2.268 ms 0.987556
16 β missings sha256 3 594.167 ΞΌs 529.917 ΞΌs 0.891865
17 β structs crc 4 58.959 ΞΌs 179.828 ms 3050.05
18 β tuples crc 4 59.083 ΞΌs 164.344 ms 2781.58
19 β numbers crc 4 30.125 ΞΌs 45.617 ms 1514.24
20 β dataframes crc 4 72.958 ΞΌs 94.657 ms 1297.42
21 β missings crc 4 325.458 ΞΌs 82.838 ms 254.528
22 β dicts crc 4 1.798 ms 216.961 ms 120.64
23 β strings crc 4 1.083 ms 44.720 ms 41.2972
24 β symbols crc 4 1.327 ms 22.998 ms 17.3263
25 β structs sha256 4 505.625 ΞΌs 233.061 ms 460.936
26 β tuples sha256 4 511.083 ΞΌs 204.909 ms 400.931
27 β dataframes sha256 4 519.417 ΞΌs 117.985 ms 227.149
28 β numbers sha256 4 257.667 ΞΌs 57.100 ms 221.603
29 β missings sha256 4 582.833 ΞΌs 114.964 ms 197.251
30 β dicts sha256 4 2.269 ms 265.459 ms 116.968
31 β strings sha256 4 2.046 ms 60.122 ms 29.3845
32 β symbols sha256 4 2.248 ms 47.526 ms 21.1403
@ericphanson If I recall correctly, adding caching to the mix (without type hoisting) was about 3-6x faster than without it, so still roughly 100-1000x slower than with type hoisting.
I agree that it's tricky to reason about, and easy to go down the wrong rabbit hole. :smile:
The fix to treat Pair
as an OrderedStruct
seems like the right one.
In the current version of PR #77 I still get a failure here:
julia> stable_hash(Vector{Any}[Int[1]]; version=4) == stable_hash(Vector{Any}[Any[1]]; version=4)
true
What's your take on that one?
What's your take on that one?
See my comment above π
Oh... having thought through your vector example some more, I have realized something very silly:
ulia> x = Vector{Any}[Int[1]] 1-element Vector{Vector{Any}}: [1] julia> x[1] 1-element Vector{Any}: 1
So the reason the example leads to an equivalent hash is that the two values are the same.
I'm not sure whether this is to be considered a bug or just an edge case that we have to live with. It's easy enough to work around with custom hash contexts when needed. But I thought it might be worth an issue.
Setup:
The following work as expected:
But for
Pair
s andDict
s witheltype
Any
, the distinction betweenFoo
andBar
is lost:NB: The last two require #74 to run.
It boils down to
hash_elements
whereisconcretetype(Pair{Symbol,Any})
returns true. Sincehoist_type
is also true, it skips type information for the elements.