That sounds like a good idea.
This package has become somewhat more "important" now that Makie makes use of it: breaking it hurts a lot of other things, and fixing it for 1.10 was painful. #46 is meant as the long-term solution, so if we need to discuss this stuff, maybe that should be paused while these ideas are considered first.
I think in many cases it would not be too onerous to use StableHashTraits with a narrowed scope. Defining a few methods here and there for the things that it can't handle is usually pretty simple (that's part of the point of the trait-based dispatch), and the context mechanism means you don't have to worry about type piracy.
I worry that using StructTypes would be a bit of an issue, since how one wants to hash things may differ in use than how one wants to serialize. (E.g. serialization should preserve everything and allow for round tripping; hashing doesn't require reading and it can often be a lower burden to define how something should be hashed than how it should be read and written), but it would probably be worth thinking of concrete examples and seeing if that really is a problem.
One initial thought: most of the complexity arises in trying to create a stable id for types when you include their type parameters. One very simple change that would greatly reduce complexity, and would only marginally reduce scope, would be to remove `stable_type_id` entirely and only include `stable_typename_id` (which is much more reliable, and relies on no internals). State to users that if the type parameters of a type matter to the hash value, you as the user are required to implement an appropriate version of `hash_method` that handles this.
> I worry that using StructTypes would be a bit of an issue, since how one wants to hash things may differ in use than how one wants to serialize. (E.g. serialization should preserve everything and allow for round tripping; hashing doesn't require reading and it can often be a lower burden to define how something should be hashed than how it should be read and written), but it would probably be worth thinking of concrete examples and seeing if that really is a problem.
Maybe, though, it could be worth one level of indirection to have `transform` or some such that users can add dispatches to, which is called first and defaults to `identity`. I.e. `transform` is called, then StructTypes is invoked to see the type and how to access the children, then each child is accessed and `transform` is called on it prior to recursing. So if a `MyObj` should always be hashed by only looking at `field1`, one could define `transform(x::MyObj) = x.field1` (a minimal sketch is below). I think StableHashTraits already has something like this, but with its own trait system instead of using something like StructTypes.
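A minimal sketch of that hook (the names `transform` and `MyObj` here are illustrative, not an existing API):

```julia
using StructTypes

struct MyObj
    field1::String
    field2::Int   # implementation detail we don't want reflected in the hash
end

# Default: hash values as-is.
transform(x) = x

# Opt-in: represent MyObj by field1 alone when hashing.
transform(x::MyObj) = x.field1

# The hashing machinery would then inspect StructTypes.StructType(typeof(transform(x)))
# to decide how to access children, calling transform on each child before recursing.
```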
Mostly I think that a lot of complexity could be pushed off into StructTypes (and maybe something like AbstractTrees for traversal), making this package smaller and simpler, and it could help make the semantics easier to explain. For example, StructTypes declares that `DictType`s are unordered (so e.g. the hashes should be combined in an order-independent manner, like summing them; see the sketch below), so any `DictType` would be treated as such. If this wasn't desired, one could `transform` their object to something else fulfilling the StructTypes interface and customize that way, without StableHashTraits needing a full suite of customization options.
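For the unordered case, "combined in an order-independent manner" could look roughly like this (illustrative only; `hash_entry` stands in for however a single entry gets hashed and is assumed to return a `UInt64`):

```julia
# Combine per-entry hashes with a commutative operation (here +; xor also works), so
# Dict("a" => 1, "b" => 2) and Dict("b" => 2, "a" => 1) combine to the same value.
function unordered_combine(entries, hash_entry)
    acc = zero(UInt64)
    for entry in entries
        acc += hash_entry(entry)
    end
    return acc
end
```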
> One initial thought: most of the complexity arises in trying to create a stable id for types when you include their type parameters. One very simple change that would greatly reduce complexity, and would only marginally reduce scope, would be to remove `stable_type_id` entirely and only include `stable_typename_id` (which is much more reliable, and relies on no internals). State to users that if the type parameters of a type matter to the hash value, you as the user are required to implement an appropriate version of `hash_method` that handles this.
This makes sense to me, although I also think one could go further and only count the StructType of the object (dict vs struct vs list) in the hash, not the julia type or defining module or other things, because often the specific type (or the module it's defined in) is an implementation detail. Again, if someone really wanted to work around this, they could define a `transform` that adds the type or some identifier string or whatever as a field of an object before hashing (sketched below). Using only the struct type also means one doesn't need to introspect much about an object (e.g. its fully qualified name), and it provides stability in the face of minor changes to implementation details.
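For example (again with a hypothetical `transform` and context), a user who does care about the type could fold an explicit tag into the value themselves:

```julia
struct MyContext end   # hypothetical hashing context

struct Point2D
    x::Float64
    y::Float64
end

# Fold an explicit, stable identifier into the value that gets hashed, rather than
# relying on StableHashTraits to introspect the Julia type itself.
transform(p::Point2D, ::MyContext) = (type_tag = "MyPkg.Point2D", x = p.x, y = p.y)
```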
I also think instead of generated functions, some explicit memoization with a `WeakKeyDict` of object to hash would make sense. IIUC that would fix the issue seen in https://github.com/beacon-biosignals/StableHashTraits.jl/issues/26#issuecomment-1664795936 since the transitions would only have to be hashed once. I think then Makie wouldn't have to do its own memoization as well. (Here, having "weak" keys is important, so the dict itself doesn't hold onto references it shouldn't; otherwise we have a memory leak.)
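A rough sketch of that idea (not how the package currently works; note that `WeakKeyDict` keys must be heap-allocated objects, since immutables cannot carry the finalizers it relies on):

```julia
# object => previously computed hash; weak keys let entries be collected with their objects
const HASH_CACHE = WeakKeyDict{Any,UInt64}()

# `compute` stands in for the actual (expensive) stable-hash computation.
function memoized_hash(x, compute)
    ismutable(x) || return compute(x)            # can't weakly reference immutables
    return get!(() -> compute(x), HASH_CACHE, x)
end
```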
> I also think instead of generated functions, some explicit memoization with a `WeakKeyDict` of object to hash would make sense.
If most of the time we are not hashing type information (because we use the StructType of it instead), I think this is a viable strategy.
The main reason I reverted to using a generated function is because object types are littered all over the place and a huge amount of the time was being spent hashing the string representation of object types, so I moved that cost to compile time.
I like the notion of pushing off complexity to StructTypes paired with `transform`, where `transform` dispatches on the hash context. I even suspect there could be a reasonable deprecation path: at the very least the existing traits could become some form of transformation to a `CustomStruct` (and I think a number could become things like `ArrayType` or `DataType`). But if the hashes have changed so radically, maybe that is less important.
@kleinschmidt: I wonder if you have any opinions here. When I was thinking through the latest design with Spencer you chimed in on that conversation saying that your own view was that any non-equal object should have a non-equal hash.
If we went with @ericphanson's proposal above, that would no longer be the case, strictly speaking. However, I think that in practice it is very rare to have an object with equal content but unequal type, and it might be fair to be clear about project scope and to require the user to define a few small methods in situations where it is desired to differentiate content-equal objects by their type.
> any non-equal object should have a non-equal hash
Yeah, I think this is impossible in general (pretty much all hash algorithms have collisions, even if rare) and aiming for "any non-equal object should have non-equal hashes except in very rare collision scenarios" incurs too much maintenance burden. So I don't think that should be a goal.
IMO the goal should be: "it's very easy to tell when hash collisions will be rare or expected, because it's easy to understand what goes into computing the hash (and what goes in is very stable)".
> in practice it is very rare to have an object with equal content but unequal type
I have a slightly different take, which is that it's not that rare to have that (e.g. a live julia session could easily contain both `StaticArray([1])` and `[1]`), but it's semi-rare to need to distinguish between those (often an individual cache has mostly objects of the same type, or of a few fixed types, rather than being totally generic), and it's OK for that to be out of scope.
> require the user to define a few small methods in situations where it is desired to differentiate content-equal objects by their type
Yeah, I think if this is feasible with a small, stable API surface (e.g. the generic `transform` I mentioned above, which the user can use to do basically anything they want) then that's worthwhile, but we shouldn't make an API to opt in specifically to "also compare by type", since then we have to maintain that.
Seconding @ericphanson: all hash algorithms necessarily have collisions because of the pigeonhole principle. What most algorithms try to do is make similar things have different hashes, so that it's unlikely that something you need to distinguish based on a hash would collide.
Yup. I think one problem is that a thing people want to do with a stable hash is make a persistent cache, where you have a bunch of `(key, value)` pairs (where the keys might be semi-complicated, like a tuple `(id, timespan, category)`, and the value might be a big blob or something which maybe you can compute on demand from the key but isn't cheap). You then generate a stable hash of the key to get a short-ish hex string, and you stick the blob in S3 or on disk somewhere based on the hash (like `prefix/$hash` or whatever). Then when you want to retrieve the value by key, you hash the key, look up the path, check if it exists, and pull out the value. Here the advantage of the stable hash is that you don't need to rely on the details of how timespans print or whatever, and your key can be detailed/big and still give a reasonably sized file path.
This naive cache is incorrect in general because of the possibility of hash collisions (which is largely independent of StableHashTraits: the possibility always exists), and this cannot be fixed at the level of the hashing algorithm. Instead it must be handled at the level of the cache itself, e.g. by storing the key in some way alongside the value; when you go to retrieve the value, first pull out the key and check there was no collision. (If there was a collision, you can just recompute the value without caching it, or do something much more complicated to store both and disambiguate between them.)
The reason this connects to this issue is that I think one reason folks want "no collisions" is so they can do this naive caching, but it's just not possible to do correctly regardless of how finely StableHashTraits disambiguates values. All that does is change the probabilities from rare to expected/common, not eliminate them. So IMO we should make it easy to know where on that spectrum things are, but it's not actually that important to push "everything" to the rare side, since we cannot push things to "impossible", and that's where the big gains would be.
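Concretely, that fix looks something like the following sketch (the file layout and use of `Serialization` are purely illustrative, and `hashfn` stands in for a stable hash function):

```julia
using Serialization

function cache_store!(root, key, value, hashfn)
    dir = joinpath(root, string(hashfn(key)))
    mkpath(dir)
    serialize(joinpath(dir, "key"), key)       # store the key alongside the value
    serialize(joinpath(dir, "value"), value)
    return value
end

function cache_get(root, key, compute, hashfn)
    dir = joinpath(root, string(hashfn(key)))
    keyfile = joinpath(dir, "key")
    if isfile(keyfile) && deserialize(keyfile) == key   # detect misses and collisions
        return deserialize(joinpath(dir, "value"))
    end
    return compute(key)   # collision or miss: recompute (optionally store it too)
end
```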
> @kleinschmidt: I wonder if you have any opinions here. When I was thinking through the latest design with Spencer you chimed in on that conversation saying that your own view was that any non-equal object should have a non-equal hash.
To be fair, I think I worded this a little inaccurately; I think the actual request was for the input to the hash algorithm to be different in this case.
That said, I agree with all of what's been said: use of StableHashTraits should not assume there are zero collisions, and a transparent way of understanding what will and will not lead to identical hashes seems good. In any of the ways I've used it and in the way Makie uses it, if something serializes to the same value, that is a sufficient definition of equality for the hash to have the same input (and thus hash consistently to the same value).
Regarding 1-3 in the OP:

1. I think this is covered by letting users transform their object first (`transform` it, or whatever that gets called).

2-3. I think these points would be well covered by a design roughly like this:

- Keep `hash_method(::MyType, ::Context)`; have the fallback when no specialized value for `Context` is defined be the defaults from `StructTypes` of `MyType`. This maintains the flexibility of the API and the ability to avoid type piracy.
- Keep `FnHash` as a possible return value for `hash_method`: this will first transform the value (based on a function passed as the first argument) and then compute the `hash_method` of the transformed value (and, similar to the current API, it should take an optional second argument that is the `StructType` value to use for the transformed value, in cases where not doing so would lead to infinite recursion). A rough usage sketch follows this list.
- Optionally one could keep `ContextAndHash` up there as well, as it is low maintenance.
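For reference, under that design a customization would look roughly like it does in the current API (1.x-style names; exact signatures may differ):

```julia
using StableHashTraits

struct Experiment
    samples::Vector{Float64}
    scratch::Dict{Symbol,Any}   # not meaningful for hashing
end

struct MyContext end
StableHashTraits.parent_context(::MyContext) = StableHashTraits.HashVersion{2}()

# Hash Experiment by first lowering it to just its samples.
StableHashTraits.hash_method(::Experiment, ::MyContext) = StableHashTraits.FnHash(x -> x.samples)

# stable_hash(Experiment([1.0, 2.0], Dict{Symbol,Any}()), MyContext())
```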
If this sounds good to others I'm happy to try out an implementation like this.
> I also think instead of generated functions, some explicit memoization with a `WeakKeyDict` of object to hash would make sense. IIUC that would fix the issue seen in https://github.com/beacon-biosignals/StableHashTraits.jl/issues/26#issuecomment-1664795936 since the transitions would only have to be hashed once. I think then Makie wouldn't have to do its own memoization as well. (Here, having "weak" keys is important, so the dict itself doesn't hold onto references it shouldn't; otherwise we have a memory leak.)
One thought here: in principle one can define new methods between calls to `stable_hash`, so I think maintaining a dict of object hashes is risky. A less risky choice would be to implement caching per call to `stable_hash`. (And if you want to memoize its value you can do that, but that is separate from the internals of dealing with e.g. multiple references to large objects.)

I'd see the dict of type values (if that is needed at all anymore) as separate from the dict of cached object values, since the former is something you'd want to save across calls to `stable_hash` (unlike objects, the hashed id of a type shouldn't really change).
> One thought here: in principle one can define new methods between calls to `stable_hash`, so I think maintaining a dict of object hashes is risky. A less risky choice would be to implement caching per call to `stable_hash`. (And if you want to memoize its value you can do that, but that is separate from the internals of dealing with e.g. multiple references to large objects.)
One option here is to provide a `Context` that holds a cache, and allow the user to re-use that context if they want to do so (taking responsibility for invalidation).
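Something along these lines (hypothetical names; `HashVersion{3}` refers to the proposed new hash version):

```julia
# A user-owned context that carries its own cache; re-using it across calls opts
# into memoization, and the user takes responsibility for invalidating it.
struct CachingContext{P}
    parent::P
    cache::IdDict{Any,UInt64}   # object => already-computed hash
end
CachingContext(parent) = CachingContext(parent, IdDict{Any,UInt64}())

# ctx = CachingContext(HashVersion{3}())    # hypothetical new hash version
# stable_hash(a, ctx); stable_hash(b, ctx)  # shared sub-objects hashed only once
```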
> Keep `FnHash` as a possible return value for `hash_method`: this will first transform the value (based on a function passed as the first argument) and then compute the `hash_method` of the transformed value (and, similar to the current API, it should take an optional second argument that is the `StructType` value to use for the transformed value, in cases where not doing so would lead to infinite recursion).
Do we need `FnHash`? It seems more complicated, and there are multiple levels at which to inject things here. What about:

- `stable_hash(::MyType, ::Context)`: entrypoint, calls `_stable_hash(in_progress_hash, ::MyType, ::Context)`
- `_stable_hash(in_progress_hash, ::MyType, ::Context)`: computes `transformed = transform(::MyType, ::Context)`, then calls `_stable_hash_by_struct_type(in_progress_hash, transformed, StructTypes.StructType(transformed), ::Context)`
- methods of `_stable_hash_by_struct_type` for each `StructTypes.StructType`, lowering to the children.

In this setup, the only hook that users are allowed is defining `transform(::MyType, ::Context)` (which defaults to `identity`); the rest of the methods are internal.
I think the problem with that specific setup is that you can only apply the transform at the top level. E.g. if you transform ::Int in your ::Context it won't get applied to ::Int inside of ::MyType. It is useful to say "in this context I hash dates in this way".
But I'm not sure you're wrong that we could just have the one method `transform` that gets applied appropriately; I just think that should be done such that it applies to the children of `::MyType` as well. That seems more elegant.
> should be done such that it applies to the children of `::MyType` as well.
Yeah, that's what I meant. In my `_stable_hash_by_struct_type` I meant to say it should call `_stable_hash` on each child, so there's the opportunity to transform it.
Cool, okay. So to summarize:

- Users define `transform(::MyType, ::Context)`, which returns the value to hash to represent an object of type `MyType` (defaulting to `identity`).
- Internally, call `StructType(transform(x, context))` and dispatch on the resulting `StructType`, recursing into children the same way (a rough sketch is below).

Thinking on where this should live: I actually think it would be possible to define here (as 1.2 and 2.0). For 1.2 I'd add the new stuff as hash version 3 and mark any old methods as deprecated. In a follow-up PR the old pre-1.2 stuff would get deleted for 2.0.
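A minimal end-to-end sketch of that shape, under assumed names (`transform`, `_stable_hash!`, etc. are placeholders, not the package's API; a real implementation would also mix in the `StructType` tag, field names, dict-ordering rules, and so on):

```julia
using StructTypes, SHA

transform(x, context) = x   # the single user hook; defaults to identity

stable_hash(x, context) = bytes2hex(sha256(_stable_hash!(UInt8[], x, context)))

function _stable_hash!(buf, x, context)
    t = transform(x, context)
    return _hash_by_struct_type!(buf, t, StructTypes.StructType(typeof(t)), context)
end

# Structs: recurse into fields, so `transform` also applies to children.
function _hash_by_struct_type!(buf, x, ::StructTypes.DataType, context)
    for name in fieldnames(typeof(x))
        _stable_hash!(buf, getfield(x, name), context)
    end
    return buf
end

_hash_by_struct_type!(buf, x, ::StructTypes.ArrayType, context) =
    (foreach(el -> _stable_hash!(buf, el, context), x); buf)

_hash_by_struct_type!(buf, x, ::StructTypes.NumberType, context) =
    append!(buf, reinterpret(UInt8, [x]))

_hash_by_struct_type!(buf, x, ::StructTypes.StringType, context) =
    append!(buf, codeunits(string(x)))

# Example: opt a type in via StructTypes, and customize one type via transform.
struct MyContext end
struct Measurement
    label::String
    values::Vector{Float64}
end
StructTypes.StructType(::Type{Measurement}) = StructTypes.Struct()
transform(m::Measurement, ::MyContext) = m.values   # ignore the label in this context

# stable_hash(Measurement("a", [1.0, 2.0]), MyContext())
```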
That sounds good to me! Some more questions:

- Do we want to support hashing of types? E.g. if someone puts `obj = (; my_type=Int)` and calls `stable_hash(obj)`, should that error?
  - I think probably it should unless we have a good way to do it that doesn't rely on internals (or printing, which isn't stable)
  - if we do have a way to do it, what about parametric types? Could drop the params if that is easier
  - if we are hashing them, do we care where a type is defined? (i.e. fully qualified name). I think probably not.
- Do we want to support hashing of modules? E.g. `obj = (; my_mod = Base)`.

Maybe the answers to these are basically "does StructTypes handle it", and if so we can just go with that (?).
Are there situations where naive use of StructTypes is really bad, such that we need to special-case somehow?
If StructTypes.jl itself changes something, do we commit to stability? I think probably not; we should say we are as stable as StructTypes is, and we won't special-case stuff to preserve stability.
Maybe it's worth tagging in @quinnj here: Jacob, we have this package (StableHashTraits) that tries to provide a `stable_hash` function that gives a hash which is stable across julia sessions and julia versions. Our current version has a heavy maintenance burden because it tries to hash "everything" and depends on internals to do so. We also have a semi-complicated trait-based system so the user can tell StableHashTraits how they want their object to be hashed.

I am proposing we defer much of this work to StructTypes. StructTypes can tell us what the relevant "children" are for a given object, and also we won't distinguish between e.g. two different `ArrayType`s (but we would distinguish between an `ArrayType` and a `DictType` with the same fields). This provides additional stability in the face of minor implementation changes (the actual type) and means we don't have to figure out stuff like the fully qualified type name in a stable way.

Does this seem like a reasonable use of StructTypes to you? Or would you recommend against it?
> - Do we want to support hashing of types? E.g. if someone puts `obj = (; my_type=Int)` and calls `stable_hash(obj)`, should that error?
> - I think probably it should unless we have a good way to do it that doesn't rely on internals (or printing, which isn't stable)
Both `nameof` and `parentmodule` are public API, so I think types work just fine. I think this is worth doing. This is what `stable_typename_id` and `qualified_name` use.
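For illustration (a sketch, not necessarily what `qualified_name` actually does), a module-qualified name can be built from documented functions alone:

```julia
# Build "Module.Submodule.TypeName" using only documented functions.
qualified_name_sketch(T::Type) = join((fullname(parentmodule(T))..., nameof(T)), '.')

qualified_name_sketch(Base.Iterators.Zip)   # "Base.Iterators.Zip"
join(fullname(Base.Iterators), '.')         # "Base.Iterators" (works for modules too)
```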
> - if we do have a way to do it, what about parametric types? Could drop the params if that is easier
From what I understand there is no public-API way to iterate over the type parameters of a type. There is a field (`parameters`) of `DataType` that gives you these, and it has been around for a while (at least since 1.0). This is how I created the stable type string in #43.
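A sketch of that approach (it relies on the undocumented but long-standing `parameters` field, and ignores complications like `Union`, `Vararg`, and free type variables):

```julia
function type_string_sketch(T::Type)
    T isa UnionAll && return type_string_sketch(Base.unwrap_unionall(T))  # peel `where` wrappers
    params = T.parameters   # Core.SimpleVector of type parameters (undocumented field)
    isempty(params) && return string(nameof(T))
    inner = join((p isa Type ? type_string_sketch(p) : string(p) for p in params), ',')
    return string(nameof(T), '{', inner, '}')
end

type_string_sketch(Dict{String,Vector{Int}})   # "Dict{String,Array{Int64,1}}"
```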
> - if we are hashing them, do we care where a type is defined? (i.e. fully qualified name). I think probably not.
I'm inclined to think it could matter. One can imagine very generic names (e.g. `Point`, `List`) having different implementations and meanings across modules.
> - Do we want to support hashing of modules? E.g. `obj = (; my_mod = Base)`.
`nameof` and `parentmodule` are both defined for `Module`, so I'd say yes.
> Are there situations where naive use of StructTypes is really bad, such that we need to special-case somehow?
There are a lot of tests in this repository now. A first step would be to see how StructTypes plays out in the tests that have been written here.
Closed by #58
I think this project's scope is too broad, which causes a relatively high volume of maintenance work and reliance on internals (including details of printing / string representations, which are not covered by julia's semver guarantees).
I think we should rescope (possibly under a new name or new package) with maintainability as a core concern. In particular, I think:
Other more specific thoughts:

- […] `AbstractArray` types with the same values hash the same. Common "gotchas" should be documented.
- […] `UInt64(0)` or such, or not support them.