Pairing down the project's scope

beacon-biosignals / StableHashTraits.jl

Compute hashes over any Julia object simply and reproducibly

MIT License

9 stars 3 forks source link

Pairing down the project's scope #50

Closed ericphanson closed 1 month ago

ericphanson commented 9 months ago

I think this project's scope is too broad, which causes a relatively high-volume of maintenance work and reliance on internals (including details of printing / string representations, which are not covered by julia's semver guarantees).

I think we should rescope (possibly under a new name or new package) with maintainability as a core concern. In particular, I think:

We should identify narrow use-cases for stable content-based hashing, and verify that in these situations a hash collision results does not result in a correctness issue (merely a performance one). These should be documented as the supported use-cases of the package.
- We should declare other usages of the package as unsupported. We should only support a new use-case if it does not significantly increase the maintenance burden and does not require use of Julia or package internals. Often the answer should be "sorry, it is out of scope for this package doesn't help with that".
Informed by these use-cases, we should choose simple and clear semantics for when collisions are known to happen, and prefer to allow collisions over relying on internals.
- Instead of relying on increasing amounts of introspection to differentiate objects, we should require the caller to use the package in a way that they don't need to differentiate such objects.
We should carefully weigh the tradeoff of extensibility / customization vs complexity and maintenance work, and lean towards less extensibility and less maintenance work.

Other more specific thoughts:

We should support hashing a broad variety of Julia objects in a stable way. I.e. we should typically not throw an error when hashing an object. However, we do not need to support everything and should clearly document what is supported.
- it should be OK for some known collisions to occur. For example, I think it is probably OK if two AbstractArray types with the same values hash the same. Common "gotchas" should be documented.
- likewise, it could be OK to hash every type to UInt64(0) or such, or not support them.
We should be judicious in our use of dispatch and traits.
We should be comfortable asking the caller to transform or preprocess their object in some manner to facilitate hashing, or ask them to pass only the part of the object that should contribute to it's "identity"
We should consider using StructTypes or another package to handle "what kind of things is it" and not take on that responsibility, and document how the system works

haberdashPI commented 9 months ago

That sounds like a good idea.

This package has become somewhat more "important" now that Makie makes use of it: breaking it hurts a lot of other things, and fixing it for 1.10 was painful. #46 is meant as the long-term solution, so if we need to discuss this stuff, maybe that should be paused while these ideas are considered first.

I think in many cases it would not be too onerous to use StableHashTraits with a narrowed scope. Defining a few methods here and there for the things that it can't handle is usually pretty simple (that's part of the point of the trait-based dispatch), and the context mechanism means you don't have to worry about type piracy.

I worry that using StructTypes would be a bit of an issue, since how one wants to hash things may differ in use than how one wants to serialize. (E.g. serialization should preserve everything and allow for round tripping; hashing doesn't require reading and it can often be a lower burden to define how something should be hashed than how it should be read and written), but it would probably be worth thinking of concrete examples and seeing if that really is a problem.

One initial thought: most of the complexity arises in trying to create a stable id for types when you include their type parameters. One very simple change that would greatly reduce complexity, and would only marginally reduce scope would be to remove stable_type_id entirely, and only include stable_typename_id (which is much more reliable, and relies on no internals). State to users that if the type parameters of a type matter to the hash value, you as the user are required to implement an appropriate version of hash_method that handles this.

ericphanson commented 9 months ago

I worry that using StructTypes would be a bit of an issue, since how one wants to hash things may differ in use than how one wants to serialize. (E.g. serialization should preserve everything and allow for round tripping; hashing doesn't require reading and it can often be a lower burden to define how something should be hashed than how it should be read and written), but it would probably be worth thinking of concrete examples and seeing if that really is a problem.

Maybe though it could be worth one level of indirection to have transform or such that users can add dispatches to which is called first, and defaults to identity. I.e. transform is called, then StructTypes is invoked to see the type and how to access the children, then then each child is accessed and transform is called on each child prior to recursing. So if an MyObject should always be hashed by only looking at field1, transform could be transform(x::MyObj) = x.field1. I think StableHashTraits already has something like this, but has it's own trait-system instead of using something like StructTypes.

Mostly I think that a lot of complexity could be pushed off into StructTypes (and maybe something like AbstractTrees for traversal), making this package smaller and simpler, and could help make the semantics easier to explain. For example, StructTypes declares that DictTypes's are unordered (so e.g. the hashes should be combined in an order-independent manner like summing them), so any DictType would be treated as such. If this wasn't desired, one could transform their object to something else fulfilling the StructTypes interface and customize that way, without StableHashTraits needing a full suite of customization options.

One initial thought: most of the complexity arises in trying to create a stable id for types when you include their type parameters. One very simple change that would greatly reduce complexity, and would only marginally reduce scope would be to remove stable_type_id entirely, and only include stable_typename_id (which is much more reliable, and relies on no internals). State to users that if the type parameters of a type matter to the hash value, you as the user are required to implement an appropriate version of hash_method that handles this.

This makes sense to me, although I also think one could go further and only count the StructType of the object (dict vs struct vs list) in the hash, not the julia type or defining module or other things. Because often the specific type (or module it's defined in) is an implementation detail. Again if someone really wanted to work around this, they could define a transform that adds the type or some identifier string or whatever as a field of an object before hashing. Using only the struct type also means one doesn't need to introspect much about an object (e.g. it's fully qualified name), and provides stability in the face of minor changes to implementation details.

I also think instead of generated functions, some explicit memoization with a WeakKeyDict of object to hash would make sense. IIUC that would fix the issue seen in https://github.com/beacon-biosignals/StableHashTraits.jl/issues/26#issuecomment-1664795936 since the transitions would only have to be hashed once. I think then Makie wouldn't have to do it's own memoization as well. (Here, having "weak" keys is important, so the dict itself doesn't hold onto references it shouldn't, otherwise we have a memory leak).

haberdashPI commented 9 months ago

I also think instead of generated functions, some explicit memoization with a WeakKeyDict of object to hash would make sense.

If most of the time we are not hashing type information (because we use the SturctType of it instead) I think this is a viable strategy.

The main reason I reverted to using a generated function is because object types are littered all over the place and a huge amount of the time was being spent hashing the string representation of object types, so I moved that cost to compile time.

haberdashPI commented 9 months ago

I like the notion of pushing off complexity to StructTypes paired with transform—where transform dispatches on the hash context. I even suspect there could even be reasonable deprecation path — at the very least the existing traits could become some form of transformation to a CustomStruct (but I think a number could become things like ArrayType, or DataType)—but if the hashes have so radically changes maybe that is less important.

haberdashPI commented 9 months ago

@kleinschmidt: I wonder if you have any opinions here. When I was thinking through the latest design with Spencer you chimed in on that conversation saying that your own view was that any non-equal object should have a non-equal hash.

If we went with @ericphanson's proposal above, that would no longer be the case, strictly speaking. However, I think that in practice it is very rare to a have an object with equal content but unequal type, and it might be fair to be clear about project scope and to require the user to define a few small methods in situations where it is desired to differentiate content equal objects by their type.

ericphanson commented 9 months ago

any non-equal object should have a non-equal hash

Yeah, I think this is impossible in general (pretty much all hash algorithms have collisions, even if rare) and aiming for "any non-equal object should have non-equal hashes except in very rare collision scenarios" incurs too much maintenance burden. So I don't think that should be a goal.

IMO the goal should be: "it's very easy to tell when hash collisions will be rare or expected, because it's easy to understand what goes into computing the hash (and what goes in is very stable)".

in practice it is very rare to a have an object with equal content but unequal type

I have a slightly different take, which I think it's not that rare to have that (e.g. a live julia session could easily contain both StaticArray([1]) vs [1]) but it's semi-rare to need to distinguish between those (often an individual cache has mostly objects of the same type or of a few fixed types rather than being totally generic), and it's OK for that to be out of scope.

require the user to define a few small methods in situations where it is desired to differentiate content equal objects by their type

Yeah, I think if this is feasible with a small, stable API surface (e.g. the generic transform I mentioned above, which the user can use to do basically anything they want) then that's worthwhile, but we shouldn't make a API to opt-in specifically to "also compare by type" since then we have to maintain that.

palday commented 9 months ago

Seconding @ericphanson : All hash algorithms necessarily have collisions because of the pigeonhole principle. What most algorithms try to do is have similar things have different hashes so that it's unlikely that something you need to distinguish based on a hash would collide.

ericphanson commented 9 months ago

Yup. I think one problem is that a thing people want to do with a stable hash is make a persistent cache, where you have a bunch of (key, value) pairs (where the keys might be semi-complicated like a tuple (id, timespan, category) and the object might be a big blob or something, which maybe you can compute on-demand from the key but isn't cheap), and then you generate a stable hash of the key to get a short-ish hex string, and then you stick the blob in s3 or on disk somewhere based on the hash (like prefix/$hash or whatever). Then when you want to retrieve the value by key, you hash the key, lookup the path, check if it exists, and pull out the value. Here the advantage of the stable hash is you don't need to rely on the details of how timespans print or whatever, and your key can be detailed/big and still get a reasonably sized file path.

This naive cache is incorrect in general because of the possibility of hash collisions (which is largely independent of StableHashTraits - the possibility always exists) but cannot be fixed at the level of the hashing algorithm. Instead it must be handled at the level of the cache itself. E.g. by storing the key in some way alongside the value, and when you go to retrieve the value, first pull out the key and check there was no collision. (If there was a collision, you can just recompute the value without caching it, or also do something much more complicated to store both and disambiguate between them).

The reason this connects to this issue, is that I think one reason folks want "no collisions" is so they can do this naive caching, but it's just not possible to do correctly regardless of how nuanced StableHashTraits disambiguates values. All that does is change the probabilities from rare to expected/common, not eliminate them. So IMO we should make it easy to know where on that spectrum things are, but it's not actually that important to push "everything" to the rare side, since we cannot push things to "impossible" and that's where the big gains would be.

haberdashPI commented 9 months ago

@kleinschmidt: I wonder if you have any opinions here. When I was thinking through the latest design with Spencer you chimed in on that conversation saying that your own view was that any non-equal object should have a non-equal hash.

To be fair, I think I worded this a little inaccurately, I think the actual request was for the input to the hash algorithm to be different in this case.

That said, I agree with all of what's been said: use of StableHashTraits should not assume there are zero collisions, and a transparent way of understanding what will and will not lead to identical hashes seems good. In any of the ways I've used it for and in the way Makie uses it, if something serializes to the same value, that is a sufficient definition of equality for the hash to have the same input (and thus hash consistently to the same value).

haberdashPI commented 9 months ago

Regarding 1-3 in the OP:

Perhaps a sufficient "meta" use case (one that fits all of the examples I can think of) are those where a hash can safely be made consistent with the serialized value of the object according to StructTypes (with the options to transform it, or whatever that gets called). I think this covers:
- My use of this for function caching (in beacon internal repos)
- Makie's use case
- Various beacon internal preprocessing pipeline stuff

2-3. I think these points would be well covered by a design roughly like this:

Keep hash_method(::MyType, ::Context), have the fallback when no specialized value for Context is defined be the defaults from StructTypes of MyType. This maintains the flexibility of the API and the ability to avoid type piracy.
Keep FnHash as a possible return value for hash_mehtod: this will first transform the value (based on a function passed as the first argument) and then compute the hash_method of the transformed value (and, similar to the current API, it should take an optional second argument that is the StructType value to use as for the transformed value, in cases where not doing so would lead to infinite recursion).

(Optionally one could keep ContextAndHash up there as well, as it is low maintenance)

If this sounds good to others I'm happy to try out an implementation like this.

haberdashPI commented 9 months ago

I also think instead of generated functions, some explicit memoization with a WeakKeyDict of object to hash would make sense. IIUC that would fix the issue seen in https://github.com/beacon-biosignals/StableHashTraits.jl/issues/26#issuecomment-1664795936 since the transitions would only have to be hashed once. I think then Makie wouldn't have to do it's own memoization as well. (Here, having "weak" keys is important, so the dict itself doesn't hold onto references it shouldn't, otherwise we have a memory leak).

One thought here: in principle one can define new methods between calls to stable_hash so I think maintaining a dict of object hashes is risky. A less risky choice would be to implement caching per call to stable_hash. (And if you want to memoize its value you can do that, but that is separate from the internals of dealing with e.g. multiple references to large objects).

I'd see the dict of type values (if that is needed at all anymore) as separate from the dict of cached object values, since the former is something you'd want to save across calls to stable_hash (unlike objects, the hashed id of a type shouldn't really change).

ericphanson commented 9 months ago

One thought here: in principle one can define new methods between calls to stable_hash so I think maintaining a dict of object hashes is risky. A less risky choice would be to implement caching per call to stable_hash. (And if you want to memoize its value you can do that, but that is separate from the internals of dealing with e.g. multiple references to large objects).

One option here is to provide a Context that holds a cache, and allow the user to re-use that context if they want to do so (taking responsibility for invalidation)

ericphanson commented 9 months ago

Keep FnHash as a possible return value for hash_mehtod: this will first transform the value (based on a function passed as the first argument) and then compute the hash_method of the transformed value (and, similar to the current API, it should take an optional second argument that is the StructType value to use as for the transformed value, in cases where not doing so would lead to infinite recursion).

Do we need FnHash? It seems more complicated and that there are multiple levels at which to injection things here. What about:

stable_hash(::MyType, ::Context): entrypoint, calls _stable_hash(in_progress_hash, ::MyType, ::Context)
_stable_hash(in_progress_hash, ::MyType, ::Context): computes transformed = transform(::MyType, ::Context) then calls _stable_hash_by_struct_type(in_progress_hash, transformed, StructTypes.StructType(transformed), ::Context)
we define methods for _stable_hash_by_struct_type for each StructTypes.StructType, lowering to the children.

In this setup, the only hook that users are allowed is defining transform(::MyType, ::Context) (which defaults to identity); the rest of the methods are internal.

haberdashPI commented 9 months ago

I think the problem with that specific setup is that you can only apply the transform at the top level. E.g. if you transform ::Int in your ::Context it won't get applied to ::Int inside of ::MyType. It is useful to say "in this context I hash dates in this way".

But I'm not sure you're wrong that we could just have the one method transform that gets applied appropriately, I just think that should be done such that it applies to the children of ::MyType as well. That seems more elegant.

ericphanson commented 9 months ago

should be done such that it applies to the children of ::MyType as well.

Yeah, that’s what I meant. In my _stable_hash_by_struct_type I meant to say it should call _stable_hash on each child, so there’s the opportunity to transform it.

haberdashPI commented 9 months ago

Cool, okay. So to summarize:

There will be a single user function to implement that customizes hashing having the signature transform(::MyType, ::Context) that returns the value to hash to represent an object of type MyType (defaulting to identity)
To implement hashing we use StructType(transform(x, context)) and dispatch on the result
We can use a cache defined per context that avoids repeating the hashing of an already seen object (this seems like an optimization that can be done separately from the initial implementation using StructType)

Thinking on where this should live: I actually think it would be possible to define here (as 1.2 and 2.0). For 1.2 I'd add the new stuff as hash version 3 and mark any old methods as deprecated. In a follow-up PR the old pre 1.2 stuff would get deleted for 2.0.

ericphanson commented 9 months ago

That sounds good to me! Some more questions:

Do we want to support hashing of types? E.g. if someone puts obj = (; my_type=Int) and calls stable_hash(obj), should that error?
- I think probably it should unless we have a good way to do it that doesn't rely on internals (or printing, which isn't stable)
- if we do have a way to do it, what about parametric types? Could drop the params if that is easier
- if we are hashing them, do we care where a type is defined? (i.e. fully qualified name). I think probably not.
Do we want to support hashing of modules? E.g. obj = (; my_mod = Base).

Maybe the answers to these are basically "does StructTypes handle it", and if so we can just go with that (?).

Are there situations where naive use of StructTypes is really bad, that we need to special case somehow?

If StructTypes.jl itself changes something, do we commit to stability? I think probably not, I think we should say we are as stable as StructTypes is and we won't special case stuff to preserve stability.

ericphanson commented 9 months ago

Maybe it's worth tagging in @quinnj here: Jacob, we have this package (StableHashTraits) that tries to provide a stable_hash function that gives a hash which is stable across julia sessions and julia versions. Our current version has a heavy maintenance burden because it tries to hash "everything" and depends on internals to do so. We also have a semi-complicated trait based system so the user can tell StableHashTraits how they want their object to be hashed.

I am proposing we defer much of this work to StructTypes. StructTypes can tell us what the relevant "children" are for a given object, and also we won't distinguish between e.g. two different ArrayType's (but we would distinguish between an ArrayType and a DictType with the same fields). This provides additional stability in the face of minor implementation changes (the actual type) and means we don't have to figure out stuff like the fully qualified type name in a stable way.

Does this seem like a reasonable use of StructTypes to you? Or would you recommend against it.

haberdashPI commented 9 months ago

Do we want to support hashing of types? E.g. if someone puts obj = (; my_type=Int) and calls stable_hash(obj), should that error?

I think probably it should unless we have a good way to do it that doesn't rely on internals (or printing, which isn't stable)

Both nameof and parentmodule are public API so I think types work just fine. I think this is wroth doing. This is what stable_typename_id and qualified_name use.

if we do have a way to do it, what about parametric types? Could drop the params if that is easier

From what I understand there is no public API way to iterate over the type parameters of a type. There is a field (parameters) of DataType that gives you these, and it has been around for a while (at least since 1.0). This is how I created the stable type string in #43.

if we are hashing them, do we care where a type is defined? (i.e. fully qualified name). I think probably not.

I'm inclined to think it could matter. One can imagine very generic names (e.g. Point, List) having different implementations and meanings across modules.

Do we want to support hashing of modules? E.g. obj = (; my_mod = Base).

nameof and parentmodule are both defined for Module, so I'd say yes.

Are there situations where naive use of StructTypes is really bad, that we need to special case somehow?

There are a lot of tests in this repository now. One first step would be to see how StructType plays out in the tests that I've have been written here.

haberdashPI commented 1 month ago

Closed by #58