IreneKnapp / direct-murmur-hash

MIT License
0 stars 0 forks source link

Consider using hashabler to support hashing arbitrary types? #3

Open jberryman opened 8 years ago

jberryman commented 8 years ago

Would you be interested in merging changes that permitted hashing of arbitrary Hashable types from my hashabler package? The resulting hash functions would look something like:

hash :: Hashable a=> a -> Hash

where a may also be ByteString. This would allow hashing of arbitrary haskell values in a performant way (without needing to marshal to a bytestring first), and cross-platform consistent way, etc.

I'm not sure when I'd be able to get to this, but I wanted to float the idea first. Let me know if you're interested and would like more details about what that might look like.

IreneKnapp commented 8 years ago

I looked through your package and I do think it's got a reasonable goal. In principle, I'm willing to be persuaded that this is a good idea.

I'm a little conflicted as to whether it's really best to do this in a way that makes each hash-function package depend on hashabler - that's how this would have to work, right? My concern is primarily that there is more than one strategy possible for serializing Haskell values, and in fact there's a lot of packages existing that do that in various ways for various purposes, and there are a lot of design pitfalls which they can fall into. Picking a point in that design space and tightly coupling a hash-function implementation to it is somewhat unappealing. I know that users who would prefer their own way could go via ByteString, but that doesn't really make me happy about adding a dependency that some users will have to work around.

Thinking about it more, it occurs to me that using a Hashable ByteString instance as the "raw" interface is a bit strange. It works, given that your particular recommended serialization strategy isn't attempting to ensure that hashed values from a ByteString input would always be distinguishable from hashed values of, say, a Text input. But that's conceptual overhead for users.

I think, also, there might be a performant solution that allows serialization-for-hashing packages to depend on hash-function packages, instead of the reverse - namely, lazy ByteStrings. They're designed to have minimal time overhead in this sort of scenario which involves concatenating many small pieces. In an ideal world where everyone has infinite time to spend on such efforts, I'd love to see an empirical comparison of the two approaches.

jberryman commented 8 years ago

I'm a little conflicted as to whether it's really best to do this in a way that makes each hash-function package depend on hashabler - that's how this would have to work, right?

Right. The idea is to improve upon hashable by decoupling serialization-for-hashing from folding-over-bytes-with-a-hash-loop.

An alternative if hash function package authors didn't want to depend on hashabler would be for them to write their functions on lazy bytestrings (as I think you were suggesting) hash :: L.ByteString -> Hash. hashabler could easily expose a serialization function :: Hashable a=> a -> L.ByteString. That would just be less efficient.

The other benefit is hashable provides a disciplined approach to keeping track of compatibility of hashes and serialized hash-based data structures.

My concern is primarily that there is more than one strategy possible for serializing Haskell values

Can you elaborate on that? It seemed to me that in the context of hashing we care that 1) values are serialized to consistent bytes across platforms, and 2) only relevant parts of the input are hashed.

Thanks for the feedback!

IreneKnapp commented 8 years ago

Sorry to leave so long between replies - I was waiting until I had time to reply to everything. So, more when I can, I guess.

When I say more than one strategy possible, I mean that sometimes you're implementing a spec that requires your hashed values to agree the ones another system will produce. So you might find that a provided Hashabler instance for a widely-used type isn't suitable for you. Of course, you can always escape the default behavior using a newtype or by explicitly converting to a ByteString. So I'm trying to decide how I feel about it.