Hashing over serialization: MurMur, xxHash, Base64, MD5

qwwdfsad commented 6 years ago

In branch murmur_hash I've introduced a prototype which leverages serialization infrastructure to implement Guava-like 128-bit MurMur3 hash. The main idea is to provide KOutput which hashes given object instead of writing it somewhere.

The idea looks promising for the following reasons: 1) Once written, Hasher can be used from JVM, JS and native 2) KOutput desing allows using Hasher for hashing both standalone objects and streams of objects 4) End users should not write adapters for their classes to be hashable. Moreover, using hasher doesn't introduce any additional classes or generated code.

For example, in Guava for class data class Person(id: Int, name: String) to be hashable user should provide Funnel implementation which deconstructs Person into primitive fields. So one should write something like Hashing.murmur3_128().newHasher().putObject(person, Funnel<Person> { from, sink -> sink.putInt(id).putString(name)}).makeHash().asLong(). Usually this funnel should be cached in a class field somewhere, bytecode size and methods count become larger, and things get even worse when class contains non-primitive fields.

The current design allows to write MurMur3_128Hasher().longHash(person) as long as person is serializable.

Performance looks even more promising: we don't have to allocate intermediate object and KOutput can be easily reused.

Benchmark results, compared with Guava simple data class:

HashingBenchmark.guavaHash       thrpt    5   8.982 ± 1.156  ops/us
HashingBenchmark.kxHash          thrpt    5  18.766 ± 4.031  ops/us
HashingBenchmark.kxHashReusable  thrpt    5  20.549 ± 0.636  ops/us

Now we should understand whether community and kotlinx.serialization users are interested in this functionality and react accordingly.

LouisCAD commented 6 years ago

Is there a reason there's no optional ByteArray parameter for the makeByteArrayHash or a variant of it that just takes a ByteArray and returns Unit ? It could save a few allocation (don't know if it's worth it)

qwwdfsad commented 6 years ago

I was thinking about making KOutput (and its underlying byte array) reusable, but adding new overload is also possible

LouisCAD commented 6 years ago

To ensure zero ambiguity, I'm talking about using a ByteArray parameter to avoid allocation as done here: https://github.com/Kotlin/kotlinx.serialization/blob/a2277b377f2b57ffeb96da559222acb53d3918a4/runtime/jvm/src/main/kotlin/kotlinx/serialization/hashing/MurMur3_128Hasher.kt#L82

I don't know if it's worth it as I lack knowledge about the use cases though.

qwwdfsad commented 2 years ago

With @MetaSerializable we can do even better and introduce a @Hashable annotation that reflects the intent much better

rnett commented 1 year ago

Ended up doing something similar myself, using a custom encoder that writes to a DataOutput, which in practice is a DataOutputStream wrapping a DigestOutputStream. No idea how performant it is at this point, am just trying to get it working. A couple trouble points I ran into:

I would like to have a Hashable interface where anything extending it can be hashed without annotations. This is not possible, and seems like a design decision, which makes sense when deserialization is considered, but that is obviously not a consideration here. Funnily enough polymorphicDefaultSerializer works as long as the type I'm trying to serialize is an interface, but not if it's concrete.
I would like to have a @HashWith annotation that can point to a class that implements Hasher. Much like a parallel form @Serializable that doesn't include descriptors or deserialization. This is not possible due to https://github.com/Kotlin/kotlinx.serialization/issues/1127 (@MetaSerializable doesn't accept a serializer class). I've ended up making Hasher implement KSerializer and throw on deserialization, so you can use @Serializable, which works. But it's awkward and the naming leaves much to be desired.

Kotlin / kotlinx.serialization

Hashing over serialization: MurMur, xxHash, Base64, MD5 #163