dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/
Other
749 stars 294 forks source link

K-mer banding, bit patterns, and the fast (2-bit) hash #1761

Open standage opened 7 years ago

standage commented 7 years ago

It was pretty clear early on in testing the banding functionality that it would not play nicely with the 2-bit hashing scheme in khmer. My first approach to banding was to look at bit patterns (for band i \in 2^N, do the N least significant bits in this k-mer match i?), but since have settled on more flexible numeric range tests. In either case, the direct two-bit encoding of sequence content is problematic. The fact that two nearly identical sequences can hash to two nearly identical integer values is kind of antithetical to the idea of hashing. Things worked much more nicely when we used banding on murmur3-hashed k-mers.

During a chat this morning, @ctb and I entertained the possibility of revisiting the idea of banding 2-bit hashed k-mers using a modified bit pattern approach. Rather than looking at the least significant bits of the hashed k-mer, could we test against an arbitrary bit pattern across all 64 bits?

For this to work, we'd need to satisfy the following criteria.

So, is this scheme viable? How would we generate the bit patterns and test k-mers against a given bit pattern?

ctb commented 7 years ago

I'm now skeptical this could work; I'm not sure how well something like XOR would work to randomize things.

standage commented 7 years ago

Just to be clear: the benefits of banding are manifold and not discussed here. The benefit of using the 2-bit hash is that it's much faster than MurmurHash3, and compatible with the graph stuff.

betatim commented 7 years ago

An idea that sacrifices speed but might be simple to implement: feed the 2bit hash value to murmurhash and use the output to decide the band a kmer is in. In the sketch you still store stuff with the 2bit hash so you keep the graph functionality. Edit: Maybe you can use a 16bit (or similarly low bit) murmurhash version and improve the speed a bit?

I think "generate two 'unrelated' values from two very similar inputs" is the definition of what a hash function does. So it is unlikely we will discover a different way of doing this?

What surprised me while doing some benchmarking for #1760 is that FNV hash seems no faster than murmurhash?? Or at the very least my benchmark is too noisy to see a speed up. It seems like you'd be hard pressed to come up with an even simpler hashing scheme than FNV in terms of mathematical operations.