dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/
Other
742 stars 295 forks source link

Support for k>32 #1426

Closed standage closed 7 years ago

standage commented 7 years ago

Supporting k values > 32 has been on the table for a while (since #27 and #60 in 2013). Some progress was made on this with #624, but this PR has stalled.

I'd like to suggest that we prioritize this feature.

From my reading of the relevant threads, there seem to be a couple of possible approaches and concerns.

Am I reading things correctly? Anything I've missed?

ctb commented 7 years ago

Yep, seems about right. The reversibility is the biggest problem; for probably 95% of the code that we actually currently use, we could go straight for a cyclic hash of some sort and it wouldn't matter, because all we're doing is indexing into hash tables. But there is a lot of code that relies on reversibility and we have to figure out what to do with it, even if it's not necessarily highly used code at the moment. Unfortunately I am also probably the primary expert on this code (the partitioning code)...

My primary concern with support for multiple hash functions is around performance, note; virtual functions and the like can give significant performance hits.

I suspect there are some relatively small refactorings that could be done to make this all much easier, but I don't know enough C++ to be confident in the right approach here.

camillescott commented 7 years ago

On Aug 30, 2016 7:17 AM, "C. Titus Brown" notifications@github.com wrote:

Yep, seems about right. The reversibility is the biggest problem; for probably 95% of the code that we actually currently use, we could go straight for a cyclic hash of some sort and it wouldn't matter, because all we're doing is indexing into hash tables. But there is a lot of code that relies on reversibility and we have to figure out what to do with it, even if it's not necessarily highly used code at the moment. Unfortunately I am also probably the primary expert on this code (the partitioning code)...

My primary concern with support for multiple hash functions is around performance, note; virtual functions and the like can give significant performance hits.

I suspect there are some relatively small refactorings that could be done to make this all much easier, but I don't know enough C++ to be confident in the right approach here.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/dib-lab/khmer/issues/1426#issuecomment-243453801, or mute the thread https://github.com/notifications/unsubscribe-auth/ACwxrYRn4iuLeF2n20WA-ck1O7K815k7ks5qlDt3gaJpZM4JuWt6 .

standage commented 7 years ago

Regarding the first point, I think I grok the idea of hashing in pieces (similar to the ARCH_32 approach taken in Mash's hash function?). What I'm not so clear on is how to then cast these multiple values as a single value to use as a table index. Probably very simple, but not clear to me at the moment.

Regarding the second point, I did a bit of "research" earlier today and also found templates (and "functors") to be the most performant alternative to the function pointer vs virtual function question. I have zero experience with C++ templating though, so I hope @camillescott and @luizirber don't mind it I bug them with questions on the topic!

standage commented 7 years ago

The templating will also come in handy with respect to providing BAM support. SeqAn has a BAM parser in addition to the Fast[a|q] parser already in use, and I'm sure IParser::imprint_next_read is called enough that we want to avoid any potential virtual function overhead there.

betatim commented 7 years ago

Cyclic hash == https://en.wikipedia.org/wiki/Rolling_hash ?

Is there a setup to benchmark things (in terms of cache misses, etc) as the hash functions are swapped around? Because: measuring gives you a leg up on experts who are too good to measure

@ctb where would you start looking for the "multiple small refactorings"?

ctb commented 7 years ago

On Wed, Aug 31, 2016 at 09:19:03AM -0700, Tim Head wrote:

Cyclic hash == https://en.wikipedia.org/wiki/Rolling_hash ?

https://github.com/dib-lab/khmer/pull/624

Is there a setup to benchmark things (in terms of cache misses, etc) as the hash functions are swapped around? Because: measuring gives you a leg up on experts who are too good to measure

No benchmarking setup, and yes, I think adding some benchmarking is the place to start.

This paper of ours (https://github.com/dib-lab/2013-khmer-counting) might be a place to look for benchmarking. And/or @luizirber has some benchmark data sets for his HLL paper, http://biorxiv.org/content/early/2016/06/07/056846.

@ctb where would you start looking for the "multiple small refactorings"?

Sadly it probably involves some diagramming and class structure investigation... The core idea is that the Hashtable class contains support for both the core hashing and also the partitioning code (SubsetPartition etc.) - this latter code is complex, somewhat ugly, and not that widely used (although it is rather well tested IMO). We can leave that last code at k <= 32 without any problem, if we can figure out how to separate the hash function requirements out...

It's a bit of a tangled mess ATM, is what I'm saying.

ctb commented 7 years ago

As a trial, I removed most of the code that depends on reversibility in #1431, and then eliminated the sliding window optimization and _revhash code in #1432. Almost all tests pass at this point! Then #1433 swapped in murmurhash as the default hash function, which of course breaks a bunch of stuff. Most of it looks like stuff that should be broken (and hence should be removed or refactored when using different hash functions), but there are a few things that bear closer examination.

At this point, we should be able to re-enable the sliding window optimizations in KmerIterator with any rolling hash function. For example, this one https://github.com/lemire/rollinghashcpp/blob/master/example.cpp should work.

ctb commented 7 years ago

Note, one of the problems was that our DNA sanitizing code was broken in places. Sigh. See #1434.

ctb commented 7 years ago

Anyway, a question for the C++-ofiles amongst us is: how hard would it be to refactor the C++ code base so that the code removed in #1431 was isolated in a ghetto with its own hash function? The critical changes are the four member structures removed from Hashtable (in hashtable.hh), which could be placed in a subclass or a friend class with all of the dependent functions quite easily I suspect.

ctb commented 7 years ago

OK, pretty much done playing around with hash code refactoring for now :). Here's my summary.


Ignoring the partitioning code (test-removed in #1431), there are three kinds of things we need to do:

  1. lossy hashing of individual k-mers for use in Bloom filters & CountMin sketches. Essentially any hash function will do here.
  2. lossy hashing of all k-mers in a string, quickly, to load many k-mers into Bloom filters & CountMin sketches. Here, any rolling hash will do. The class StringToHashIterator in #1437 shows how to do this generically.
  3. index into & traverse the graph from k-mers. This logic is all encoded in the class Traverser, in lib/traversal.*. This is the tricky one, because we need to be able to track the exact k-mer in order to prepend or extend. The class KmerFactory shows how to build these, and the class KmerIterator does so using a rolling hash function. The main cost here appears to be in tracking the exact k-mer efficiently - doing so with a string is (of course) super slow.

The first and second points can be dealt with easily, and are more or less done in #1437.

For the third point, it seems to me like @camillescott's comment is a good way to go: implement a fast new data type that lets you store large k-mers exactly in 2-bit form, and go back and forth to strings when needed. The important operations here would be prepending and appending individual bases (to support get_right and get_left in Traverser) along with the machinery to track and update a rolling hash value (essentially as Traverser::get_left and ::get_right does in master).

With only a little extra effort, this new data type might be able to resolve the issues in the partitioning code as well, although I am OK with ignoring that for the moment.

If @camillescott can take a look at all of this in the next few days and check me, maybe we can schedule a chat about it online with @betatim and @standage so that we're all on the same page and then figure out where to go from there?

ctb commented 7 years ago

For the binary encoding for DNA, we might use 3 bits per DNA base instead of just two, or use a more complicated encoding overall; it would be nice to potentially allow for Ns in the sequence.

ctb commented 7 years ago

Talking with @camillescott, one question is whether or not we should support lossy hashing at all, or just go with the bitstring encoding and reversible hashing needed for Traverser. I'm not sure - my suspicion is that the lossy hashing code could be much faster and consume less memory. In particular, the HashSet code is being used to store many bags of 100k+ k-mer hashes for compact De Bruijn graph work, and we can probably use 32-bit integers to store those, which could end up saving a lot of space.

I'm not clear on how to do this, but if we can build a simple functor interface such that we can specify when we need reversibility, we can probably fit it all in without too much complexity.

ctb commented 7 years ago

Also: for the partitioning code, we need to support STL set membership, so I think all we need to implement is a perfect multi-long-long bitstring encoding that supports == and <.

ctb commented 7 years ago

Hmm, actually, we need to support the following uses of HashIntoType:

typedef std::set<HashIntoType> SeenSet;
typedef std::map<HashIntoType, PartitionID*> PartitionMap;
typedef std::queue<HashIntoType> NodeQueue;
typedef std::multimap<HashIntoType, Label> TagLabelMap;
typedef std::pair<HashIntoType, Label> TagLabelPair;
typedef std::pair<Label, HashIntoType> LabelTagPair;

so we need < (weak ordering), copying, and assignment, I believe.

ctb commented 7 years ago

I am also unclear on what, exactly, needs to be done on the Python side to support something like the reversible Kmer class. Right now we take full advantage of the mapping between C++ HashIntoType and Python native UnsignedLongLong (ParseTuple type "K"). Perhaps another advantage of the lossy hashing approach is that for we won't need to make the Python API more complicated for it.

ctb commented 7 years ago

Irreversible hashing for k > 32 added internally in #1511, along with not-incredibly-painful support for different hash functions. Now, on to the script support! (#1623 and friends)