Closed by ctb 7 years ago.
Bump - could we get a brief status update from @betatim and @camillescott?
From my side the next thing to merge in order to have >32-mers is #1455. This gives us storage for hashes >64 bit, but it is slower. I think I made a mistake with my benchmark in https://github.com/dib-lab/khmer/pull/1444#issuecomment-247038018, which suggests building our own `BigHashType` is a waste of time and we should use `std::bitset`, in particular for large numbers of bits like 128.

The blocker for all things >32-mer is that it slows things down. I propose making a duplicate of #1455 which uses `std::bitset` and then comparing the performance of these two against `uint64` to decide what to do.
The problem with using `bitset` is that we need to do some digging to be able to convert it to a `PyLong`. There is no easy way to access the underlying memory to feed to `PyLong_FromBytes`.
On Mon, Oct 10, 2016 at 01:39:26AM -0700, Tim Head wrote:

> From my side the next thing to merge in order to have >32-mers is #1455. This gives us storage for hashes >64 bit, but it is slower. I think I made a mistake with my benchmark in https://github.com/dib-lab/khmer/pull/1444#issuecomment-247038018, which suggests building our own `BigHashType` is a waste of time and we should use `std::bitset`, in particular for large numbers of bits like 128. The blocker for all things >32-mer is that it slows things down. I propose making a duplicate of #1455 which uses `std::bitset` and then comparing the performance of these two against `uint64` to decide what to do.

This sounds good to me.
On the inside `std::bitset` also packs the bits into integers, just like the homebrew `BigHashType`. My bias would be to use something maintained by someone else (`bitset`).
Either of the two is ready for review, but we should decide which one. I'm not sure how to make them faster right now: profiling suggests all the time is spent in `<<` and `>>`, so short of doing fewer of those I don't see an obvious win.
The takeaway from the benchmark numbers is that it is hard to beat bit-shifting plain integers; that is why master is miles ahead. Is the speed difference big enough that we want to cook up something that uses a plain `uint64` for k <= 32 and only switches to the more expensive type for k > 32? How close is the benchmark to a real science use case (would the conclusions change if we had to read data from disk)? The benchmark script says:

```
K=32
including conversion to PyLong
[1.7609929150021344, 1.750087672000518, 1.7280221319997509]
consume only
[1.2745242400014831, 1.2970973509982286, 1.2887841869996919]

K=32
including conversion to PyLong
[37.2649885949977, 37.072750736999296, 37.13048480799989]
consume only
[9.170250012997712, 9.15772684500189, 9.391962263998721]

K=32
including conversion to PyLong
[6.496041435999359, 6.391305409997585, 6.418443644000945]
consume only
[5.0048870949976845, 4.986413504000666, 4.950539015000686]
```
@luizirber, do you have any thoughts on why there is such a difference in speed between `std::bitset` and `BigHashType`?
Basics are now done (see #1511).
Problem under discussion:
The two proximal use cases are Daniel’s need for k > 32 for human genomics work, and Camille’s work on assembly and traversal. Titus is also interested in supporting k > 32 for the compact de Bruijn graph work.
See: https://github.com/dib-lab/khmer/issues/1426
A few notes:
Current proposal:
TODO items:

1. First, Camille to trial template/functor code separation with Tim’s active kibitzing. Maybe at this point we can support irreversible k > 32 and merge that in? For consideration.
2. Simultaneously, Tim to do a bitstring representation implementation with existing and/or new hash functions to support reversible k > 32, among other stuff. (#1442)
3. Pause & review at that point.
4. Titus may see about splitting out the partitioning code.

Goal is to do things in small mergeable chunks.
Questions/comments: