dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/

Plan for hash function refactoring #1441

Closed ctb closed 7 years ago

ctb commented 8 years ago

Problem under discussion:

The two proximal use cases are Daniel’s need for k > 32 for human genomics work, and Camille’s work on assembly and traversal. Titus is also interested in supporting k > 32 for the compact De Bruijn graph work.

See: https://github.com/dib-lab/khmer/issues/1426

A few notes:

Current proposal:

TODO items:

First, Camille to trial the template/functor code separation with Tim's active kibitzing. Maybe at this point we can support irreversible k > 32 and merge that in? For consideration.

Simultaneously, Tim to do bitstring representation implementation with existing and/or new hash function to support reversible k > 32 among other stuff. (#1442)
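
The "reversible" part of the bitstring idea can be sketched as the classic 2-bit nucleotide packing. This is an illustrative sketch, not khmer's actual implementation; `pack` and `unpack` are hypothetical names:

```python
# Illustrative 2-bit packing: each base costs 2 bits, so a k-mer
# occupies 2*k bits and k <= 32 fits in a single 64-bit word.
ENCODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
DECODE = 'ACGT'

def pack(kmer):
    """Pack a k-mer into an integer, most significant base first."""
    h = 0
    for base in kmer:
        h = (h << 2) | ENCODE[base]
    return h

def unpack(h, k):
    """Invert pack(); possible only because no information is discarded."""
    bases = []
    for _ in range(k):
        bases.append(DECODE[h & 3])
        h >>= 2
    return ''.join(reversed(bases))
```

A reversible scheme like this is what lets you recover the k-mer string from its hash; a lossy hash function (the "irreversible" option above) gives up that property in exchange for a fixed-width value.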

Pause & review at that point.

Titus may see about splitting out the partitioning code.

Goal is to do things in small mergeable chunks.

Questions/comments:

ctb commented 8 years ago

Bump - could we get a brief status update from @betatim and @camillescott?

betatim commented 8 years ago

From my side, the next thing to merge in order to have >32-mers is #1455. This gives us storage for hashes >64 bits, but it is slower. I think I made a mistake with my benchmark in https://github.com/dib-lab/khmer/pull/1444#issuecomment-247038018, which suggested that building our own BigHashType is a waste of time and that we should use std::bitset, in particular for large numbers of bits like 128.

The blocker for all things >32-mer is that it slows things down. I propose to make a duplicate of #1455 which uses std::bitset and then compare the performance of these two against uint64 to decide what to do.

betatim commented 8 years ago

The problem with using bitset is that we need to do some digging to be able to convert it to a PyLong. There is no easy way to access the underlying memory to feed to PyLong_FromBytes.
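
As an illustration of what that conversion needs, here is the pure-Python equivalent of the round-trip; `int.to_bytes`/`int.from_bytes` stand in for the C-level byte access that std::bitset does not expose:

```python
# Building a Python int (PyLong) from a wide hash requires its raw
# bytes. A plain integer word exposes them trivially; bitset does not.
h = (1 << 100) | 0b1011            # a hash value wider than 64 bits
raw = h.to_bytes(16, 'little')     # 128 bits laid out as 16 bytes
roundtrip = int.from_bytes(raw, 'little')
```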

ctb commented 8 years ago

On Mon, Oct 10, 2016 at 01:39:26AM -0700, Tim Head wrote:

> From my side the next thing to merge in order to have >32-mers is #1455. This gives us storage for hashes >64bit but it is slower. I think I made a mistake with my benchmark in https://github.com/dib-lab/khmer/pull/1444#issuecomment-247038018 which suggests building our own BigHashType is a waste of time and we should use std::bitset. In particular for large numbers of bits like 128.
>
> The blocker for all things >32mer is that it slows things down. I propose to make a duplicate of #1455 which uses std::bitset and then compare performance of these two vs uint64 to decide what to do.

This sounds good to me.

betatim commented 8 years ago

Internally, bitset also packs the bits into integers, just like the homebrew solution. My bias would be to use something maintained by someone else (bitset).

Both implementations are ready for review, but we should decide which one to pursue. I'm not sure how to make them faster right now: profiling suggests all the time is spent in << and >>, so short of doing fewer of those I don't see an obvious win.
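
For reference, the kind of inner loop where those shifts dominate can be sketched like this; an illustrative Python stand-in for the C++ consume path, with a hypothetical `consume` helper:

```python
# Rolling k-mer update while consuming a sequence: one shift, one OR
# and one mask per base. These are exactly the ops that get expensive
# once the hash no longer fits in a single machine word.
ENCODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def consume(seq, k):
    mask = (1 << (2 * k)) - 1      # keep only the last k bases
    h = 0
    hashes = []
    for i, base in enumerate(seq):
        h = ((h << 2) | ENCODE[base]) & mask
        if i >= k - 1:             # first full k-mer is ready at i = k-1
            hashes.append(h)
    return hashes
```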

The takeaway from the benchmark numbers is that it is hard to beat bitshifting plain integers; that is why master is miles ahead. Is the speed difference big enough that we want to cook up something that uses plain uint64 for k<=32 and only switches to the more expensive type for k>32? And how close is the benchmark to a real science use case (would the conclusions change if we had to read data from disk)? Benchmark script says:

K=32, times in seconds, three runs each (rounded):

| implementation | including conversion to PyLong | consume only |
| --- | --- | --- |
| master 🏆 | 1.761, 1.750, 1.728 | 1.275, 1.297, 1.289 |
| bitset | 37.265, 37.073, 37.130 | 9.170, 9.158, 9.392 |
| bighashtype | 6.496, 6.391, 6.418 | 5.005, 4.986, 4.951 |

betatim commented 8 years ago

@luizirber do you have any thoughts on why there is such a difference in terms of speed between these two?
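
One plausible factor (an assumption on my part, not a profiled answer): a shift on a 128-bit value stored as multiple 64-bit words takes several operations plus carry handling, where a plain uint64 shift is a single instruction. A Python sketch of the word-pair arithmetic, with hypothetical names:

```python
# Shifting a 128-bit value held as a (lo, hi) pair of 64-bit words:
# two shifts, an OR for the carried bits, and two masks, versus one
# instruction for a single uint64.
MASK64 = (1 << 64) - 1

def shift_left_2(lo, hi):
    """Shift a 128-bit (lo, hi) pair left by 2 bits."""
    new_hi = ((hi << 2) | (lo >> 62)) & MASK64  # carry top 2 bits of lo
    new_lo = (lo << 2) & MASK64
    return new_lo, new_hi
```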

ctb commented 7 years ago

Basics are now done (see #1511).