Closed by ctb 7 years ago.
Bump - could we get a brief status update from @betatim and @camillescott?
From my side the next thing to merge in order to have >32-mers is #1455. This gives us storage for hashes >64 bit, but it is slower. I think I made a mistake with my benchmark in https://github.com/dib-lab/khmer/pull/1444#issuecomment-247038018, which suggests building our own `BigHashType` is a waste of time and we should use `std::bitset`, in particular for large numbers of bits like 128.

The blocker for all things >32-mer is that it slows things down. I propose making a duplicate of #1455 which uses `std::bitset` and then comparing the performance of these two against `uint64` to decide what to do.
The problem with using `bitset` is that we need to do some digging to be able to convert it to a `PyLong`. There is no easy way to access the underlying memory to feed to `PyLong_FromBytes`.
On Mon, Oct 10, 2016 at 01:39:26AM -0700, Tim Head wrote:

> From my side the next thing to merge in order to have >32-mers is #1455. This gives us storage for hashes >64 bit, but it is slower. I think I made a mistake with my benchmark in https://github.com/dib-lab/khmer/pull/1444#issuecomment-247038018, which suggests building our own `BigHashType` is a waste of time and we should use `std::bitset`, in particular for large numbers of bits like 128. The blocker for all things >32-mer is that it slows things down. I propose making a duplicate of #1455 which uses `std::bitset` and then comparing the performance of these two against `uint64` to decide what to do.

This sounds good to me.
On the inside `std::bitset` also packs the bits into integers, just like the homebrew `BigHashType`. My bias would be to use something maintained by someone else (`bitset`).
Either of the two is ready for review, but we should decide which one. I'm not sure how to make them faster right now: profiling suggests all the time is spent in `<<` and `>>`, so short of doing fewer of those I don't see an obvious win.
The takeaway from the benchmark numbers is that it is hard to beat bit-shifting plain integers; that is why master is miles ahead. Is the speed difference big enough that we want to cook up something that uses a plain `uint64` for k <= 32 and only switches to the more expensive type for k > 32? How close is the benchmark to a real science use case (would the conclusions change if we had to read data from disk)? The benchmark script says:

```
K=32
including conversion to PyLong
[1.7609929150021344, 1.750087672000518, 1.7280221319997509]
consume only
[1.2745242400014831, 1.2970973509982286, 1.2887841869996919]

K=32
including conversion to PyLong
[37.2649885949977, 37.072750736999296, 37.13048480799989]
consume only
[9.170250012997712, 9.15772684500189, 9.391962263998721]

K=32
including conversion to PyLong
[6.496041435999359, 6.391305409997585, 6.418443644000945]
consume only
[5.0048870949976845, 4.986413504000666, 4.950539015000686]
```
@luizirber, do you have any thoughts on why there is such a difference in speed between `std::bitset` and `BigHashType`?
Basics are now done (see #1511).
Problem under discussion:
The two proximal use cases are Daniel’s need for k > 32 for human genomics work, and Camille’s work on assembly and traversal. Titus is also interested in supporting k > 32 for the compact de Bruijn graph work.
See: https://github.com/dib-lab/khmer/issues/1426
A few notes:
Current proposal:
TODO items:

1. First, Camille to trial template/functor code separation with Tim’s active kibitzing. Maybe at this point we can support irreversible k > 32 and merge that in? For consideration.
2. Simultaneously, Tim to do a bitstring representation implementation with existing and/or new hash functions to support reversible k > 32, among other stuff. (#1442)
3. Pause & review at that point.
4. Titus may see about splitting out the partitioning code.

Goal is to do things in small mergeable chunks.
Questions/comments: