dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/
Other
756 stars 295 forks source link

Storing reversible hashes for k>32 #1442

Closed betatim closed 7 years ago

betatim commented 8 years ago

Place for thoughts and progress on representing reversible hash values of (arbitrarily) large k-mers.

This is one of the things needed for #1441.

First idea from @camillescott is to switch from using a single int64 to a struct of integers.

standage commented 8 years ago

Another idea discussed was using the 128 bit integer implementation (__int128) discussed here.

Pros:

Cons:

Probably not the best idea in the long term, but worth bringing up.

betatim commented 8 years ago

Some experiments in this gist. This uses a int128 and int256 implementation I found by googling. They were useful to quickly get started. Having 33-mers seems to "just work" if you have a bigger int type available.

The second thing I investigated is how to transport these big integers into python land. Python's integer type has no limit and you can create them from an array of bytes _PyLong_FromByteArray (which despite underscore seems to be fairly stable). The reverse _PyLong_AsByteArray also exists. On the pure python side there is int.from_bytes and int.to_bytes which use these.

After this digging around I think implementing the big-ints with a byte array instead of a pair of smaller integers is the way to go given we need to make those arrays for going to/from python.

betatim commented 8 years ago

My only worry about using a byte array is that it might be slower?? I will make a small benchmark pitting the uint128_t from the gist against one based on an array to answer that.

Other remark: int.{to,from}_bytes is not available in legacy python. The C API however has been there since way back when the tim bot was still active.

ctb commented 8 years ago

blinks Well, this looks pretty awesome... What's the performance hit?

betatim commented 8 years ago

I don't know yet ;) but I might have just achieved one of the tests passing after switching _hash to use a class that uses a byte array to store stuff.

ctb commented 7 years ago

in #1519 and friends => too slow. #wontfix

ctb commented 7 years ago

(#1511 added irreversible hashing for k > 32)