Storing reversible hashes for k>32

dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more

http://khmer.readthedocs.io/

Other

756 stars 295 forks source link

Storing reversible hashes for k>32 #1442

Closed betatim closed 7 years ago

betatim commented 8 years ago

Place for thoughts and progress on representing reversible hash values of (arbitrarily) large k-mers.

This is one of the things needed for #1441.

First idea from @camillescott is to switch from using a single int64 to a struct of integers.

standage commented 8 years ago

Another idea discussed was using the 128 bit integer implementation (__int128) discussed here.

Pros:

sticking with a "native" data type (on GCC and Clang at least)
could support quick, reversible hashing up to k=64 without perturbing the codebase too much

Cons:

cannot support arbitrary k
may be issues with non-standard compilers, OSes
there may be good reasons to support different hash functions anyway, so this would just be kicking the can down the road

Probably not the best idea in the long term, but worth bringing up.

betatim commented 8 years ago

Some experiments in this gist. This uses a int128 and int256 implementation I found by googling. They were useful to quickly get started. Having 33-mers seems to "just work" if you have a bigger int type available.

The second thing I investigated is how to transport these big integers into python land. Python's integer type has no limit and you can create them from an array of bytes _PyLong_FromByteArray (which despite underscore seems to be fairly stable). The reverse _PyLong_AsByteArray also exists. On the pure python side there is int.from_bytes and int.to_bytes which use these.

After this digging around I think implementing the big-ints with a byte array instead of a pair of smaller integers is the way to go given we need to make those arrays for going to/from python.

betatim commented 8 years ago

My only worry about using a byte array is that it might be slower?? I will make a small benchmark pitting the uint128_t from the gist against one based on an array to answer that.

Other remark: int.{to,from}_bytes is not available in legacy python. The C API however has been there since way back when the tim bot was still active.

ctb commented 8 years ago

blinks Well, this looks pretty awesome... What's the performance hit?

betatim commented 8 years ago

I don't know yet ;) but I might have just achieved one of the tests passing after switching _hash to use a class that uses a byte array to store stuff.

ctb commented 7 years ago

in #1519 and friends => too slow. #wontfix

ctb commented 7 years ago

(#1511 added irreversible hashing for k > 32)