Closed: betatim closed this issue 7 years ago.

Place for thoughts and progress on representing reversible hash values of (arbitrarily) large k-mers. This is one of the things needed for #1441.

First idea, from @camillescott: switch from using a single int64 to a struct of integers.
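A minimal sketch of what that could look like, assuming the usual 2-bits-per-base packing; the names here (Hash128, roll_base) are hypothetical, not khmer's actual code:

```cpp
#include <cstdint>

// Two 64-bit words standing in for one 128-bit hash value.
struct Hash128 {
    uint64_t lo;
    uint64_t hi;
};

// Shift the 128-bit value left by 2 bits (one base) and OR a new
// 2-bit base code into the bottom, carrying across the word boundary.
static inline Hash128 roll_base(Hash128 h, uint64_t code) {
    Hash128 r;
    r.hi = (h.hi << 2) | (h.lo >> 62);  // top 2 bits of lo spill into hi
    r.lo = (h.lo << 2) | (code & 3);
    return r;
}
```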
Another idea discussed was using the 128-bit integer implementation (__int128) discussed here.
Pros:
Cons:
Probably not the best idea in the long term, but worth bringing up.
Some experiments in this gist. It uses int128 and int256 implementations I found by googling; they were useful for getting started quickly. 33-mers seem to "just work" if you have a bigger int type available.
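Not the gist's code, but a sketch of why a bigger int type makes this work: with GCC/Clang's non-standard unsigned __int128 and 2 bits per base, a k-mer packs into one integer for any k up to 64. The encoding below is a hypothetical simplification, not khmer's real hash function.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical 2-bit base encoding; khmer's real mapping may differ.
static inline unsigned __int128 twobit(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // 'T'
    }
}

// Pack a k-mer (k <= 64) into a single 128-bit integer.
static unsigned __int128 pack_kmer(const std::string& kmer) {
    unsigned __int128 h = 0;
    for (char c : kmer)
        h = (h << 2) | twobit(c);  // roll each base into the low bits
    return h;
}

int main() {
    std::string kmer(33, 'G');  // a 33-mer needs 66 bits -- past uint64_t
    unsigned __int128 h = pack_kmer(kmer);
    std::printf("high: %llu  low: %llu\n",
                (unsigned long long)(h >> 64),
                (unsigned long long)(uint64_t)h);
    return 0;
}
```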
The second thing I investigated is how to transport these big integers into Python land. Python's integer type is unbounded, and you can create one from an array of bytes with _PyLong_FromByteArray (which, despite the leading underscore, seems to be fairly stable). The reverse, _PyLong_AsByteArray, also exists. On the pure Python side there are int.from_bytes and int.to_bytes, which use these.
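For illustration (a sketch, not khmer code), round-tripping a 16-byte little-endian hash value through these calls. Note these leading-underscore functions are private API, and very recent CPython releases have adjusted _PyLong_AsByteArray's signature; the five-argument form shown here is the long-standing one.

```cpp
#include <Python.h>

// Build a Python int from a 16-byte little-endian, unsigned buffer.
static PyObject* hash_to_pylong(const unsigned char bytes[16]) {
    return _PyLong_FromByteArray(bytes, 16, /*little_endian=*/1, /*is_signed=*/0);
}

// Fill a 16-byte buffer back from a Python int; returns 0 on success.
static int pylong_to_hash(PyObject* obj, unsigned char bytes[16]) {
    return _PyLong_AsByteArray((PyLongObject*)obj, bytes, 16,
                               /*little_endian=*/1, /*is_signed=*/0);
}
```

On the pure Python side, int.from_bytes(buf, "little") and value.to_bytes(16, "little") do the equivalent.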
After this digging around, I think implementing the big ints with a byte array instead of a pair of smaller integers is the way to go, given that we need to build those arrays anyway for going to and from Python.
My only worry about using a byte array is that it might be slower. I will make a small benchmark pitting the uint128_t from the gist against one based on an array to answer that.
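A sketch of such a benchmark, using the compiler's unsigned __int128 as a stand-in for the gist's uint128_t and a made-up byte-array type, timing only the core "shift in one base" operation (real numbers would depend on the actual implementations):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical byte-array representation: 16 bytes, little-endian.
struct Bytes128 {
    uint8_t b[16];
    // Shift left by 2 bits and OR a 2-bit code into the bottom.
    void roll(uint8_t code) {
        uint8_t carry = code;
        for (int i = 0; i < 16; ++i) {
            uint8_t next = b[i] >> 6;              // bits spilling upward
            b[i] = (uint8_t)((b[i] << 2) | carry);
            carry = next;
        }
    }
};

int main() {
    const long N = 100000000;  // 1e8 "roll one base" operations
    using clk = std::chrono::steady_clock;

    unsigned __int128 x = 0;
    auto t0 = clk::now();
    for (long i = 0; i < N; ++i)
        x = (x << 2) | (unsigned)(i & 3);
    auto t1 = clk::now();

    Bytes128 y;
    std::memset(y.b, 0, sizeof y.b);
    auto t2 = clk::now();
    for (long i = 0; i < N; ++i)
        y.roll((uint8_t)(i & 3));
    auto t3 = clk::now();

    auto ms = [](clk::time_point a, clk::time_point b) {
        return (long long)std::chrono::duration_cast<
            std::chrono::milliseconds>(b - a).count();
    };
    // Print a byte of each result so the loops are not optimized away.
    std::printf("__int128:   %lld ms (low byte %u)\n", ms(t0, t1),
                (unsigned)(uint8_t)x);
    std::printf("byte array: %lld ms (low byte %u)\n", ms(t2, t3),
                (unsigned)y.b[0]);
    return 0;
}
```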
Other remark: int.{to,from}_bytes is not available in legacy Python (Python 2). The C API, however, has been there since way back when the timbot was still active.
*blinks* Well, this looks pretty awesome... What's the performance hit?
I don't know yet ;) but I might have just gotten one of the tests passing after switching _hash to use a class that stores its value in a byte array.
Tried in #1519 and friends => too slow. #wontfix
(#1511 added irreversible hashing for k > 32)