dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/
Other
753 stars 296 forks source link

Add other binsizes into hashtable than 1 bit and 8 bits. #51

Open ctb opened 11 years ago

ctb commented 11 years ago

Right now, Hashbits supports 1 bit per hashtable entry and CountingHash supports 8 bits per hashtable entry. 2 and 4 bits/entry would be great extensions that could enable substantially decreased memory usage in some useful circumstances; not sure if its worth doing it for arbitrary numbers of bits, or if there would be performance penalties.

emcd commented 11 years ago

We can exploit some efficiencies in the 1-, 2-, 4-, and 8-bit cases because those uniformly "tile" a byte. The other cases do not uniformly tile a byte (e.g., the first 3-bit value appears at bit 0 of byte 0, the second 3-bit value appears at bit 3 of byte 0, the third at bit 6 of byte 0, but the fourth appears at bit 1 (and not bit 0) of byte 1). Bitmasks are thus position-dependent in the latter cases, whereas they are either position-independent or unneeded in the former cases.

If one wanted a generalized solution (for n-bit counters, where n >= 1), then I would probably create a template with the generalized bitmask calculation and then create specializations of the template to selectively override this in the cases where we know we can do better.

emcd commented 11 years ago

Related to issue #27. As I see it, the choice of hash functions is independent of counter memory model. We may wish to split the hash table design into the hashing method used and accesses to the counter memory, since those are essentially orthogonal operations. Splitting the read parsers into data pumps and parsers helped us; I believe that a similar split for hashing and counters, as proposed above, would yield benefits as well.

ctb commented 11 years ago

Agree


C. Titus Brown, ctb@msu.edu

On Apr 22, 2013, at 18:55, Eric McDonald notifications@github.com wrote:

Related to issue #27. As I see it, the choice of hash functions is independent of counter memory model. We may wish to split the hash table design into the hashing method used and accesses to the counter memory, since those are essentially orthogonal operations. Splitting the read parsers into data pumps and parsers helped us; I believe that a similar split for hashing and counters, as proposed above, would yield benefits as well.

— Reply to this email directly or view it on GitHub.

ctb commented 9 years ago

1035 proposes using the MiniFloat standard to do this.

betatim commented 7 years ago

1551 implements a storage class with 4bits per entry