GuillaumeHolley / BloomFilterTrie

An alignment-free, reference-free and incremental data structure for colored de Bruijn graph with application to pan-genome indexing.
MIT License
43 stars 6 forks source link

Length k (for k-mers) must be a multiple of 9 #3

Closed Phelimb closed 8 years ago

Phelimb commented 8 years ago

Why is it that kmers need to be a multiple of 9 ?

It's not entirely obvious to me from reading the paper if this is a theoretical limitation or a practical one?

I can see from the code that kmer size needs to be a multiple of the NB_CHAR_SUF_PREF - can this be arbitrarily modified?

GuillaumeHolley commented 8 years ago

Hi Phelim,

Sorry for the late answer (holidays).

There is indeed this limitation that k must be a multiple of 9 and this is a purely practical limitation in the current version of the BFT. It allows to use a fixed length of suffix prefixes to store in compressed containers for an optimal usage of the memory and facilitate a lot of computations for a better insertion / look-up time.

Indeed, the k-mer size in the code is a multiple of the (macro definition) NB_CHAR_SUF_PREF (which is 9) but cannot be modified: It is like this in the code for better code reading / understanding.

Let me know if you have other questions ;)

Guillaume

Phelimb commented 8 years ago

Thanks! I will ping you if I have other questions.