kmer classification against multiple separate DBs vs single DB

ramiroricardo commented 10 months ago

Hi all,

We are working on a problem where we want to extract the taxonomic assignment of each k-mer and ignore the full sequence. We had planned to build a single database containing all the genomes we care about, but we are not sure that will be feasible given the computational requirements. However, since Kraken uses exact k-mer matches, I am wondering whether this matters. For example, if I classify some sequences against two separate DBs, or against a single DB containing all the genomes present in the other two, I think I should get the same results. Is this expectation correct?

thanks

salzberg commented 10 months ago

No, you might not get the same results if you use 2 separate, non-overlapping DBs. For example, if a k-mer is present in both DBs but in just 1 species in each one, then in the full DB that k-mer will be assigned to the lowest common ancestor of the 2 species. In the 2 separate DBs, the k-mer will be assigned at the species level to 2 different species.
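To make the difference concrete, here is a minimal Python sketch of the idea. It is a toy model, not Kraken's actual code: the taxonomy (`species_A`, `species_B`, `genus_X`), the sequences, the value of `K`, and the helpers `lineage`, `lca`, and `build_db` are all hypothetical, and it ignores details of the real tool such as reverse complements and how the database is stored. It only shows that a shared k-mer keeps a species-level label in each separate DB but moves up to the lowest common ancestor in the combined DB.

```python
# Toy taxonomy as child -> parent links (hypothetical names, not real taxa).
PARENT = {
    "species_A": "genus_X",
    "species_B": "genus_X",
    "genus_X": "root",
    "root": None,
}


def lineage(taxon):
    """Return the path from a taxon up to the root, inclusive."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Return the lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors_of_a = set(lineage(a))
    for node in lineage(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("no common ancestor")


def build_db(genomes, k):
    """Map each k-mer to the LCA of all taxa whose sequence contains it."""
    db = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            db[kmer] = lca(db[kmer], taxon) if kmer in db else taxon
    return db


K = 4  # toy k-mer length, chosen only to keep the example small
genome_a = {"species_A": "ACGTAC"}
genome_b = {"species_B": "TTACGT"}

db_a = build_db(genome_a, K)                        # separate DB 1
db_b = build_db(genome_b, K)                        # separate DB 2
db_full = build_db({**genome_a, **genome_b}, K)     # combined DB

shared = "ACGT"  # the only k-mer present in both toy genomes
print(db_a[shared], db_b[shared], db_full[shared])
```

In this toy setup, the k-mers unique to one species get the same label in the separate and combined DBs; only the shared k-mer differs. So per-k-mer results from separate DBs would need an extra LCA step across DBs before they could be compared with the combined DB.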

ramiroricardo commented 10 months ago

Hi @salzberg, thanks a lot for the quick reply and for clearing up my misunderstanding.

DerrickWood / kraken2

kmer classification against multiple separate DBs vs single DB #757