DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

Clarification about the unclassified reads #234

Open jonbra opened 4 years ago

jonbra commented 4 years ago

Hi,

I need some clarification regarding the unclassified reads, and how to classify reads when my species is not present in the database.

I have sequenced the genome of a green alga (Chlorophyta) that is not present in NCBI. I downloaded the plant, bacteria and protozoa databases, and the vast majority of my reads are unclassified, with very few classified as plants, chlorophytes or bacteria. Can I interpret this to mean that the unclassified reads belong to my target organism, or should I expect these reads to be classified within Chlorophyta as well?

Thanks!

Jon

rsh249 commented 4 years ago

Can you try building a database from NCBI's 'nt' database? Many taxa are not represented in the plant genomes database (which I think Kraken2 uses for the 'plant' database build).

I would expect reads to map to other members of the phylum. The material I have previously analyzed with Kraken2 came from metagenomic sources, and it is clear that we frequently get reads mapping to sister taxa when the suspected target taxon is not in the database.

With reasonable sequence homology it is not too surprising that many kmers would match between related taxa.
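To put a rough number on the point about shared kmers, here is a minimal Python sketch that measures what fraction of a query's kmers also occur in a reference. Everything in it is an assumption made for illustration: the helper names (`kmer_set`, `shared_kmer_fraction`, `substitute_every`) are hypothetical, the "genome" is random toy sequence, and the evenly spaced substitutions are a crude stand-in for real divergence. It is not how Kraken2 itself stores or looks up kmers.

```python
import random


def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(query, reference, k):
    """Fraction of distinct query k-mers that also occur in the reference."""
    query_kmers = kmer_set(query, k)
    if not query_kmers:
        return 0.0
    return len(query_kmers & kmer_set(reference, k)) / len(query_kmers)


def substitute_every(seq, step):
    """Toy divergence model: change every step-th base to a different base."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(
        swap[base] if i % step == step - 1 else base
        for i, base in enumerate(seq)
    )


# Hypothetical toy data: a random 2 kb "reference" and progressively
# more diverged "relatives" derived from it.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(2000))
for step in (200, 50, 20):
    relative = substitute_every(reference, step)
    print(step, shared_kmer_fraction(relative, reference, 35))
```

Under this toy model, one substitution every 20 bases leaves essentially no intact 35-base window, so even modest divergence can wipe out exact kmer matches. Real genomes diverge unevenly, so conserved regions would still share kmers while fast-evolving regions would not.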

Is there anything else strange about the genome of the green alga you are targeting? Repeats, transposons, polyploidy or other variation might explain some difficulties.

jonbra commented 4 years ago

Thanks, we're trying out nt now. I've been wondering about this myself: to what extent would genomic reads map to closely related taxa? I guess it depends on how similar their genomes are, and that's different in every case. We really don't know much about our genome. But we have sequenced MDA-amplified DNA, which might introduce sequence artifacts. I just hope that not all the unclassified reads are artifacts.

SumTot commented 4 years ago

Regarding this issue, could it be possible that some of my sequences are assigned to a sister taxon instead of the real one if there is no information about the real one in the database?

jonbra commented 4 years ago

I would think this is highly likely. As long as two species share identical stretches of sequence, I guess Kraken is not able to tell which taxon a read matching this sequence comes from. I assume the kmers used are long enough that this is rarely a problem, but closely related taxa can have highly similar genome sequences.
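For what it's worth, Kraken's documented approach is to assign a kmer found in several genomes to the lowest common ancestor (LCA) of those genomes. Below is a greatly simplified Python sketch of that idea. The taxonomy, the two toy "genomes", and the function names are all hypothetical, and it leaves out minimizers, the compact hash table, and the step that scores a read's hits to pick a final label.

```python
# Hypothetical toy taxonomy: child -> parent.
PARENT = {
    "root": None,
    "Chlorophyta": "root",
    "species_A": "Chlorophyta",
    "species_B": "Chlorophyta",
    "Bacteria": "root",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, taxon first."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors:
            return taxon
    raise ValueError("taxa do not share a root")


def build_index(genomes, k):
    """Map each k-mer to the LCA of all taxa whose genome contains it."""
    index = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            index[kmer] = lca(index[kmer], taxon) if kmer in index else taxon
    return index


def kmer_hits(read, index, k):
    """Per-k-mer lookups for a read; None means the k-mer is not in the index."""
    return [index.get(read[i:i + k]) for i in range(len(read) - k + 1)]


# Hypothetical toy genomes that share their first six bases.
genomes = {"species_A": "ACGTACGGA", "species_B": "ACGTACCTT"}
index = build_index(genomes, 5)
print(kmer_hits("ACGTACG", index, 5))
print(kmer_hits("TTTTTT", index, 5))
```

In this sketch, kmers shared by both species resolve to Chlorophyta rather than to either species, kmers unique to one genome keep that species label, and kmers absent from every genome return None. That last case is the toy analogue of an unclassified read, and a read from an absent sister species would only hit the kmers it happens to share with what is in the database.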

SumTot commented 4 years ago

Thanks! I wonder now whether I did the right thing by setting the kmer size smaller than the default. The database was too large to build with the default values... Do you think this could affect the classification?
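One way to get a feel for the kmer-size question is a back-of-the-envelope estimate of how often a kmer would match a database purely by chance. The Python sketch below assumes uniformly random, independent sequence and a hypothetical database of 10^10 distinct kmers; the function name is made up for illustration. It ignores homology, canonical kmers and minimizers, so it only describes random matches, not Kraken2's actual behaviour. As far as I know, k = 35 is the Kraken2 default for nucleotide databases.

```python
import math


def chance_hit_probability(k, n_db_kmers):
    """Probability that one random k-mer equals at least one of n_db_kmers
    independent, uniformly random k-mers (a deliberately crude model)."""
    p_single = 1.0 / 4 ** k
    # 1 - (1 - p)^n, computed stably for very small p.
    return -math.expm1(n_db_kmers * math.log1p(-p_single))


N_DB_KMERS = 10 ** 10  # hypothetical database size, for illustration only

for k in (35, 31, 25, 21, 15):
    print(k, chance_hit_probability(k, N_DB_KMERS))
```

Under these assumptions the chance-match probability rises steeply as k shrinks, so a shorter kmer would be expected to trade specificity for sensitivity. How much that matters in practice depends on how much smaller than the default the value is and on the database contents.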