Open jonbra opened 4 years ago
Can you try building a database from NCBI's 'nt' database? Many taxa are not represented in the plant genomes database (which I think Kraken2 uses for the 'plant' database build).
I would expect reads to map to other members of the phylum. The material I have previously analyzed with Kraken2 came from metagenomic sources, and it is clear that we frequently get reads mapping to sister taxa when the suspected target taxon is not in the database.
With reasonable sequence homology it is not too surprising that many kmers would match between related taxa.
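To make the k-mer point concrete, here is a toy Python sketch. It is not how Kraken2 is implemented (Kraken2 uses minimizers and a compact hash table), and the two sequences are made up for illustration: a "target" and a "sister" that differ by a single substitution. It only shows that most k-mers of a read from the target still occur in the sister.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_fraction(read, reference, k):
    """Fraction of the read's distinct k-mers that also occur in the reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & kmers(reference, k)) / len(read_kmers)


# Made-up toy sequences: 'sister' is 'target' with one substitution (G -> C).
target = "ACGTTGCAAGCTTAGGCTAACGGATCCGTA"
sister = "ACGTTGCAAGCTTAGCCTAACGGATCCGTA"

# A read drawn from the target still shares many 5-mers with the sister,
# so a database holding only the sister could still attract this read.
print(shared_fraction(target, sister, k=5))
```

Only the k-mers that overlap the substituted base are lost; everything else still matches the sister.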
Is there anything else strange about the genome of the green alga you are targeting? Repeats, transposons, polyploidy or other variation might explain some difficulties.
Thanks, we're trying out nt now. I've been wondering about this myself: to what extent would genomic reads map to closely related taxa? I guess it depends on how similar their genomes are, and that's different in every case. We really don't know much about our genome. But we have sequenced MDA-amplified DNA, which might introduce sequence artifacts. I just hope that not all the unclassified reads are artifacts.
Regarding this issue, is it possible that some of my sequences are assigned to a sister taxon instead of the real one if there is no information about the real one in the database?
I would think this is highly likely. As long as two species share identical stretches of sequence, I guess Kraken cannot tell which taxon a read matching that sequence comes from. I assume the k-mers used are long enough that this is rarely a problem, but closely related taxa can have highly similar genome sequences.
Thanks! Now I wonder whether it was a good idea to set the k-mer size smaller than the default. The database was too large to build with the default values... Do you think this could affect the classification?
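On the k-mer size question, a rough way to see the trade-off is to count how many k-mers two diverged sequences still share as k shrinks. The sketch below is a toy model, not Kraken2's algorithm: the sequence is random, the helper `mutate_every` is hypothetical, and the evenly spaced substitutions (about 90% identity) are an artificial assumption. As far as I know Kraken2's default k is 35; the other values are just for comparison.

```python
import random


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_fraction(query, reference, k):
    """Fraction of the query's distinct k-mers that also occur in the reference."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0
    return len(query_kmers & kmers(reference, k)) / len(query_kmers)


def mutate_every(seq, step):
    """Hypothetical helper: substitute every step-th base with a different base."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(
        swap[base] if (i + 1) % step == 0 else base
        for i, base in enumerate(seq)
    )


rng = random.Random(0)  # fixed seed so the toy example is repeatable
target = "".join(rng.choice("ACGT") for _ in range(300))
relative = mutate_every(target, 10)  # one substitution every 10 bases

# Shorter k-mers survive between substitutions, longer ones do not.
for k in (35, 25, 15, 8):
    print(k, round(shared_fraction(target, relative, k), 2))
```

In this toy setting, a smaller k lets more reads match a diverged relative (fewer unclassified reads), but the matches are also less specific, so assignments to the wrong taxon become more likely.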
Hi,
I need some clarification regarding the unclassified reads, and how to classify reads when my species is not present in the database.
I have sequenced the genome of a green alga (Chlorophyta) that is not present in NCBI. I downloaded the plant, bacteria and protozoa databases, and the vast majority of my reads are unclassified, with very few classified as plants, chlorophytes or bacteria. Can I take this to mean that the unclassified reads belong to my target organism, or should I expect these reads to be classified within Chlorophyta as well?
Thanks!
Jon