More than 1000 tandem repeat reads were assigned to a species: is it reliable?

DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system

More than 1000 tandem repeat reads were assigned to a species: is it reliable? #562

Open lingxuan85511 opened 2 years ago

lingxuan85511 commented 2 years ago

I used Kraken2+Bracken with a prebuilt database (PlusPF) to quantify the microbial composition of my metagenomic data. More than 1000 reads were assigned to species A. After I extracted the paired-end reads related to species A by mapping with bowtie2, I found that all of them are tandem repeats. Since low-complexity reads are less informative, is this result reliable? How can I remove the impact of tandem repeat reads when using Kraken2+Bracken?

jenniferlu717 commented 2 years ago

What you could do is mask the reads with dust prior to classification. We do mask the database sequences themselves, but that does not prevent all of these assignments.
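To illustrate the idea behind masking, here is a minimal Python sketch of a simplified DUST-style score that flags low-complexity reads before classification. It is not the dust tool itself (which masks windows within a sequence rather than scoring whole reads); the function names and the `threshold=2.0` cutoff are assumptions chosen for illustration.

```python
from collections import Counter


def dust_score(seq: str) -> float:
    """Simplified DUST-style score over a whole read.

    Counts overlapping triplets; repeated triplets raise the score,
    so a higher value means lower complexity.
    """
    seq = seq.upper()
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    if len(triplets) < 2:
        return 0.0
    counts = Counter(triplets)
    # Each triplet seen c times contributes c*(c-1)/2 repeated pairs.
    repeated_pairs = sum(c * (c - 1) / 2 for c in counts.values())
    return repeated_pairs / (len(triplets) - 1)


def is_low_complexity(seq: str, threshold: float = 2.0) -> bool:
    """Flag a read as low complexity (threshold is an illustrative assumption)."""
    return dust_score(seq) > threshold
```

A read such as `"AC" * 10` scores well above the cutoff, while a read with mostly distinct triplets scores close to zero. For real data, a dedicated masking tool is likely the better choice; this sketch only shows why tandem repeats stand out.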

You could also rerun Kraken 2 with --report-minimizer-data, which adds distinct minimizer counts to the report: https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#distinct-minimizer-count-information
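As a rough sketch of how that report could be screened, the Python below parses a report written with `--report-minimizer-data` and lists taxa that have many reads but few distinct minimizers, which is the pattern repeat-driven assignments tend to show. It assumes the column layout as I understand it from the manual (percentage, clade reads, taxon reads, minimizers, distinct minimizers, rank code, taxid, name, tab-separated). The helper names and both thresholds are hypothetical and would need tuning for real data.

```python
from typing import Iterable, List, NamedTuple


class TaxonRow(NamedTuple):
    clade_reads: int
    distinct_minimizers: int
    rank: str
    taxid: str
    name: str


def parse_report(lines: Iterable[str]) -> List[TaxonRow]:
    """Parse report lines, assuming the 8-column minimizer-data layout."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # skip lines that do not match the assumed layout
        rows.append(TaxonRow(
            clade_reads=int(fields[1]),
            distinct_minimizers=int(fields[4]),
            rank=fields[5],
            taxid=fields[6],
            name=fields[7].strip(),  # names are indented by taxonomic depth
        ))
    return rows


def suspicious_taxa(rows: List[TaxonRow],
                    min_reads: int = 100,
                    max_minimizers_per_read: float = 0.5) -> List[TaxonRow]:
    """Return taxa with many reads but few distinct minimizers per read.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    return [
        r for r in rows
        if r.clade_reads >= min_reads
        and r.distinct_minimizers / r.clade_reads < max_minimizers_per_read
    ]
```

Usage would look like `suspicious_taxa(parse_report(open("sample.kreport")))`, where `sample.kreport` is a hypothetical report file name. A species supported by over 1000 reads but only a handful of distinct minimizers is probably not a reliable call.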