DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License
687 stars 266 forks source link

reads classified to both bacteria and mice genomes #781

Open luzhang321 opened 7 months ago

luzhang321 commented 7 months ago

Hi:) I ran kraken2.1.2 for my mice gut microbiome shotgun metagenomic analysis. I built a custom db by using the following command

kraken2-build --download-taxonomy --db refseq
kraken2-build --download-library archaea --db refseq --threads 8
kraken2-build --download-library bacteria --db refseq --threads 8 
kraken2-build --download-library plasmid --db refseq --threads 8 
kraken2-build --download-library viral --db refseq --threads 8 
kraken2-build --download-library fungi --db refseq --threads 8 
kraken2-build --download-library human --db refseq --threads 8 
kraken2-build --download-library UniVec_Core --db refseq --threads 8
awk '/^>/ {print ">kraken:taxid|10090|" substr($0, 2); next} 1' /GRCm39/ncbi-genomes-2022-11-21/GCF_000001635.27_GRCm39_genomic.fna > mouse/GCF_000001635.27_GRCm39_genomic_modified.fna

kraken2-build --add-to-library mouse/GCF_000001635.27_GRCm39_genomic_modified.fna --db refseq

kraken2 --threads 50 --db mouse/refseq/ --paired --gzip-compressed --confidence 0.05 --classified-out $filename#.fq $fq_dir$filename"_1.nohost.fastq.gz" $fq_dir$filename"_2.nohost.fastq.gz" --output $out_dir$filename".output.txt" --report $out_dir$filename".report.txt"

I ran the classification and found out that a large percentage of my reads (>80%) can be classified either from mice or this one bacteria species Curtobacterium flaccumfaciens. (I have performed host remove by bowtie2 before kraken2) Since there is output of the classified reads. I randomly selected reads and blast with nr database. This is a read classified by kraken2 as mice image This is a read classified by kraken2 as Curtobacterium flaccumfaciens image

From blast result, it seems there is some overlap of the result. Does this mean that the bacteria actually from the mice genome or kraken2 can't distinguish these 2 genomes due to similarity? Do you have any suggestions for this result?

example for blast @A00551:529:H5JGVDSX5:4:1101:9372:1125 kraken:taxid|10090 AATATGGCGAAGAAAACTGAAAAAGGTGGAATATTTAGAAATGTCCACTGTAGGACGTGGAATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAACATCCACATGACGACTTGAAAAATGACGAAATCACTAAAATACGTGAAAAA + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF

@A00551:529:H5JGVDSX5:4:1101:23547:1125 kraken:taxid|138532 TGAGAAACATCCACTTGACGACTTGAAAAATGACGAAATCACTAAAAAACCTGAAAAATGAGAAATGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCACGGAAAATGAGAAAAGATCGGAAGAGCACACGTCTGAACTC + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:FFFF:FFFFFFFFFFFFFFFFFFFF

Thanks in advance!

jenniferlu717 commented 6 months ago

The Curtobacterium flaccumfaciens genome is contaminated. That plasmid showing up in the blast results is actually pure mouse DNA. That plasmid should not be in the database but I'm checking the standard database.

luzhang321 commented 6 months ago

Hi, thanks very much for the quick reply. How can we know this is a contamination?

I checked my downloaded plasmid database, it is contained in my downloaded fna files of the plasmid library kraken2-build --download-library plasmid --db refseq --threads 8

grep -A 3 'CP045290.2' library.fna

NZ_CP045290.2 Curtobacterium flaccumfaciens pv. flaccumfaciens strain P990 plasmid pCff3, complete sequence TTTTCCGCGGATTTCAAGCCTTTATTGCTATATTACAGGTCCCAGTGTGCATTTGACTTC GCATGCTTTTAGTGATTTCATGATTAAGTCAGTTATGATGTTTATTTATGACATTTTCAG

I also checked my previous downloaded standard database(2022,sep) it is also included

grep CP045290.2 seqid2taxid.map 
kraken:taxid|138532|NZ_CP045290.2   138532

And I checked standard database(2023,sep) it is also included grep CP045290.2 seqid2taxid.map kraken:taxid|138532|NZ_CP045290.2 138532 NZ_CP045290.2 138532

jenniferlu717 commented 6 months ago

it must have made it through ncbi's checks and ended up in the database. We only use completed genomes, assuming that they are non-contaminated. It would need to be reported to NCBI