DerrickWood / kraken

Kraken taxonomic sequence classification system
http://ccb.jhu.edu/software/kraken/
GNU General Public License v3.0

Query Annotation based on mini kraken and custom kraken Databases #25

Closed by nalandaatmi 9 years ago

nalandaatmi commented 9 years ago

Dear Derrick,

Query regarding annotation: My metagenomic forward and reverse FASTQ files contain 20 million reads. After removing plant-like reads from the input FASTQ files with the fastq_screen pipeline, 4 million reads remained. I then provided this FASTQ file (4 million reads) as input to the metAMOS pipeline. The FCP option annotated those reads, but neither the custom Kraken databases nor minikraken annotated them as expected. What could be the reason?

For the initial FASTQ files (with 20 million reads), however, the custom Kraken database built from the nt database annotated the reads correctly.

(Attached image: custom kraken nt database for 20 million reads)

I tried four different databases with the metAMOS pipeline, each on these 4 million reads:

1) minikraken database (DB size 4.5 GB): I received an output with no hits in the annotation. (Attached image: minikraken)

2) Custom Kraken database (bacterial, viral, archaeal, fungal) (DB size 105 GB). (Attached image: custom krakendb bacteria archaea viral and fungal)

3) Custom Kraken database (nt database from NCBI) (DB size 604 GB). (Attached image: custom kraken nt database)

4) FCP database. (Attached image: annotation based on fcp database)

DerrickWood commented 9 years ago

Hi, generally when Kraken leaves a read without a classification, that's because there's no sequence with close enough homology to find a match. It would not surprise me if the Naive Bayes method used by FCP (I believe that's what metAMOS would use here) is slightly more sensitive than Kraken. However, such a large jump in classification percentage between Kraken and FCP makes me suspect that FCP is overpredicting (i.e., lacking precision) on this dataset. Unfortunately, I can't state anything conclusively without actually looking through the data.
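To put a number on the "classification percentage" being compared, one option is to count classified versus unclassified reads directly from Kraken's per-read output. The following is only a minimal sketch: it assumes the standard tab-separated Kraken output in which the first column is `C` (classified) or `U` (unclassified). The function name `classified_fraction` and the sample reads are hypothetical and not part of Kraken or metAMOS.

```python
import io


def classified_fraction(lines):
    """Count classified reads in Kraken per-read output.

    Assumes each non-empty line is tab-separated and starts with
    'C' (classified) or 'U' (unclassified).
    Returns (classified, total, fraction).
    """
    classified = 0
    total = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        status = line.split("\t", 1)[0]
        if status not in ("C", "U"):
            raise ValueError(f"unexpected status column: {status!r}")
        total += 1
        if status == "C":
            classified += 1
    fraction = classified / total if total else 0.0
    return classified, total, fraction


# Made-up sample records, for illustration only (not real results).
sample = io.StringIO(
    "C\tread1\t562\t100\t562:70\n"
    "U\tread2\t0\t100\t0:70\n"
    "U\tread3\t0\t100\t0:70\n"
    "C\tread4\t1280\t100\t1280:70\n"
)
classified, total, fraction = classified_fraction(sample)
print(f"{classified}/{total} reads classified ({fraction:.1%})")
```

In practice the function could be given an open file handle for the real Kraken output instead of the in-memory sample. Running it on the output from each database gives rates that can be compared against what FCP reports, though a higher rate alone says nothing about precision.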