DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

protocol results can not be reproduced #725

Open shishanfu opened 1 year ago

shishanfu commented 1 year ago

Hi, thank you for this great tool.

I have several questions about using it and hope someone can help.

Question 1:

When I followed the steps of the protocol's pathogen workflow, I could not reproduce the results provided by the protocol. Although the final positive result is consistent, the proportion of unclassified reads has increased, and the number of reads assigned to specific pathogens has decreased significantly, as shown in the table below.


Sample | Total Reads | Unclassified Reads | Unclassified % | Classified Reads | Classified % | Bacteria | Archaea | Virus | Fungi | Amoeba | True Infection | Z-score Species | Taxid | Reads | Z-score
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
SRR12486971_protocol | 3,664,512 | 2,899,189 | 79.10% | 765,323 | 20.90% | 649,685 | 34 | 228 | 85,396 | 45 | Anncaliia algerae | Anncaliia algerae | 723287 | 84,409 | 56930
SRR12486971_test | 3,664,512 | 3,136,905 | 85.60% | 527,607 | 14.40% | 461,762 | 13 | 129 | 53,679 | 17 | Anncaliia algerae | Anncaliia algerae | 723287 | 53075 | 53080
SRR12486972_protocol | 7,594,644 | 7,285,624 | 95.90% | 309,020 | 4.10% | 150,302 | 10 | 50 | 31,602 | 54 | Aspergillus flavus | Aspergillus flavus | 5059 | 3,814 | 3814
SRR12486972_test | 7,594,644 | 7,415,687 | 97.64% | 178,957 | 2.36% | 104,352 | 2 | 12 | 23,427 | 26 | Aspergillus flavus | Aspergillus flavus | 5059 | 3028 | 3028

Data download: fastq-dump --split-files SRR12486971 (The total number of reads downloaded from ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR124/071/SRR12486971/SRR12486971_1.fastq.gz, the URL given in `Kraken_pathogen.ipynb`, is lower than the total in the results obtained from the protocol. Was any processing done on the data?)

Software version: Kraken version 2.1.2

Database: k2_standard_eupath_20201202.tar.gz

Analysis process: kraken2 --db k2protocol_db --threads 8 --minimum-hit-groups 3 --report SRR12486971.k2report --paired SRR12486971_1.fastq SRR12486971_2.fastq > SRR12486971.kraken2

What could be the reason for a decrease in reads assigned to pathogens?
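One way to narrow this down is to compare the two `.k2report` files taxid by taxid rather than by the summary totals. The following is a minimal sketch, assuming the default six-column Kraken 2 report layout (percent, clade reads, direct reads, rank code, taxid, name, tab-separated). The function names are hypothetical and not part of Kraken 2.

```python
def clade_reads_by_taxid(report_text):
    """Map taxid -> clade-level read count from a Kraken 2 report.

    Assumption: default report layout with six tab-separated columns
    (percent, clade reads, direct reads, rank code, taxid, name).
    Reports written with extra minimizer columns would need different indices.
    """
    counts = {}
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        clade_reads = int(fields[1].strip())
        taxid = int(fields[4].strip())
        counts[taxid] = clade_reads
    return counts


def compare_reports(report_a, report_b, taxids):
    """Return {taxid: (reads_in_a, reads_in_b)} for the requested taxids.

    A taxid missing from a report is counted as 0 reads.
    """
    a = clade_reads_by_taxid(report_a)
    b = clade_reads_by_taxid(report_b)
    return {t: (a.get(t, 0), b.get(t, 0)) for t in taxids}
```

For example, reading the protocol's report and my own report into strings and calling `compare_reports(protocol_text, test_text, [723287, 5059])` would show the clade counts for Anncaliia algerae and Aspergillus flavus side by side.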

Question 2:

In addition to using the k2_standard_eupath_20201202.tar.gz database, I built a custom database based on the standard library plus some complete genomes of pathogenic microorganisms from GenBank. When I analyzed the test dataset with the custom database, I found that reads assigned to Streptococcus agalactiae (taxid 1311) in negative samples increased significantly. What could be the possible reason for this?

Sample | SRR12486971 | SRR12486972 | SRR12486974 | SRR12486978 | SRR12486979 | SRR12486981 | SRR12486983 | SRR12486988 | SRR12486989 | SRR12486990
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
Classified Reads | 204 | 288 | 324 | 97 | 169 | 574 | 301 | 378 | 1630 | 631

Question 3:

Why isn't EuPathDB included in the standard library? How can I add EuPathDB to a custom database?

Thanks a lot

jenniferlu717 commented 12 months ago

Human reads were screened/removed twice by running bowtie2. Did you run bowtie2 to remove human reads before classification?
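As a rough illustration of a single host-removal pass, here is a sketch that only assembles a bowtie2 command line; it does not run anything. The flags (`-p`, `-x`, `-1`, `-2`, `--un-conc`, `-S`) are standard bowtie2 options, but the index prefix and output file names are placeholders, and the exact parameters used in the protocol are not stated in this thread. The helper name is hypothetical.

```python
import shlex


def bowtie2_host_filter_cmd(index_prefix, reads_1, reads_2, out_pattern,
                            sam_out, threads=8):
    """Build a bowtie2 command that keeps read pairs which do NOT align
    concordantly to the host index (written via --un-conc).

    out_pattern must contain '%': bowtie2 replaces it with 1 or 2 to name
    the per-mate output files.
    """
    if "%" not in out_pattern:
        raise ValueError("out_pattern must contain '%' for the mate number")
    return [
        "bowtie2",
        "-p", str(threads),
        "-x", index_prefix,        # placeholder: prefix of a human bowtie2 index
        "-1", reads_1,
        "-2", reads_2,
        "--un-conc", out_pattern,  # pairs that fail to align concordantly
        "-S", sam_out,             # alignments of the host-like reads
    ]


# Placeholder names; only the SRR12486971 FASTQ names come from the thread.
cmd = bowtie2_host_filter_cmd(
    "human_index", "SRR12486971_1.fastq", "SRR12486971_2.fastq",
    "SRR12486971_nonhuman_%.fastq", "SRR12486971_human.sam",
)
print(shlex.join(cmd))
```

The resulting `SRR12486971_nonhuman_1.fastq` and `SRR12486971_nonhuman_2.fastq` would then be passed to `kraken2 --paired` in place of the raw files. A second pass could be built the same way by feeding those outputs back in as `reads_1` and `reads_2`.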

The increase in reads for S. agalactiae is likely due to a new Streptococcus genome that was included recently.

EuPathDB is a special set of genomes that were manually screened to remove any contamination. It is not updated very often, so we cannot include it in the standard database.
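For adding extra genomes to a custom database, Kraken 2 needs each sequence to be linked to a taxonomy ID. One documented option is an explicit `kraken:taxid|<taxid>` tag in the FASTA sequence ID. The sketch below shows one way to add that tag; the function name is hypothetical, and it assumes every sequence in the file belongs to the same taxid.

```python
def tag_fasta_with_taxid(fasta_text, taxid):
    """Append '|kraken:taxid|<taxid>' to the sequence ID of every FASTA header.

    Assumption: all sequences in fasta_text belong to the same taxon.
    Headers that already carry a kraken:taxid tag are left unchanged.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Split the header into the sequence ID and the free-text description.
            seq_id, _, description = line[1:].partition(" ")
            if "kraken:taxid|" not in seq_id:
                seq_id = f"{seq_id}|kraken:taxid|{taxid}"
            line = ">" + seq_id + (" " + description if description else "")
        out.append(line)
    return "\n".join(out) + "\n"
```

The tagged file can then be added with `kraken2-build --add-to-library <file> --db <dbname>`, followed by `kraken2-build --build --db <dbname>` once the taxonomy and the other libraries are in place.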