caozhichongchong / arg_ranker

MIT License
23 stars 11 forks source link

Why did I get the same result with two pieces of code? #9

Closed Rooobben closed 1 year ago

Rooobben commented 1 year ago

When I used arg_ranker -i $INPUT -kkdb $KRAKENDB or arg_ranker -i $INPUT -kkdb $KRAKENDB -kkdbtype 16S, I got the same output. I had downloaded the kraken2 16S database and standard database separately.

How could I slove it, thanks a lot.

caozhichongchong commented 1 year ago

Hi,

Thank you for reaching out! Could you please share with me the code in arg_ranking/script_output/arg_ranker.sh for both kraken2 16S database and standard database?

Best regards, Anni

Rooobben commented 1 year ago

The code in arg_ranker.sh for kraken2 16S database

diamond blastx --query /home/arg_ranker/INPUT/2_1.fastq --db /home/soft/anaconda3/lib/python3.7/site-packages/arg_ranker/data/SARG.db.fasta.dmnd --out /home/arg_ranker/16s/arg_ranking/search_output/2_1.fastq.diamond.txt --outfmt 6 --max-target-seqs 1 --evalue 1e-07 --id 80 --query-cover 75 --threads 48

python /home/soft/anaconda3/lib/python3.7/site-packages/arg_ranker/bin/Extract.MG.py -p 1 -i /home/arg_ranker/INPUT -f 2_1.fastq -n .diamond.txt -r /home/arg_ranker/16s/arg_ranking/search_output/

kraken2 --db /home/data/kraken2_16s/ /home/arg_ranker/INPUT/2_1.fastq --output /home/arg_ranker/16s/arg_ranking/search_output/2_1.fastq.kraken --report /home/arg_ranker/16s/arg_ranking/search_output/2_1.fastq.kraken.kreport --threads 48

blastx -query /home/arg_ranker/16s/arg_ranking/search_output/2_1.fastq.diamond.txt.aa -db /home/soft/anaconda3/lib/python3.7/site-packages/arg_ranker/data/SARG.db.fasta -out /home/arg_ranker/16s/arg_ranking/search_output/2_1.fastq.blast.txt -outfmt 6 -evalue 1e-07 -num_threads 48

python /home/soft/anaconda3/lib/python3.7/site-packages/arg_ranker/bin/Filter.MG.py --g F -i /home/arg_ranker/16s/arg_ranking/search_output/ -f /home/arg_ranker/16s/arg_ranking/search_output/2_1.fastq.blast.txt -db /home/soft/anaconda3/lib/python3.7/site-packages/arg_ranker/data/SARG.db.fasta -dbf 2 -s 1 --ht 75 --id 80 --e 1e-07

The code in arg_ranker.sh for kraken2 standard database

diamond blastx --query /home/arg_ranker/INPUT/2_1.fastq --db /home/soft/anaconda3/lib/python3.7/site-packages/arg_ranker/data/SARG.db.fasta.dmnd --out /home/arg_ranker/arg_ranking/search_output/2_1.fastq.diamond.txt --outfmt 6 --max-target-seqs 1 --evalue 1e-07 --id 80 --query-cover 75 --threads 48

python /home/soft/anaconda3/lib/python3.7/site-packages/arg_ranker/bin/Extract.MG.py -p 1 -i /home/arg_ranker/INPUT -f 2_1.fastq -n .diamond.txt -r /home/arg_ranker/arg_ranking/search_output/

kraken2 --db /home/data/kraken/ /home/arg_ranker/INPUT/2_1.fastq --output /home/arg_ranker/arg_ranking/search_output/2_1.fastq.kraken --report /home/arg_ranker/arg_ranking/search_output/2_1.fastq.kraken.kreport --threads 48

/home/data/MicrobeCensus/build/scripts-3.7/run_microbe_census.py -t 48 /home/arg_ranker/INPUT/2_1.fastq /home/arg_ranker/arg_ranking/search_output/2_1.fastq.AGS.txt

blastx -query /home/arg_ranker/arg_ranking/search_output/2_1.fastq.diamond.txt.aa -db /home/soft/anaconda3/lib/python3.7/site-packages/arg_ranker/data/SARG.db.fasta -out /home/arg_ranker/arg_ranking/search_output/2_1.fastq.blast.txt -outfmt 6 -evalue 1e-07 -num_threads 48

python /home/soft/anaconda3/lib/python3.7/site-packages/arg_ranker/bin/Filter.MG.py --g F -i /home/arg_ranker/arg_ranking/search_output/ -f /home/arg_ranker/arg_ranking/search_output/2_1.fastq.blast.txt -db /home/soft/anaconda3/lib/python3.7/site-packages/arg_ranker/data/SARG.db.fasta -dbf 2 -s 1 --ht 75 --id 80 --e 1e-07

Thanks for your reply.

caozhichongchong commented 1 year ago

Thank you for providing me more information!

The code seems right to me.

Could you please run: head /home/arg_ranker/16s/arg_ranking/search_output/2_1.fastq.kraken.kreport head /home/arg_ranker/arg_ranking/search_output/2_1.fastq.kraken.kreport grep 'average_genome_size' /home/arg_ranker/arg_ranking/search_output/2_1.fastq.AGS.txt

Could you please also share with me part of the results of Sample_ranking_results.txt and maybe Sample_ARGpresence.txt in dir /home/arg_ranker/arg_ranking/ and /home/arg_ranker/16s/arg_ranking/?

Thank you!

Rooobben commented 1 year ago

Yes, I checked the code when I used them.

head /home/arg_ranker/16s/arg_ranking/search_output/2_1.fastq.kraken.kreport 99.60 30578337 30578337 U 0 unclassified 0.40 124209 0 R 1 root 0.40 124137 18651 D 3 Bacteria 0.30 92900 12053 P 34 Firmicutes 0.26 80758 37801 C 225 Bacilli 0.14 42440 21799 O 521 Bacillales 0.07 20327 73 F 995 Staphylococcaceae 0.07 20232 19007 G 1738 Staphylococcus 0.00 490 490 S 2720 Staphylococcus sciuri 0.00 368 368 S 2714 Staphylococcus aureus

head /home/arg_ranker/arg_ranking/search_output/2_1.fastq.kraken.kreport 15.51 4761943 4761943 U 0 unclassified 84.49 25940603 3625 R 1 root 84.47 25934931 2140 R1 131567 cellular organisms 84.36 25899720 80154 D 2 Bacteria 71.40 21922316 6211 D1 1783272 Terrabacteria group 61.24 18802113 8057 P 1239 Firmicutes 61.20 18789458 13974 C 91061 Bacilli 60.57 18596186 29768 O 1385 Bacillales 60.43 18554237 103190 F 90964 Staphylococcaceae 36.52 11212529 69313 G 2803850 Mammaliicoccus

grep 'average_genome_size' /home/arg_ranker/arg_ranking/search_output/2_1.fastq.AGS.txt average_genome_size: 3120104.1834389744

/home/arg_ranker/arg_ranking/Sample_ranking_results.txt Sample Rank_I_per Rank_II_per Rank_III_per Rank_IV_per Unassessed_per Total_abu Rank_code Rank_I_risk Rank_II_risk Rank_III_risk Rank_IV_risk Unassessed_risk 2_1.fastq 5.1E-02 3.0E-04 1.2E-01 7.1E-01 1.1E-01 3.4E+00 1.7-0.1-0.8-1.6-0.3 1.7 0.1 0.8 1.6 0.3

/home/arg_ranker/arg_ranking/Sample_ARGpresence.txt |ARG_gene_family | ARG_antibiotics | 2_1.fastq 1405331A|aminoglycosidestreptomycin resistance protein|aminoglycoside|0.00335937208184641 1708210A|beta-lactamclass C beta-lactamase|beta-lactam|0

/home/arg_ranker/16s/arg_ranking/Sample_ranking_results.txt Sample Rank_I_per Rank_II_per Rank_III_per Rank_IV_per Unassessed_per Total_abu Rank_code Rank_I_risk Rank_II_risk Rank_III_risk Rank_IV_risk Unassessed_risk 2_1.fastq 5.1E-02 3.0E-04 1.2E-01 7.1E-01 1.1E-01 3.6E-01 1.7-0.1-0.8-1.6-0.3 1.7 0.1 0.8 1.6 0.3

/home/arg_ranker/16s/arg_ranking/Sample_ARGpresence.txt |ARG_gene_family | ARG_antibiotics | 2_1.fastq 1405331A|aminoglycosidestreptomycin resistance protein|aminoglycoside|0.000348188589221019 1708210A|beta-lactamclass C beta-lactamase|beta-lactam|0.0000442773218356269

Cause for confusion about the results is because I used ARgS-OAP 2.0 for detection of ARGs about same samples before. Different standardization methods produced different results.

Thanks for your reply.

caozhichongchong commented 1 year ago

Thank you for providing more details!

I think the results make sense that different ways of computing bacterial copy number should only change the total abundance of ARGs (Total_abu), but not the relative abundance of ARGs compared to other ranks (Rank_I_per, Rank_code, and Rank_I_risk).

From your results, the total abundance of ARGs computed based on standard kraken database = 3.4E+00 and that computed by kraken 16S database = 3.6E-01.

In your sample, there's a 10-fold difference between:

Thus, there's a 10-fold difference in total ARG abundance reported by these two metrics.

I think ARGs-OAP v2.0 include a lot more ARGs in their database, and they use coverage (read depth) on essential genes to compute bacterial cell copy number. What's their total ARG abundance?

Thank you!

Rooobben commented 1 year ago

Thank you for your patience. In the results of ARGs-OAP 2.0, the abundance of ARG in sample 2 was 0.93 based on the 16s rRNA copy normalization, and 9.671 based on the cell copy normalization, which was about 10 times higher.

Rooobben commented 1 year ago

But I found that in ARGs-OAP 2.0 calculating the average 16S rRNA gene copy number by copy number information according to CopyRighter database, and in arg_ranker compute the abundance of ARGs as copy number of ARGs per 16S by kraken2 16S database. Does the difference matter? Because for the same sample, in arg_ranker, the number of reads mapped to bacterial 16S is 124137*2, but in ARGs-OAP 2.0, #of16Sreads (meta_data_online.txt) is 17296.3.

Thanks for your reply.

caozhichongchong commented 1 year ago

Thank you for sharing!

It makes sense to see different results in terms of normalization. Those two pipelines use different 16S databases (I think kraken's is more updated) and different searching algorithms (I think short read mapping for kraken, blast for ARGs-OAP). It will be fine as long as you use the same pipeline to evaluate/compare your samples :) because it would be problematic to directly compare values generated by different pipelines.

Hope it helps! Anni