caozhichongchong / arg_ranker

MIT License
23 stars 11 forks source link

Different results when running arg_ranker v3.0.1 #11

Closed LeabaeL closed 1 year ago

LeabaeL commented 1 year ago

When I used arg_ranker -i $INPUT -m metadata.txt or arg_ranker -I $INPUT -kkdb $KRAKWNDB running the examples, I got different results.   The EsCo_genome.fasta result was the same as the README.md file. However, if I don’t uncompress the .zip file, I can not get arg_ranker results with WEE300. When I uncompressed the file, the results it’s not the same as what shows in the README.md file.

So I wonder if the reason for the different data processing results is because of the database update.

caozhichongchong commented 1 year ago

Hi,

Thank you for reaching out!

Could you please paste the result you got? If not, maybe let me know which columns are not the same (e.g. Rank_IV_per, Total_abu, Rank_I_risk)?

Best regards, Anni

LeabaeL commented 1 year ago

Hi Anni,

My operating systems are Ubuntu 22.04 and CentOS Linux 7, ant the software version list are as followed:

This is the result of running arg_ranker -i test and sh ./in/arg_ranking/script_output//arg_ranker.sh code in the undecompressed wee300_all-trim-decont_1.fastq file

1

This is the result of running the arg_ranker -i test and sh ./in/arg_ranking/script_output//arg_ranker.sh directive in unzipping the wee300_all-trim-decont_1.fastq file

2

This is the result of running the arg_ranker -i in -kkdb ./$KRAKENDB and sh ./in/arg_ranking/script_output//arg_ranker.sh directive in unzipping the wee300_all-trim-decont_1.fastq file

2

Best regards, LeabaeL

caozhichongchong commented 1 year ago

Hi LeabaeL,

Thank you for sharing more information! I just uploaded v3.0.2 and hopefully it can give us the right result. Could you please upgrade arg_rankker and try to run the examples again? pip install arg_ranker --upgrade

If updating arg_ranker can't solve the problem, could you please share arg_ranking/Sample_ARGpresence.txt with me? Thank you!

Best regards, Anni

LeabaeL commented 1 year ago

Hi Anni:

Thanks for your reply. I just update arg_ranker on my desktop, however the result remain the same as yesterday.

Attached are my results files.

Best regards, Leabael

---Original--- From: "Anni @.> Date: Thu, Mar 9, 2023 04:02 AM To: @.>; Cc: @.**@.>; Subject: Re: [caozhichongchong/arg_ranker] Different results when runningarg_ranker v3.0.1 (Issue #11)

Hi LeabaeL,

Thank you for sharing more information! I just uploaded v3.0.2 and hopefully it can give us the right result. Could you please upgrade arg_rankker and try to run the examples again? pip install arg_ranker --upgrade

If updating arg_ranker can't solve the problem, could you please share arg_ranking/Sample_ARGpresence.txt with me? Thank you!

Best regards, Anni

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

caozhichongchong commented 1 year ago

Hi Leabael,

Thank you for trying the update. Maybe email attachment was ignored by github, could you re-attach the result here?

Thank you!

Best regards, Anni

LeabaeL commented 1 year ago

Hi Anni:

Thanks for your reply. I updated arg_ranker on my desktop, however the result remains the same as yesterday.

Attached are my results files.

Sample_ARGpresence.txt

Best regards,

LeabaeL

caozhichongchong commented 1 year ago

Thank you for sharing, LeabaeL.

My result reported much fewer ARGs compared to yours. I think the different versions of diamond (mine is 0.9.36) or blast (2.13.0+) might be the reason.

Could you please run wc -l arg_ranking /search_output/WEE300_all-trimmed-decont* and send me the result? (py37) [anniz44@c3ddb-globus arg_ranking]$ wc -l search_output/WEE300_all-trimmed-decont* 13 search_output/WEE300_all-trimmed-decont_1.fastq.AGS.txt 12059 search_output/WEE300_all-trimmed-decont_1.fastq.blast.txt 349 search_output/WEE300_all-trimmed-decont_1.fastq.blast.txt.filter 381 search_output/WEE300_all-trimmed-decont_1.fastq.diamond.txt 762 search_output/WEE300_all-trimmed-decont_1.fastq.diamond.txt.aa 1237469 search_output/WEE300_all-trimmed-decont_1.fastq.kraken 5739 search_output/WEE300_all-trimmed-decont_1.fastq.kraken.kreport 1256772 total

Thank you!

Best regards, Anni

LeabaeL commented 1 year ago

Hi Anni:

As you say the problem is with inconsistent diamond versions.

I have run wc -l arg_ranking /search_output/WEE300_all-trimmed-decont* in the original environment and here is the result.

6c5ea81de1ce77b9fec42aa44b56df1

And I create a new environment.

Here are the results:

Only changed the diamond version.

0d4e1b75d6c488bda38c415865723d0

As you can see, most results are the same, however, it seems like that Total_abu has a mistake which is unequal with README.md.

056e913cde0595422ac6b5e81212658

Thanks for your help!

Best regards, LeabaeL

LeabaeL commented 1 year ago

I would also be interested to know as there are different results due to different versions of diamond. As far as the test data is concerned, the higher version of diamond seems to annotate higher arg abundance. So which version of diamond would you recommend for practical use?

caozhichongchong commented 1 year ago

Hi LeabaeL,

Thank you for taking the time to test different versions of diamond!

Based on the results, I suspect that the difference in Total_abu values could be due to missing data from MicrobeCensus. Could you please check if you have the file arg_ranking/search_output/WEE300_all-trimmed-decont_1.fastq.AGS.txt? If this file is missing, you may need to install MicrobeCensus by git clone https://github.com/snayfach/MicrobeCensus && cd MicrobeCensus && python setup.py install.

If the file exists, could you please verify if we have the same MicrobeCensus results? grep 'average_genome_size' arg_ranking/search_output/WEE300_all-trimmed-decont_1.fastq.AGS.txt _[anniz44@c3ddb-globus search_output]$ grep 'average_genome_size' WEE300_all-trimmed-decont_1.fastq.AGS.txt average_genomesize: 5504511.121548271

If the MicrobeCensus results are the same, you could also verify if we have the same Kraken results. grep 'Bacteria' arg_ranking/search_output/WEE300_all-trimmed-decont_1.fastq.kraken.kreport _[anniz44@c3ddb-globus search_output]$ grep 'Bacteria' WEE300_all-trimmed-decont1.fastq.kraken.kreport 33.52 414860 1141 D 2 Bacteria

I recommend using diamond version 0.9.36, as this version was used to evaluate the ARG risk ranking in our manuscript. However, if you're interested, you can experiment with different diamond versions to see when the blast results differ significantly. If you do observe a significant difference, you can check the changes made in that version to understand the cause.

I'll try to investigate this further when I have more availability. In the meantime, I'll update the README to clarify which diamond version should be used.

This discussion is super helpful! Thank you!

Best regards, Anni

LeabaeL commented 1 year ago

Hi Anni:

Thanks for your answer. I will continue to explore the influence of diamond version on the results in the future.

I checked the result and download MicrobeCensus. The result is same.

屏幕截图 2023-03-21 170008

But, the result of Kraken is not the same.

屏幕截图 2023-03-21 170023

For now, the Total_abu has turned into 6.8E+02 after installing MicrobeCensus, and other results remain the same.

Also I found that, my Kraken2_standard_database only have 2GB, after use this code kraken2-build --standard --db ./Kraken_standard_database

Best regards, LeabaeL

caozhichongchong commented 1 year ago

Hi LeabaeL,

Thank you for keeping investigating into this!

I am concerned about the results from the Kraken database. The one I'm using is approximately 13GB in size (2019 version). Do you think it would be helpful to try downloading the Kraken database again?

_[anniz44@c3ddb-globus script_output]$ ls -lh /scratch/users/anniz44/scripts/database/krakendb total 42G -rw-rw-r-- 1 anniz44 mit_alm 1.8M Apr 5 2019 database100mers.kmer_distrib -rw-rw-r-- 1 anniz44 mit_alm 7.9M Apr 5 2019 database100mers.kraken -rw-rw-r-- 1 anniz44 mit_alm 13G Apr 5 2019 database.kraken -rw-rw-r-- 1 anniz44 mitalm 30G Apr 5 2019 hash.k2d

Best regards, Anni

LeabaeL commented 1 year ago

Hi Anni:

There seem to be a lot of problems with the installation of kraken2, I have tried several times but the download progress is interrupted every time. And the database that has been downloaded is not the same as the one listed in the screenshot you provided, both in terms of file size and file name.

I don't know if you can provide a tree diagram of the files contained in the kraken2 database, but I'm going to build the database in steps, as I've noticed in the kraken2 discussion that a lot of people have the same problem as me, and the author hasn't answered it yet

Best regards, LeabaeL

caozhichongchong commented 1 year ago

Hi LeabaeL,

Good to know!

Unfortunately, I don't have tree command installed on our computer cluster and I don't have admin to it :( Here I'm listing files under my kraken db. Alternatively, you might be able to use a prebuilt kraken2 database gtdb_r89_54k_kraken2_16gb.tar or MiniKraken database. _[anniz44@c3ddb-globus ~]$ ls -lha /scratch/users/anniz44/scripts/database/krakendb/ total 42G drwxrwxr-x 4 anniz44 mit_alm 4.0K Apr 5 2019 . drwxrwxr-x 7 anniz44 anniz44 32K Dec 11 13:32 .. -rw-rw-r-- 1 anniz44 mit_alm 1.8M Apr 5 2019 database100mers.kmer_distrib -rw-rw-r-- 1 anniz44 mit_alm 7.9M Apr 5 2019 database100mers.kraken -rw-rw-r-- 1 anniz44 mit_alm 13G Apr 5 2019 database.kraken -rw-rw-r-- 1 anniz44 mit_alm 30G Apr 5 2019 hash.k2d drwxrwxr-x 7 anniz44 mit_alm 4.0K Nov 5 2019 library -rw-rw-r-- 1 anniz44 mit_alm 756K Apr 5 2019 ncbi_sequence_id_tax_id.map -rw------- 1 anniz44 mit_alm 0 Apr 5 2019 nohup.out -rw-rw-r-- 1 anniz44 mit_alm 48 Apr 5 2019 opts.k2d -rw-rw-r-- 1 anniz44 mit_alm 1.5M Apr 5 2019 seqid2taxid.map -rw-rw-r-- 1 anniz44 mit_alm 1.7M Apr 5 2019 taxo.k2d drwxrwxr-x 2 anniz44 mitalm 4.0K Apr 5 2019 taxonomy

Hope it helps! It would be great if you can keep me updated and let me know if there's anything I can help with :)

Best regards, Anni

LeabaeL commented 1 year ago

Hi Anni:

I finally ran the same result as the README file.

Here is the environment required:

python = 3.8.16 diamond = 0.9.36 blast = 2.13.0 arg_ranker = 3.0.2 kraken2 = 2.1.2 microbecensus = 1.1.0

Also, I find a website that can download k2_standard_16gb as you offered link that can't download data stably in my lab.

I then also investigated the effect of different versions of Diamond on the annotation results and could find that the most important changes occurred in version 2.0.0, so I investigated the reasons for the changes.

As you can see, Diamond version 2.0.0 and above have improved sensitivity-related parameters, which may be the main reason for the differences in alignment results.

8be62dbf4e37e8fa6d79cc8dd4a6959

However, Diamond version 0.9.36 has significant issues, so I don't know if you will have any suggestions for modifying the Diamond version used by arg_ranker in the future.

caf7817ac59d6e53ecc55ab26c2b6c9

These are the results of my tests using different versions of Diamond and running them with either the Kraken database, the PlusPF database (which includes the 16GB standard database), or the 16GB standard database.

a8b205b96190c20861bfdfcadd043b0

I hope to receive more information about the use of Diamond versions in the future, and thank you for your patient answers during this time!

Best regards, LeabaeL

LeabaeL commented 1 year ago

Hi Anni:

I also noticed, as mentioned in this answer #12 , that the number of ARGs has been updated to 12746. However, it seems that the number of ARGs in the arg_ranker database of version 3.0.2 is only 4050. Could you please tell me how to locally deploy the latest classification database for data analysis?

Best regards, LeabaeL

caozhichongchong commented 1 year ago

Hi LeabaeL,

Thank you for your dedication to investigating different versions of Diamond and their performance in identifying matches.

Based on the description that diamond version 0.9.36 might give us incomplete results (also shown in your previous investigation), higher versions of diamond should output higher number of matches. It is intriguing to note that your tests using different versions of diamond shows the opposite trend when a kraken database was provided - a 1000-fold lower numbers of ARGs when using higher versions of diamond.

Could you please check whether the MicrobeCensus results and kraken results stay the same for your tests? _grep 'average_genome_size' arg_ranking/search_output/WEE300_all-trimmed-decont_1.fastq.AGS.txt grep 'Bacteria' arg_ranking/search_output/WEE300_all-trimmed-decont1.fastq.kraken.kreport

I'll investigate into the new ARG database mentioned in #12, and hopefully release an arg_ranker with the updated database :)

Thank you again!

Best regards, Anni

caozhichongchong commented 1 year ago

I just released arg_ranker v3.1 that incorporates the updated database SARGv3-S :) It's still undergoing testing, please feel free to contact me if you have any questions!

LeabaeL commented 1 year ago

Hi Anni:

Thank you for your prompt reply and update of the database.

However, I've just upgraded to 3.1 and it doesn't seem to be working properly to produce comment results.

The terminal displays the following error:

Traceback (most recent call last):
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/arg_ranker/bin/ARG_table.sum.py", line 93, in <module>
    sum_ARG(allsearchoutput)
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/arg_ranker/bin/ARG_table.sum.py", line 75, in sum_ARG
    sampleARG[ARG] += 1.0 / copy_16S / ARG_length[ARG] * gene_length  # normalize against ARG length and 16S/genome length
KeyError: 'NC_011586.7045444.p01'
------------------------------------------------------------------------
arg_ranker evaluates the risk of ARGs in genomes/metagenomes
Requirement: blast, diamond, kraken, python 3
Copyright:An-Ni Zhang, MIT; Prof. Tong Zhang, University of Hong Kong
Contact caozhichongchong@gmail.com
------------------------------------------------------------------------

Traceback (most recent call last):
  File "/home/haorui/anaconda3/envs/arg_ranker/bin/arg_ranker", line 8, in <module>
    sys.exit(main())
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/arg_ranker/__main__.py", line 289, in main
    ranking_arg(inputfasta,Metadata)
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/arg_ranker/__main__.py", line 261, in ranking_arg
    df = pd.read_csv(Mothertable, index_col=None, header=None, skipinitialspace=True, sep='\t')
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 605, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1735, in _make_engine
    self.handles = get_handle(
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/pandas/io/common.py", line 856, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/home/haorui/0936/Sample_ARGpresence.txt'

Also I checked the results of testing with arg_ranker v3.0.2 under diamond version 2.1.6 and the AGS data remains consistent except that the kraken data is slightly different from what you provided earlier.

(base) [haorui@localhost search_output]$ grep 'average_genome_size' WEE300_all-trimmed-decont_1.fastq.AGS.txt 
average_genome_size:    5504511.121548271
(base) [haorui@localhost search_output]$ grep 'Bacteria' WEE300_all-trimmed-decont_1.fastq.kraken.kreport 
 34.06  421477  2797    D   2       Bacteria
  0.00  5   0   D1  2323          Bacteria incertae sedis
  0.00  5   0   D2  1783234         Bacteria candidate phyla

Best regards, LeabaeL

caozhichongchong commented 1 year ago

Oops! I forgot to update the length file for ARGs.

Thank you so much for testing it, LeabaeL! I updated arg_ranker v3.2 :)

Best regards, Anni

LeabaeL commented 1 year ago

Hi Anni:

Thank you for your prompt reply, but it seems that v3.2 has the same errors as v3.1.😂

Traceback (most recent call last):
  File "/home/haorui/anaconda3/envs/arg_ranker/bin/arg_ranker", line 8, in <module>
    sys.exit(main())
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/arg_ranker/__main__.py", line 289, in main
    ranking_arg(inputfasta,Metadata)
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/arg_ranker/__main__.py", line 261, in ranking_arg
    df = pd.read_csv(Mothertable, index_col=None, header=None, skipinitialspace=True, sep='\t')
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 605, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1735, in _make_engine
    self.handles = get_handle(
  File "/home/haorui/anaconda3/envs/arg_ranker/lib/python3.8/site-packages/pandas/io/common.py", line 856, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/home/haorui/0936/Sample_ARGpresence.txt'

Best regards, LeabaeL

caozhichongchong commented 1 year ago

Thank you for testing it, LeabaeL!

I just updated v3.3 fixing a bug that there are no unassessed ARGs in the new ARG database, :) though I can't reproduce the "FileNotFoundError". It might be helpful to remove the "arg_ranking" folder before testing v3.3.

Best regards, Anni

LeabaeL commented 1 year ago

Hi Anni:

Thank you very much for your prompt replies over the days, I wasn't able to find the root cause of the file build error, but found a possible solution: recreate a virtual environment in conda for configuration.

Here is the environment I have configured, and the results of running the demo, which I think can be used as a reference:

diamond_version kraken arg_ranker Rank_I_per Rank_II_per Rank_III_per Rank_IV_per Unassessed_per Total_abu Rank_code Sample
0.9.36 README 3.0.2 4.6E-02 0.0E+00 6.8E-02 7.5E-01 1.3E-01 1.9E+00 1.5-0.0-0.4-1.7-0.4 WEE300_all-trimmed-decont_1.fastq
0.9.36 16GB 3.3 8.0E-02 2.6E-02 2.0E-01 7.0E-01 0.0E+00 1.5E+00 3.8-1.7-1.0-0.9-0.0 WEE300_all-trimmed-decont_1.fastq
2.1.6 16GB 3.3 6.6E-02 2.1E-02 2.2E-01 6.9E-01 0.0E+00 5.5E+00 3.1-1.3-1.1-0.9-0.0 WEE300_all-trimmed-decont_1.fastq
0.9.36 README 3.0.2 0.0E+00 0.0E+00 2.4E-01 7.6E-01 0.0E+00 2.1E+01 0.0-0.0-1.6-1.7-0.0 EsCo_genome.fasta
0.9.36 16GB 3.3 7.1E-02 0.0E+00 2.1E-01 7.1E-01 0.0E+00 1.4E+01 3.3-0.0-1.1-0.9-0.0 EsCo_genome.fasta
2.1.6 16GB 3.3 7.1E-02 0.0E+00 2.1E-01 7.1E-01 0.0E+00 1.4E+01 3.3-0.0-1.1-0.9-0.0 EsCo_genome.fasta

Among other things, the specific environment configuration and some of the previously communicated parameters are as follows:

  1. diamond = 0.9.36
(base) [haorui@localhost search_output]$ wc -l WEE300_all-trimmed-decont*
      13 WEE300_all-trimmed-decont_1.fastq.AGS.txt
   25685 WEE300_all-trimmed-decont_1.fastq.blast.txt
     306 WEE300_all-trimmed-decont_1.fastq.blast.txt.filter
     345 WEE300_all-trimmed-decont_1.fastq.diamond.txt
     690 WEE300_all-trimmed-decont_1.fastq.diamond.txt.aa
 1237469 WEE300_all-trimmed-decont_1.fastq.kraken
    6323 WEE300_all-trimmed-decont_1.fastq.kraken.kreport
 1270831 total

(base) [haorui@localhost search_output]$ grep 'Bacteria' WEE300_all-trimmed-decont_1.fastq.kraken.kreport
 34.06  421477  2797    D   2       Bacteria
  0.00  5   0   D1  2323          Bacteria incertae sedis
  0.00  5   0   D2  1783234         Bacteria candidate phyla

(base) [haorui@localhost search_output]$ grep 'average_genome_size' WEE300_all-trimmed-decont_1.fastq.AGS.txt
average_genome_size:    5504511.121548271
  1. diamond = 2.1.6
(base) [haorui@localhost search_output]$ wc -l WEE300_all-trimmed-decont*
      13 WEE300_all-trimmed-decont_1.fastq.AGS.txt
   97904 WEE300_all-trimmed-decont_1.fastq.blast.txt
    1085 WEE300_all-trimmed-decont_1.fastq.blast.txt.filter
    1244 WEE300_all-trimmed-decont_1.fastq.diamond.txt
    2488 WEE300_all-trimmed-decont_1.fastq.diamond.txt.aa
 1237469 WEE300_all-trimmed-decont_1.fastq.kraken
    6323 WEE300_all-trimmed-decont_1.fastq.kraken.kreport
 1346526 total

(base) [haorui@localhost search_output]$ grep 'Bacteria' WEE300_all-trimmed-decont_1.fastq.kraken.kreport
 34.06  421477  2797    D   2       Bacteria
  0.00  5   0   D1  2323          Bacteria incertae sedis
  0.00  5   0   D2  1783234         Bacteria candidate phyla

grep 'average_genome_size' WEE300_all-trimmed-decont_1.fastq.AGS.txt
average_genome_size:    5504511.121548271

Best regards, LeabaeL

caozhichongchong commented 1 year ago

Hi LeabaeL,

Thank you so much for testing different versions! The information you provided is super helpful to all users and I'll add it into the README. I think the results of diamond v 2.1.6 makes way more sense now, so I would recommend using this version.

It has been quite fun to work together on this! Please feel free to reach out if you have further questions. Good luck with your research :)

Best regards, Anni

caozhichongchong commented 1 year ago

Btw, I just acknowledged you in the README :)