AnantharamanLab / ViWrap

A wrapper to identify, bin, classify, and predict host-viral relationship for viruses

ViWrap

ViWrap: A modular pipeline to identify, bin, classify, and predict viral-host relationships for viruses from metagenomes

Oct 2022   
Zhichao Zhou  
zzhou388@wisc.edu & zczhou2017@gmail.com  
Anantharaman Lab
Department of Bacteriology
University of Wisconsin-Madison  

Current Version

ViWrap v1.3.0

Citation

If you find ViWrap useful, please consider citing our manuscript in iMeta:

Zhou, Zhichao, Martin, Cody, Kosmopoulos, James C., and Anantharaman, Karthik. 2023. “ViWrap: A Modular Pipeline to Identify, Bin, Classify, and Predict Viral–Host Relationships for Viruses from Metagenomes.” iMeta 2, e118. https://doi.org/10.1002/imt2.118

Table of Contents:

  1. Updates
  2. Program Description
  3. Installation
  4. Settings
  5. Running ViWrap
    • ViWrap tasks
    • Flag explanations
  6. Output Explanations
  7. Contact

Updates for v1.3.0 (Dec 2023):

--Updated on Dec 9, 2023

[Correction]

(1) Correct the VirSorter2 + CheckV pipeline for getting viral scaffolds: use the "combined.fna" file from the CheckV result dir.

(2) Add "scf2lytic_or_lyso.summary.txt" to the VirSorter result dir to facilitate the downstream generation of "modified vRhyme_best_bins".

(3) Delete the scaffolds that are not related to viruses from "vRhyme_input_coverage.txt".

(4) Correct the 'if " " in line or "\t" in line:' check in multiple scripts so that sequence headers are broken at the first " " or "\t".
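
The header handling in item (4) can be illustrated with a minimal sketch. The function name header_id is hypothetical and is not taken from the ViWrap scripts; it only shows the idea of cutting a FASTA header at the first space or tab, whichever comes first.

```python
def header_id(header_line: str) -> str:
    """Return the sequence ID: the header text up to the first space or tab."""
    # Drop the trailing newline and the leading '>' of a FASTA header line.
    header = header_line.rstrip("\n").lstrip(">")
    for i, ch in enumerate(header):
        if ch == " " or ch == "\t":
            return header[:i]  # break at the first separator encountered
    return header  # no separator: the whole header is the ID


print(header_id(">scaffold_12 length=5321\tcov=8.1"))
print(header_id(">scaffold_13\tlength=990 cov=2.0"))
```

Scanning character by character guarantees that the cut happens at whichever separator appears first, regardless of whether it is a space or a tab.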

--Updated on Dec 11 and 12, 2023

[Improvement]

(1) Do not split the fasta file; split the faa file instead in the "run_annotate_by_VIBRANT_db.py" script (for the virus identification methods 'vs' and 'dvf').

(2) For the virus identification method 'genomad', the faa file is the one annotated with pyrodigal-gv by geNomad. The ffn file is also based on this faa file.

(3) Add an AMG filtering step to the script.

--Updated on Dec 13 and 14, 2023

[Improvement]

(1) Update the combine_iphop_results function in module.py to store only the single host prediction result with the highest confidence score among all potential results.
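
As a rough illustration of keeping only the highest-confidence host prediction per virus, consider the sketch below. It is not the actual combine_iphop_results code: the function name best_host_per_virus, the (virus, host, score) tuple layout, and the example values are all assumptions made for this example.

```python
def best_host_per_virus(predictions):
    """Keep, for each virus, only the prediction with the highest confidence score.

    `predictions` is an iterable of (virus, host, score) tuples
    (a hypothetical layout, not the iPHoP output format).
    """
    best = {}
    for virus, host, score in predictions:
        # Replace the stored prediction only when the new score is strictly higher,
        # so on ties the first prediction seen is kept.
        if virus not in best or score > best[virus][1]:
            best[virus] = (host, score)
    return best


rows = [
    ("virus_A", "host_genus_1", 92.1),
    ("virus_A", "host_genus_2", 95.4),
    ("virus_B", "host_genus_3", 90.3),
]
print(best_host_per_virus(rows))
```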

[Correction]

(1) Add "Unclassified metabolism" to the calculation of step 4 KO metabolism relative abundance (in function "generate_result_visualization_inputs"), since some KOs might not have corresponding KO metabolisms according to the db.

(2) Add custom MAG viral scaffold filtering steps to "master_run.py". There are two filtering criteria: 1. Viral scaffolds from MAGs that are identified by geNomad are removed from the MAGs. 2. Viral scaffolds that contain proviruses with a total region length >= 85% of the whole scaffold are removed from the MAGs, since they are very likely to be mistakenly binned viral scaffolds.
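
The two filtering criteria in item (2) can be sketched as follows. This is a simplified illustration, not the code in "master_run.py": the function name, the input dictionaries, and the assumption that provirus regions are non-overlapping 1-based inclusive coordinates are all hypothetical.

```python
def scaffolds_to_remove(scaffold_lengths, genomad_viral, provirus_regions, threshold=0.85):
    """Return the MAG scaffolds flagged by either filtering criterion.

    scaffold_lengths: dict of scaffold name -> length in bp
    genomad_viral:    set of scaffold names identified as viral by geNomad
    provirus_regions: dict of scaffold name -> list of (start, end) regions,
                      assumed 1-based, inclusive, and non-overlapping
    """
    removed = set()
    for scaffold, length in scaffold_lengths.items():
        # Criterion 1: the scaffold was identified as viral by geNomad.
        if scaffold in genomad_viral:
            removed.add(scaffold)
            continue
        # Criterion 2: provirus regions cover >= 85% of the whole scaffold.
        covered = sum(end - start + 1 for start, end in provirus_regions.get(scaffold, []))
        if length > 0 and covered / length >= threshold:
            removed.add(scaffold)
    return removed


lengths = {"scf_1": 10000, "scf_2": 10000, "scf_3": 10000}
print(sorted(scaffolds_to_remove(lengths, {"scf_1"}, {"scf_2": [(1, 9000)], "scf_3": [(1, 2000)]})))
```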

--Updated on Dec 18, 2023

[Improvement]

(1) Update the taxonomical classification method:

Modify the script so that the priority additionally takes into account the lowest taxonomic rank that is not 'NA'. Currently, the script sets the priority as follows:

  1. NCBI RefSeq viral protein searching
  2. Marker VOG HMM searching
  3. vContact2 clustering
  4. geNomad taxonomy

However, if there are multiple hits by different methods, the method that resolves the taxonomy to the lowest rank that is not 'NA' is used.
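
One possible reading of this rule is sketched below. It is an interpretation, not the ViWrap implementation: the function names, the representation of a lineage as a list ordered from the highest to the lowest rank, and the tie-breaking by priority order are assumptions. The lineages in the example are illustrative only.

```python
# Method names in the priority order given above.
PRIORITY = [
    "NCBI RefSeq viral protein searching",
    "Marker VOG HMM searching",
    "vContact2 clustering",
    "geNomad taxonomy",
]


def lowest_rank_index(lineage):
    """Index of the deepest rank that is not 'NA' (-1 if every rank is 'NA')."""
    depth = -1
    for i, name in enumerate(lineage):
        if name != "NA":
            depth = i
    return depth


def pick_taxonomy(hits):
    """hits: dict of method name -> lineage list. Return (method, lineage) or None."""
    best = None
    for method in PRIORITY:  # priority order, so ties keep the earlier method
        if method not in hits:
            continue
        depth = lowest_rank_index(hits[method])
        if depth >= 0 and (best is None or depth > best[0]):
            best = (depth, method, hits[method])
    return None if best is None else (best[1], best[2])


hits = {
    "NCBI RefSeq viral protein searching": ["Duplodnaviria", "Heunggongvirae", "Uroviricota", "Caudoviricetes", "NA", "NA"],
    "geNomad taxonomy": ["Duplodnaviria", "Heunggongvirae", "Uroviricota", "Caudoviricetes", "Crassvirales", "NA"],
}
print(pick_taxonomy(hits)[0])
```

In this example the geNomad hit is chosen even though it has the lowest priority, because it is the only one resolved below the class level.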

Updates for v1.2.1 (Jan 2023):

Updates for v1.2.0 (Oct 2022):

Updates for v1.1.0 (Sep 2022):

Updates for v1.0.0 (Sep 2022):


Program Description

ViWrap Description

ViWrap is a wrapper to identify, bin, classify, and predict host-viral relationships for viruses from metagenomes. It leverages the advantages of currently available virus analysis tools and provides a quick, intuitive, one-step pipeline to get viral sequences and their corresponding properties.

Note:

ViWrap is an integrated wrapper/pipeline. The main contributors of each virus identification, binning, classification, and viral host prediction software within it should be acknowledged (citations and links are provided):

geNomad: link to online paper

Camargo, Antonio Pedro, Simon Roux, Frederik Schulz, Michal Babinski, Yan Xu, Bin Hu, Patrick SG Chain, Stephen Nayfach, and Nikos C. Kyrpides. "Identification of mobile genetic elements with geNomad." Nature Biotechnology (2023): 1-10. 

VIBRANT: link to online paper

Kieft, Kristopher, Zhichao Zhou, and Karthik Anantharaman. "VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences." Microbiome 8, no. 1 (2020): 1-23. 

VirSorter2: link to online paper

Guo, Jiarong, Ben Bolduc, Ahmed A. Zayed, Arvind Varsani, Guillermo Dominguez-Huerta, Tom O. Delmont, Akbar Adjie Pratama et al. "VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses." Microbiome 9 (2021): 1-13.

DeepVirFinder: link to online paper

Ren, Jie, Kai Song, Chao Deng, Nathan A. Ahlgren, Jed A. Fuhrman, Yi Li, Xiaohui Xie, Ryan Poplin, and Fengzhu Sun. "Identifying viruses from metagenomic data using deep learning." Quantitative Biology 8 (2020): 64-77.

vContact2: link to online paper

Bin Jang, Ho, Benjamin Bolduc, Olivier Zablocki, Jens H. Kuhn, Simon Roux, Evelien M. Adriaenssens, J. Rodney Brister et al. "Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks." Nature biotechnology 37, no. 6 (2019): 632-639.

vRhyme: link to online paper

Kieft, Kristopher, Alyssa Adams, Rauf Salamzade, Lindsay Kalan, and Karthik Anantharaman. "vRhyme enables binning of viral genomes from metagenomes." Nucleic Acids Research 50, no. 14 (2022): e83-e83.

iPHoP (and software within it): link to online paper

Roux, Simon, Antonio Pedro Camargo, Felipe Hernandes Coutinho, Shareef M. Dabdoub, Bas E. Dutilh, Stephen Nayfach, and Andrew Tritt. "iPHoP: an integrated machine-learning framework to maximize host prediction for metagenome-assembled virus genomes." bioRxiv (2022): 2022-07.

ViWrap Features


Installation

Step 1 Set up the conda environment for ViWrap

Since ViWrap has many dependencies, it is much easier to set up a conda environment than to install all dependencies in the global environment (make sure conda, i.e., miniconda3 or anaconda3, is already installed on your server; we only recommend running with conda version 3.0+). There are 12 conda envs associated with ViWrap, so it may be useful to keep them together in a directory separate from your normal conda installation, referred to below as /path/to/ViWrap_conda_environments. Note: ensure that the location where you install the ViWrap conda envs has at least 17 GB of storage available. ViWrap was tested extensively with environments installed separately from the normal conda installation.

Choose one:

  1. Install in separate directory:

    1. conda create -c bioconda -c conda-forge -p /path/to/ViWrap_conda_environments/ViWrap python=3.8 biopython=1.80 mamba=1.3.0 numpy=1.24.2 pandas=1.5.3 pyfastx=0.8.4 matplotlib=3.6.3 seaborn=0.12.2 diamond=2.0.15 hmmer=3.3.2
    2. conda activate /path/to/ViWrap_conda_environments/ViWrap

    Note: /path/to/ViWrap_conda_environments is the directory that you will use to store all conda environments for ViWrap

  2. Install in normal conda folder

    1. conda create -c bioconda -c conda-forge -n ViWrap python=3.8 biopython=1.80 mamba=1.3.0 numpy=1.24.2 pandas=1.5.3 pyfastx=0.8.4 matplotlib=3.6.3 seaborn=0.12.2 diamond=2.0.15 hmmer=3.3.2
    2. conda activate ViWrap

    If you choose this route, you will just need to use the path to your ViWrap conda installation. It will look something like this: /path/to/miniconda3/envs/ViWrap/

Step 2 GitHub installation

  1. git clone https://github.com/AnantharamanLab/ViWrap

  2. cd ViWrap

  3. chmod +x ViWrap scripts/*.py # Make all Python scripts executable

  4. PATH=`pwd`:$PATH # Add ViWrap to the PATH, so it can be called elsewhere in a terminal

Step 2 (alternative) Download the ViWrap package and install

  1. wget -c https://github.com/AnantharamanLab/ViWrap/archive/refs/tags/v1.3.0.tar.gz

  2. tar xzf v1.3.0.tar.gz;rm v1.3.0.tar.gz

  3. cd ViWrap-1.3.0;chmod +x ViWrap scripts/*.py # Make all Python scripts executable

  4. PATH=`pwd`:$PATH # Add ViWrap to the PATH, so it can be called elsewhere in a terminal

    ("v1.3.0" should be replaced with the latest version)

Step 3 Set up the other conda environments required by ViWrap

ViWrap set_up_env --conda_env_dir /path/to/ViWrap_conda_environments

This will take several minutes depending on your current internet speed.

Note: /path/to/ViWrap_conda_environments can be anywhere on your server and will contain the ViWrap conda environments. If you used -n ViWrap in Step 1, pass /path/to/miniconda3/envs to the --conda_env_dir option.

ViWrap will use the "-p" or "--prefix" option to specify where to write the environment files:

(Details in https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#specifying-a-location-for-an-environment)

For example, conda create --prefix /tmp/test-env python=3.8 will create an environment at /tmp/test-env, which resides in /tmp/ instead of the default .conda.

The following 12 conda environments will be set up. The estimated running time is ~10 minutes, depending on your current internet speed:

625M    ./ViWrap-Mapping
772M    ./ViWrap-vRhyme
4.0G    ./ViWrap-iPHoP
1.7G    ./ViWrap-DVF
271M    ./ViWrap-vs2
390M    ./ViWrap-GTDBTk
540M    ./ViWrap-dRep
102M    ./ViWrap-CheckV
1.6G    ./ViWrap-vContact2
88M     ./ViWrap-Tax
153M    ./ViWrap-VIBRANT
1.6G    ./ViWrap-geNomad

Note: We have fixed the versions of Python modules to prevent potential errors caused by version upgrades. If you encounter any issues with these conda environments, please verify the module versions set in "scripts/master_set_up_env.py".
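
To confirm that all 12 environment directories listed above exist after setup, a small check along the following lines may help. This is only a sketch: the helper name missing_envs is hypothetical, and it assumes that each environment is a directory with the listed name directly under the --conda_env_dir location.

```python
from pathlib import Path

# Environment directory names as listed above.
EXPECTED_ENVS = [
    "ViWrap-Mapping", "ViWrap-vRhyme", "ViWrap-iPHoP", "ViWrap-DVF",
    "ViWrap-vs2", "ViWrap-GTDBTk", "ViWrap-dRep", "ViWrap-CheckV",
    "ViWrap-vContact2", "ViWrap-Tax", "ViWrap-VIBRANT", "ViWrap-geNomad",
]


def missing_envs(conda_env_dir):
    """Return the expected environment names that are not directories under conda_env_dir."""
    base = Path(conda_env_dir)
    return [name for name in EXPECTED_ENVS if not (base / name).is_dir()]


# Example (replace the path with your own location):
# print(missing_envs("/path/to/ViWrap_conda_environments"))
```

An empty list means every expected directory is present; it does not verify that the packages inside each environment were installed correctly.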

Step 4 Set up ViWrap database

ViWrap download --db_dir /path/to/ViWrap_db  --conda_env_dir /path/to/ViWrap_conda_environments

/path/to/ViWrap_db is where the ViWrap database is stored. Please make sure there is enough space to store the database (at least ~430G). Setup takes ~4 hours, depending on your current internet speed. This is somewhat tedious, but you only need to do it once.

The database contains the following 8 folders (listed by running du -h --max-depth=1 ./ within the "ViWrap_db" directory):

11G     ./VIBRANT_db
6.4G    ./CheckV_db
114M    ./DVF_db
829M    ./Tax_classification_db
318G    ./iPHoP_db
11G     ./VirSorter2_db
82G     ./GTDB_db
1.4G    ./genomad_db
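
Because the database needs roughly 430G, it can be worth checking the free space at the target location before starting the download. The sketch below uses only the Python standard library; the function name has_enough_space is hypothetical, and the 430 figure is simply the approximate size stated above, interpreted here as GiB.

```python
import shutil

REQUIRED_GB = 430  # approximate database size stated above


def has_enough_space(path, required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has at least required_gb GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


# Example: check an existing directory where the database is to be stored.
# The current directory is used here only as a placeholder.
print(has_enough_space("."))
```

The path passed in must already exist, for example the parent directory of the intended ViWrap_db folder.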

Notes:

1) Since some software (VirSorter2) needs to write the database location into its conda environment, it is suggested to set up the environments first and then set up the databases.

2) Once you have replaced any conda environments, it is better to re-check/re-install the corresponding conda environments (especially for the case of VirSorter2)

3) Since some db folders are accessible only to their creator, if anyone else in the group wants to use ViWrap, the db folder permissions should be opened by using chmod -R 777 ./

Step 5 See ViWrap help

  1. ViWrap -h
  2. ViWrap run -h

Settings

Settings for v1.3.0


Running ViWrap

ViWrap tasks

Flag explanations

Test run

# TEST 1: Only use reads and metagenomic assembly
# example code for testing
# (change --db_dir and --conda_env_dir according to your case):
ViWrap run  --input_metagenome test_metaG.fasta \
            --input_reads reads_1.fastq.gz,reads_2.fastq.gz \
            --out_dir  test_metaG_ViWrap_out \
            --db_dir /storage1/data11/ViWrap/ViWrap_db \
            --identify_method vb-vs \
            --conda_env_dir /slowdata/yml_environments \
            --threads 20 \
            --input_length_limit 5000

# The total running time for TEST 1 is about 2 hrs  

# TEST 2: Use reads, metagenomic assembly, and custom MAGs (binned from the same metagenome)
# example code for testing
# (change --db_dir and --conda_env_dir according to your case):
ViWrap run  --input_metagenome test_metaG.fasta \
            --input_reads reads_1.fastq.gz,reads_2.fastq.gz \
            --out_dir  test_metaG_ViWrap_out \
            --db_dir /storage1/data11/ViWrap/ViWrap_db \
            --identify_method vb \
            --conda_env_dir /slowdata/yml_environments \
            --threads 20 \
            --input_length_limit 5000 \
            --custom_MAGs_dir custom_MAGs_dir \
            --iPHoP_db_custom iPHoP_db_custom

# The total running time for TEST 2 is about 19 hrs             
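
When several samples are processed, it may be convenient to assemble the same command from a script. The following is a minimal sketch that only builds the argument list using the flags from the test runs above; the function name build_viwrap_run is hypothetical, and the paths are the placeholders from the examples.

```python
import shlex


def build_viwrap_run(metagenome, reads, out_dir, db_dir, conda_env_dir,
                     identify_method, threads, input_length_limit):
    """Build the argument list for a 'ViWrap run' call (flags as in the test runs above)."""
    return [
        "ViWrap", "run",
        "--input_metagenome", metagenome,
        "--input_reads", ",".join(reads),  # read files are passed comma-separated
        "--out_dir", out_dir,
        "--db_dir", db_dir,
        "--identify_method", identify_method,
        "--conda_env_dir", conda_env_dir,
        "--threads", str(threads),
        "--input_length_limit", str(input_length_limit),
    ]


argv = build_viwrap_run(
    "test_metaG.fasta", ["reads_1.fastq.gz", "reads_2.fastq.gz"],
    "test_metaG_ViWrap_out", "/storage1/data11/ViWrap/ViWrap_db",
    "/slowdata/yml_environments", "vb-vs", 20, 5000,
)
print(shlex.join(argv))  # print the command for inspection
```

Passing the list to subprocess.run(argv, check=True) would launch the pipeline without going through a shell, which avoids the quoting and line-continuation pitfalls of hand-written multi-line commands.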

Output Explanations

All result folders

Hierarchy in 08_ViWrap_summary_outdir

Hierarchy in 09_Virus_statistics_visualization


Notes


Contact

Please contact Zhichao Zhou (zczhou2017@gmail.com or GitHub Issues) with any questions, concerns or comments.

Thank you for using ViWrap!

___________________________________________
 __ __  ____  __    __  ____    ____  ____  
|  |  ||    ||  |__|  ||    \  /    ||    \ 
|  |  | |  | |  |  |  ||  D  )|  o  ||  o  )
|  |  | |  | |  |  |  ||    / |     ||   _/ 
|  :  | |  | |  `  '  ||    \ |  _  ||  |   
 \   /  |  |  \      / |  .  \|  |  ||  |   
  \_/  |____|  \_/\_/  |__|\_||__|__||__|   
___________________________________________

Copyright

ViWrap Copyright (C) 2022 Zhichao Zhou

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.