ZiyueYang01 / VirID

VirID: An integrated platform for the discovery and characterization of RNA Viruses
MIT License

VirID: Beyond Virus Discovery - An Integrated Platform for Comprehensive RNA Virus Characterization



RNA viruses exhibit vast phylogenetic diversity and can significantly impact public health and agriculture. However, current bioinformatics tools for viral discovery from metagenomic data frequently generate false positive virus results, overestimate viral diversity, and misclassify virus sequences. Additionally, current tools often fail to determine virus-host associations, which hampers investigation of the potential threat posed by a newly detected virus.

To address these issues we developed VirID, a software tool specifically designed for the discovery and characterization of RNA viruses from metagenomic data.

VirID is built on a comprehensive RNA-dependent RNA polymerase (RdRP) database, which supports a workflow comprising RNA virus discovery, phylogenetic analysis, and phylogeny-based virus characterization. Benchmark tests on a simulated data set demonstrated that VirID had high accuracy in profiling viruses and estimating viral richness.

In evaluations with real-world samples, VirID not only identified RNA viruses of all types but also provided accurate estimates of viral genetic diversity and virus classification, as well as comprehensive insights into virus associations with humans, animals, and plants. VirID therefore offers a robust tool for virus discovery and serves as a valuable resource in basic virological studies, pathogen surveillance, and early warning systems for infectious disease outbreaks.


Associated papers

Yang Z, Shan Y, Liu X, Chen G, Pan Y, Gou Q, Zou J, Chang Z, Zeng Q, Yang C, et al. 2024. VirID: Beyond Virus Discovery - An Integrated Platform for Comprehensive RNA Virus Characterization. Molecular Biology and Evolution.

Our group has also developed LucaProt, an artificial intelligence model for identifying RNA virus RdRP protein sequences, which is now published in Cell.


Update logs

If you encounter problems during use, feel free to raise an issue.


VirID workflow

The VirID framework for automated RNA virus detection comprises three main stages: (i) RNA virus discovery, (ii) phylogenetic analysis, and (iii) phylogeny-based virus characterization. Its outputs include viral sequences, phylogenetic trees, and summary information such as sequence length, the best BLASTx match, virus classification, and host association.

Installation

Step 1: Install conda and third-party dependencies

VirID requires third-party packages from the conda-forge and bioconda channels:

conda create -n VirID
conda activate VirID
conda install -c conda-forge -c bioconda blast bbmap seqkit mafft megahit trimal pplacer taxonkit bowtie2
conda install -c conda-forge -c bioconda fastp diamond==2.1.4 samtools==1.16.1
pip install Bio biopython DendroPy matplotlib numpy pandas regex seaborn tqdm
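After installation, it can be useful to confirm that the command-line tools are visible in the active environment. The following is a minimal sketch and not part of VirID: the helper name `missing_tools` is hypothetical, and the executable names are assumptions derived from the conda packages above (for example, the `blast` package provides `blastn` and `blastx`, and `bbmap` provides `bbmap.sh`).

```python
import shutil

# Executable names assumed from the conda packages installed above;
# adjust the list if your installation exposes different names.
EXPECTED_TOOLS = [
    "blastn", "blastx", "bbmap.sh", "seqkit", "mafft", "megahit",
    "trimal", "pplacer", "taxonkit", "bowtie2", "fastp", "diamond",
    "samtools",
]


def missing_tools(tools=EXPECTED_TOOLS, which=shutil.which):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All expected tools were found.")
```

Run it inside the activated `VirID` environment; any name it reports usually means the corresponding conda package did not install cleanly.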

Notes:

Step 2: Install VirID via pip

All Python dependencies will be downloaded automatically!

pip install VirID

Step 3: Install R and the required R packages

# Start an R session, then install the required R packages
R
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("ggtree")
packages <- c("tidyverse", "ggplot2", "RColorBrewer", "phangorn", "networkD3", "jsonlite", "dplyr")
# Install any packages that are missing, then load all of them
ipak <- function(pkg) {
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg))
        install.packages(new.pkg)
    sapply(pkg, require, character.only = TRUE)
}
ipak(packages)

Notes: tidyverse depends on systemfonts. If the tidyverse installation fails, you may need to install systemfonts through conda first:

conda install r-systemfonts

Step 4: Download and configure the databases

VirID requires an environment variable named VirID_DB_PATH, which points to the parent directory of the databases listed below.

See below for specific database configurations.

# Set the VirID_DB_PATH environment variable
export VirID_DB_PATH=/path/to/the/database/
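Before starting a long run, you may want to check that the variable is actually set in the current shell and points to an existing directory. This is a minimal sketch under that assumption only; the helper name `resolve_db_path` is hypothetical and is not part of the VirID API.

```python
import os
from pathlib import Path


def resolve_db_path(env=os.environ):
    """Return VirID_DB_PATH as a Path, or raise if it is unset or not a directory."""
    value = env.get("VirID_DB_PATH")
    if not value:
        raise RuntimeError("VirID_DB_PATH is not set")
    path = Path(value)
    if not path.is_dir():
        raise RuntimeError(f"VirID_DB_PATH is not a directory: {path}")
    return path


if __name__ == "__main__":
    try:
        print("Database directory:", resolve_db_path())
    except RuntimeError as err:
        print("Configuration problem:", err)
```

Note that `export` only affects the current shell session; add the line to your shell profile if you want the setting to persist.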

Notes:

rRNA

PROT_ACC2TAXID

  # Download the `PROT_ACC2TAXID` file
  wget -c https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
  wget -c https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz.md5

  # Check file integrity
  md5sum -c prot.accession2taxid.gz.md5

  # Unpack the file into the database directory
  mkdir -p "$VirID_DB_PATH/accession2taxid"
  gunzip -c prot.accession2taxid.gz > "$VirID_DB_PATH/accession2taxid/prot.accession2taxid"
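To sanity-check the unpacked file, you can look up a few accessions by hand. NCBI's accession2taxid files are tab-separated with a header row and the columns accession, accession.version, taxid, and gi. The sketch below streams the file and returns the taxids for the requested accession.version values; the function name `lookup_taxids` is hypothetical, and this is not necessarily how VirID itself reads the file.

```python
import gzip


def lookup_taxids(path, wanted):
    """Map accession.version -> taxid for the accessions listed in `wanted`.

    `path` may be the gzipped download or the unpacked file. The file is
    streamed line by line, so memory use stays small even though the
    full table is very large.
    """
    wanted = set(wanted)
    opener = gzip.open if str(path).endswith(".gz") else open
    found = {}
    with opener(path, "rt") as handle:
        next(handle, None)  # skip the header row
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[1] in wanted:
                found[fields[1]] = int(fields[2])
                if len(found) == len(wanted):
                    break  # stop early once every accession is resolved
    return found
```

Accessions that are absent from the file are simply missing from the returned dictionary, which makes incomplete downloads easy to spot.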

NCBI Non-Redundant Protein Database (NR)

NCBI Nucleotide Sequence Database (NT) without Virus sequences

Note that if you use your own complete NT database (one that still contains virus sequences), some putative RNA virus sequences may be missed.


Usage

VirID method [options]

Example

VirID end_to_end -i 1.fastq -i2 2.fastq \
    -out_dir out_path --threads 60 --keep_dup

VirID assembly_and_basic_annotation -i 1.fastq -i2 2.fastq \
    -out_dir out_path --threads 60

VirID phylogenetic_analysis -classify_i test/test_contig.fasta \
    -out_dir out_path --threads 90 --keep_dup
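When processing many samples, it may be convenient to build the `end_to_end` command programmatically. The following is a minimal sketch that uses only the options shown in the examples above (`-i`, `-i2`, `-out_dir`, `--threads`, `--keep_dup`); the function name `end_to_end_command` is hypothetical, and VirID may accept further options that are not covered here.

```python
import shlex


def end_to_end_command(reads_1, reads_2, out_dir, threads=60, keep_dup=True):
    """Build the argument list for `VirID end_to_end` from paired-end reads."""
    cmd = [
        "VirID", "end_to_end",
        "-i", reads_1,
        "-i2", reads_2,
        "-out_dir", out_dir,
        "--threads", str(threads),
    ]
    if keep_dup:
        cmd.append("--keep_dup")
    return cmd


if __name__ == "__main__":
    cmd = end_to_end_command("1.fastq", "2.fastq", "out_path")
    # Print the shell-quoted command for inspection.
    print(shlex.join(cmd))
    # To execute it instead: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when file names contain spaces.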