RNA viruses exhibit vast phylogenetic diversity and can significantly impact public health and agriculture. However, current bioinformatics tools for viral discovery from metagenomic data frequently generate false positive virus results, overestimate viral diversity, and misclassify virus sequences. Additionally, current tools often fail to determine virus-host associations, which hampers investigation of the potential threat posed by a newly detected virus.
To address these issues, we developed VirID, a software tool specifically designed for the discovery and characterization of RNA viruses from metagenomic data.
VirID is built on a comprehensive RNA-dependent RNA polymerase (RdRP) database, which underpins a workflow comprising RNA virus discovery, phylogenetic analysis, and phylogeny-based virus characterization. Benchmark tests on a simulated data set demonstrated that VirID had high accuracy in profiling viruses and estimating viral richness.
In evaluations with real-world samples, VirID not only identified RNA viruses of all types but also provided accurate estimates of viral genetic diversity and virus classification, as well as comprehensive insights into virus associations with humans, animals, and plants. VirID therefore offers a robust tool for virus discovery and serves as a valuable resource in basic virological studies, pathogen surveillance, and early warning systems for infectious disease outbreaks.
Associated papers
:fire::collision: Our group has also developed LucaProt, an artificial intelligence-based model for identifying RNA virus RdRP protein sequences, which is now published in Cell.
fastp, which is more adaptable to different environments. If you encounter problems during use, feel free to raise an issue.
The VirID framework for automated RNA virus detection comprises three main stages: (i) RNA virus discovery, (ii) phylogenetic analysis, and (iii) phylogeny-based virus characterization. It produces outputs that include viral sequences, phylogenetic trees, and comprehensive information including sequence length, best BLASTx match, virus classification, and host association.
VirID requires third-party packages from the conda-forge and bioconda channels:
conda create -n VirID
conda activate VirID
conda install -c conda-forge -c bioconda blast bbmap seqkit mafft megahit trimal pplacer taxonkit bowtie2
conda install -c conda-forge -c bioconda fastp diamond==2.1.4 samtools==1.16.1
pip install Bio biopython DendroPy matplotlib numpy pandas regex seaborn tqdm
Notes:
Version of the tool available for reference:
The taxonkit dataset should also be downloaded!
All python packages will be downloaded automatically!
pip install VirID
# Install R packages
R
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ggtree")
packages <- c("tidyverse", "ggplot2", "RColorBrewer", "phangorn", "networkD3", "jsonlite", "dplyr")
# Install any packages that are missing, then load them all
ipak <- function(pkg) {
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg))
        install.packages(new.pkg)
    sapply(pkg, require, character.only = TRUE)
}
ipak(packages)
Note: tidyverse depends on systemfonts, and you may need the following command to install it:
conda install r-systemfonts
VirID requires an environment variable named VirID_DB_PATH, which points to the parent directory of the following databases.
See below for specific database configurations.
# Set the VirID_DB_PATH environment variable
export VirID_DB_PATH=/path/to/the/database/
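Before running the pipeline, it can help to confirm that the variable is set and that the expected database folders exist beneath it. The following is a minimal sketch, not part of VirID: `check_virid_db` is a hypothetical helper, and the subdirectory names you pass to it are up to you (`rRNA` and `accession2taxid` are the two that appear in the configuration steps of this README).

```shell
# Hypothetical helper (not shipped with VirID): verify that VirID_DB_PATH is set,
# is a directory, and contains every subdirectory named on the command line.
check_virid_db() {
    if [ -z "${VirID_DB_PATH:-}" ]; then
        echo "VirID_DB_PATH is not set" >&2
        return 1
    fi
    if [ ! -d "$VirID_DB_PATH" ]; then
        echo "VirID_DB_PATH is not a directory: $VirID_DB_PATH" >&2
        return 1
    fi
    _missing=0
    for sub in "$@"; do
        # ${VirID_DB_PATH%/} drops one trailing slash so messages read cleanly
        if [ ! -d "${VirID_DB_PATH%/}/$sub" ]; then
            echo "missing database directory: ${VirID_DB_PATH%/}/$sub" >&2
            _missing=1
        fi
    done
    return "$_missing"
}
```

For example, `check_virid_db rRNA accession2taxid` returns a non-zero status and reports each folder that is absent, so it can be used as a guard at the top of a run script.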
Notes:
The databases take up a lot of space, so make sure you have enough disk space. If you already have these databases, you can skip the download step and just configure them.
The download speed of the databases depends on your internet connection. You can also choose other download methods, such as ascp.
1.2 Unzip the file and use bowtie2 to build the index.
mkdir -p "$VirID_DB_PATH/rRNA"
bunzip2 -cv VirID_rRNA_db.fasta.bz2 > "$VirID_DB_PATH/rRNA/VirID_rRNA_db.fasta"
bowtie2-build "$VirID_DB_PATH/rRNA/VirID_rRNA_db.fasta" "$VirID_DB_PATH/rRNA/rRNA_cutout_ref"
#Download the `PROT_ACC2TAXID` file
wget -c https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
wget -c https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz.md5
# Check file integrity
md5sum -c prot.accession2taxid.gz.md5
# Unzip the file and configure it
mkdir -p "$VirID_DB_PATH/accession2taxid"
gunzip -c prot.accession2taxid.gz > "$VirID_DB_PATH/accession2taxid/prot.accession2taxid"
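After unzipping, a quick sanity check can catch a truncated or misplaced file before a long run. The sketch below is illustrative only: `check_acc2taxid` is a hypothetical helper, and the expected header (four tab-separated columns: `accession`, `accession.version`, `taxid`, `gi`) is an assumption about the current NCBI file layout that you should confirm against your download.

```shell
# Hypothetical helper (not shipped with VirID): sanity-check an unzipped
# accession2taxid file and print the number of data rows it contains.
# Assumption: the first line is NCBI's four-column, tab-separated header.
check_acc2taxid() {
    file=$1
    if [ ! -s "$file" ]; then
        echo "missing or empty file: $file" >&2
        return 1
    fi
    expected=$(printf 'accession\taccession.version\ttaxid\tgi')
    header=$(head -n 1 "$file")
    if [ "$header" != "$expected" ]; then
        echo "unexpected header in $file" >&2
        return 1
    fi
    # Data rows = total lines minus the header line
    echo $(( $(wc -l < "$file") - 1 ))
}
```

A typical call would be `check_acc2taxid "$VirID_DB_PATH/accession2taxid/prot.accession2taxid"`.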
Note that if you use your own complete NT library, some putative RNA virus sequences may be missed.
VirID method [options]
VirID end_to_end -i 1.fastq -i2 2.fastq \
-out_dir out_path --threads 60 --keep_dup
VirID assembly_and_basic_annotation -i 1.fastq -i2 2.fastq \
-out_dir out_path --threads 60
VirID phylogenetic_analysis -classify_i test/test_contig.fasta \
-out_dir out_path --threads 90 --keep_dup
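When several paired-end samples need processing, the `end_to_end` call shown above can be wrapped in a loop. This is a rough sketch rather than a supported feature: `run_virid_batch` is a hypothetical wrapper, the `<sample>_1.fastq` / `<sample>_2.fastq` naming scheme is an assumption about your data, and only flags that appear in the examples above are used. Setting `DRY_RUN` to any non-empty value prints the commands instead of running them.

```shell
# Hypothetical batch wrapper around the end_to_end command from this README.
# Assumption: mates are named <sample>_1.fastq and <sample>_2.fastq.
run_virid_batch() {
    in_dir=$1          # directory holding the FASTQ pairs
    out_root=$2        # one output folder per sample is created under this path
    threads=${3:-8}    # thread count passed to --threads (default is arbitrary)
    for r1 in "$in_dir"/*_1.fastq; do
        [ -e "$r1" ] || continue    # glob matched nothing
        r2="${r1%_1.fastq}_2.fastq"
        if [ ! -e "$r2" ]; then
            echo "skipping $r1: mate file $r2 not found" >&2
            continue
        fi
        sample=$(basename "$r1" _1.fastq)
        # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is non-empty
        ${DRY_RUN:+echo} VirID end_to_end -i "$r1" -i2 "$r2" \
            -out_dir "$out_root/$sample" --threads "$threads"
    done
}
```

For instance, `DRY_RUN=1 run_virid_batch reads/ results/ 60` lists the commands that would be run; other options such as `--keep_dup` can be appended inside the function as needed.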