Please post any questions, comments, or concerns.
The paper is published in Nature Communications:
https://doi.org/10.1038/s41467-018-05658-8
Maximal viral information recovery from sequence data using VirMAP
Nadim J Ajami1,2, Ω, Matthew C. Wong1,2, Ω, Matthew C. Ross1,2, Richard E. Lloyd2, Joseph F. Petrosino1,2.
1 Alkek Center for Metagenomics and Microbiome Research, and 2 Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, Texas.
Ω Authors contributed equally to this work.
Corresponding author:
Nadim J. Ajami
nadimajami@gmail.com
Abstract
Accurate classification of the human virome is critical to a full understanding of the role viruses play in health and disease. This implies the need for sensitive, specific, and practical pipelines that return precise outputs while still enabling case-specific post hoc analysis. Viral taxonomic characterization from metagenomic data suffers from high background noise and signal crosstalk that confounds current methods. Here we develop VirMAP that overcomes these limitations using techniques that merge nucleotide and protein information to taxonomically classify viral reconstructions independent of genome coverage or read overlap. We validate VirMAP using published data sets and viral mock communities containing RNA and DNA viruses and bacteriophages. VirMAP offers opportunities to enhance metagenomic studies seeking to define virome-host interactions, improve biosurveillance capabilities, and strengthen molecular epidemiology reporting.
Cite as:
Ajami, N. J., Wong, M. C., Ross, M. C., Lloyd, R. E., & Petrosino, J. F. (2018). Maximal viral information recovery from sequence data using VirMAP. Nature Communications, 9(1), 3205. https://doi.org/10.1038/s41467-018-05658-8
Rough Requirements:
Hardware (recommended):
64GB+ RAM
12 cores+ CPU
per instance.
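The hardware figures above can be checked on a Linux host before launching a run. The following is a minimal sketch, not part of VirMAP: the helper names `meets_minimum` and `total_mem_gib` are hypothetical, and the 60 GiB threshold in the usage comment is an assumption that allows for memory the kernel reserves on a nominal 64 GB machine.

```shell
#!/usr/bin/env bash
# Succeeds when a measured integer value meets a required minimum.
meets_minimum() {
    local have=$1 need=$2
    [ "$have" -ge "$need" ]
}

# Total physical memory in whole GiB, read from /proc/meminfo (Linux only).
total_mem_gib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# Usage (thresholds are assumptions, adjust as needed):
#   meets_minimum "$(nproc)" 12          || echo "fewer than 12 cores"
#   meets_minimum "$(total_mem_gib)" 60  || echo "less than ~64 GB RAM"
```

When several Virmap instances share one machine, multiply the thresholds by the number of simultaneous instances, since the requirements are per instance.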
Perl:
Multi-threaded 5.24+
Installed programs:
diamond (https://github.com/bbuchfink/diamond)
bbtools (https://jgi.doe.gov/data-and-tools/bbtools)
blast+ (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST)
lbzip2 (http://lbzip2.org/download)
MEGAHIT (https://github.com/voutcn/megahit)
normalize-by-median.py (https://github.com/dib-lab/khmer)
pigz (https://github.com/madler/pigz)
vsearch (https://github.com/torognes/vsearch)
zstd (https://github.com/facebook/zstd)
GNU parallel (https://www.gnu.org/software/parallel/)
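A quick way to confirm that the programs listed above are reachable is to look each one up on `PATH`. This is a sketch under assumptions: `missing_tools` is a hypothetical helper, and the executable names in `virmap_tools` are guesses at how each package is typically installed (for example, `bbmap.sh` for bbtools and `blastn` for blast+); your installation may use different names.

```shell
#!/usr/bin/env bash
# Print each requested executable that is not found on PATH.
# Returns non-zero if at least one is missing.
missing_tools() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            status=1
        fi
    done
    return "$status"
}

# Assumed executable names for the programs listed in this README.
virmap_tools="diamond bbmap.sh blastn lbzip2 megahit normalize-by-median.py pigz vsearch zstd parallel"

# Usage (word splitting of $virmap_tools is intentional):
#   missing_tools $virmap_tools || echo "install the tools listed above"
```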
CPAN Dependencies:
Compress::Zstd (https://metacpan.org/pod/Compress::Zstd)
REST::Client (https://metacpan.org/pod/REST::Client)
OpenSourceOrg::API (https://metacpan.org/pod/OpenSourceOrg::API)
POSIX::1003::Sysconf (https://metacpan.org/pod/distribution/POSIX-1003/lib/POSIX/1003/Sysconf.pod)
RocksDB (https://metacpan.org/pod/RocksDB)
Sereal (https://metacpan.org/pod/Sereal)
Text::Levenshtein::Damerau::XS (https://metacpan.org/pod/Text::Levenshtein::Damerau::XS)
Text::Levenshtein::XS (https://metacpan.org/pod/Text::Levenshtein::XS)
Array::Shuffle (https://metacpan.org/pod/Array::Shuffle)
Array::Split (https://metacpan.org/pod/Array::Split)
Sys::MemInfo (https://metacpan.org/pod/Sys::MemInfo)
XML::Hash::XS (https://metacpan.org/pod/XML::Hash::XS)
Cpanel::JSON::XS (https://metacpan.org/pod/Cpanel::JSON::XS)
Statistics::Basic (https://metacpan.org/pod/distribution/Statistics-Basic/lib/Statistics/Basic.pod)
Custom Dependencies:
FAlite (http://korflab.ucdavis.edu/Unix_and_Perl/FAlite.pm)
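Whether the Perl modules above are installed can be tested by asking `perl` to load each one. The helper below is an illustrative sketch (the name `missing_perl_modules` is hypothetical); it relies only on the standard `perl -MModule -e 1` idiom. FAlite is not on CPAN, so it will be reported as missing unless `FAlite.pm` sits in a directory on `@INC` (for example via `PERL5LIB`).

```shell
#!/usr/bin/env bash
# Print each Perl module that fails to load; return non-zero if any fail.
missing_perl_modules() {
    local mod status=0
    for mod in "$@"; do
        if ! perl -M"$mod" -e 1 >/dev/null 2>&1; then
            printf '%s\n' "$mod"
            status=1
        fi
    done
    return "$status"
}

# Usage with a few of the modules listed in this README:
#   missing_perl_modules Compress::Zstd REST::Client RocksDB Sereal \
#       Sys::MemInfo Cpanel::JSON::XS FAlite
```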
r5d.24xlarge (recommended) or c5d.24xlarge.
Amazon Linux 2 (recommended) or Ubuntu.
Root volume size >30 GB
Local instance store SSD with >500 GB of storage, and at least 64 GB of RAM. As GenBank expands, the minimum SSD and RAM requirements will grow as well.
Instructions have only been tested on a fresh Amazon Linux 2 image.
Example command line input for Amazon Linux 2:
sudo yum -y update
wget https://raw.githubusercontent.com/cmmr/virmap/master/virmapInstall.pl
chmod 0775 virmapInstall.pl
./virmapInstall.pl
mkdir /scratch/tmp
(if it doesn't already exist)
export TMPDIR=/scratch/tmp
mkdir /scratch/VirmapDb
makeVirmapDb.pl --outputDir /scratch/VirmapDb
vdb-config -i
Exit out of vdb-config by hitting 'x'
mkdir /home/$USER/VirmapTest
fasterq-dump -t /dev/shm -e 4 -O /home/$USER/VirmapTest SRR9875293
Virmap.pl --threads $(nproc) --readF /home/$USER/VirmapTest/SRR9875293_1.fastq --readR /home/$USER/VirmapTest/SRR9875293_2.fastq --useMegahit --useBbnorm --sampleName SRR9875293 --outputDir /home/$USER/VirmapTest/VirmapRun --taxaJson /scratch/VirmapDb/Taxonomy.virmap --virDmnd /scratch/VirmapDb/virDmnd.dmnd --virBbmap /scratch/VirmapDb/virBbmap --gbBlastn /scratch/VirmapDb/gbBlastn --gbBlastx /scratch/VirmapDb/gbBlastx.dmnd 2>/home/$USER/VirmapTest/VirmapRun.err
Save VirmapDB to an S3 location.
Make a snapshot/image of the VM.
When running Virmap, copy the database from S3 to the local instance store SSD for use. Using the database over gp2 is not recommended.
Launch image.
Machine must have at least 64GB RAM per simultaneous instance of Virmap.
Local instance SSD >500GB.
ssh into machine.
Recreate local scratch space.
sudo mkdir -p /scratch; nvmeList=$(sudo nvme list | grep "Amazon EC2 NVMe Instance Storage" | cut -f1 -d " " | tr "\n" " "); nvmeCount=$(sudo nvme list | grep -c "Amazon EC2 NVMe Instance Storage"); if [ "$nvmeCount" -gt 1 ]; then sudo mdadm --create /dev/md0 --level=0 --raid-devices="$nvmeCount" $nvmeList; sudo mkfs.ext4 /dev/md0; sudo mount /dev/md0 /scratch; else sudo mkfs.ext4 $nvmeList; sudo mount $nvmeList /scratch; fi
sudo chown -R <user>:<group> /scratch
sudo chmod -R 0775 /scratch
Copy VirmapDB from your S3 space to /scratch.
Set TMPDIR to somewhere on /scratch.
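The last two steps can be scripted. The sketch below assumes the AWS CLI is installed and configured; `use_scratch_tmp` and `restore_virmap_db` are hypothetical helper names, and the S3 URI is a placeholder you must replace with your own bucket and prefix.

```shell
#!/usr/bin/env bash
# Create a temp directory (default /scratch/tmp) and point TMPDIR at it.
use_scratch_tmp() {
    local dir=${1:-/scratch/tmp}
    mkdir -p "$dir" && export TMPDIR=$dir
}

# Copy the database from S3 to local scratch (default /scratch/VirmapDb).
# The first argument is the S3 URI where you saved the database.
restore_virmap_db() {
    local s3_uri=$1 dest=${2:-/scratch/VirmapDb}
    mkdir -p "$dest" && aws s3 sync "$s3_uri" "$dest"
}

# Usage (bucket name is a placeholder):
#   restore_virmap_db s3://<your-bucket>/VirmapDb
#   use_scratch_tmp
# To save the database in the first place, reverse the direction:
#   aws s3 sync /scratch/VirmapDb s3://<your-bucket>/VirmapDb
```

Because `aws s3 sync` only transfers files that differ, rerunning the restore after an interrupted copy picks up where it left off.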
Virmap now uses GNU parallel in one of its substeps. Please cite GNU parallel if you use Virmap:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.