ViralCC is a new open-source metagenomic Hi-C-based binning pipeline to recover high-quality viral genomes. ViralCC not only considers the Hi-C interaction graph, but also puts forward a novel host proximity graph of viral contigs as a complementary source of information to the remarkably sparse Hi-C interaction map. The two graphs are then integrated, and Leiden graph clustering is applied to the integrated graph to generate draft viral genomes.
If you want to reproduce results in our ViralCC paper, please read our instructions here.
Scripts to process the intermediate data and plot figures of our ViralCC paper are available here.
ViralCC requires only a standard computer with enough RAM to support the in-memory operations.
ViralCC v1.0.0 is supported and tested on macOS and Linux systems.
ViralCC mainly depends on the Python scientific stack.
numpy
scipy
pysam
scikit-learn
pandas
Biopython
leidenalg
We recommend using conda to install ViralCC.
Typical installation time is 1-5 minutes depending on your system.
git clone https://github.com/dyxstat/ViralCC.git
Once the clone completes, enter the repository folder and create the ViralCC environment using conda, choosing the environment file that matches your operating system (Linux or macOS).
cd ViralCC
conda env create -f viralcc_linux_env.yaml
or
conda env create -f viralcc_osx_env.yaml
conda activate ViralCC_env
We provide a small simulated dataset, located under the Test directory, to demo and test the software:
Test/final.contigs.fa
Test/MAP_SORTED.bam
Test/viral_contigs.txt
Run ViralCC on the test dataset:
python ./viralcc.py pipeline -v Test/final.contigs.fa Test/MAP_SORTED.bam Test/viral_contigs.txt Test/out_test
The demo is expected to run in a few seconds, and the expected output files are in the 'Test/out_test' directory:
Test/out_test/cluster_viral_contig.txt
Test/out_test/prokaryotic_contig_info.csv
Test/out_test/VIRAL_BIN/VIRAL_BIN0000.fa
Test/out_test/VIRAL_BIN/VIRAL_BIN0001.fa
Test/out_test/viralcc.log
Test/out_test/viral_contig_info.csv
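To confirm that the demo produced the files listed above, a small check along the following lines may help. It is only a sketch: the file names are copied from the listing, while the helper name `check_output` is hypothetical and not part of ViralCC. Viral bins are matched by pattern, since their number depends on the dataset.

```python
from pathlib import Path

# File names copied from the expected demo output listed above.
EXPECTED_FILES = [
    "cluster_viral_contig.txt",
    "prokaryotic_contig_info.csv",
    "viral_contig_info.csv",
    "viralcc.log",
]


def check_output(out_dir):
    """Return (missing_files, bin_fasta_paths) for a ViralCC output directory.

    Hypothetical helper for illustration; it only inspects the file system.
    """
    out = Path(out_dir)
    missing = [name for name in EXPECTED_FILES if not (out / name).is_file()]
    bin_dir = out / "VIRAL_BIN"
    # Draft viral genomes are written as VIRAL_BIN/VIRAL_BINxxxx.fa
    bins = sorted(bin_dir.glob("VIRAL_BIN*.fa")) if bin_dir.is_dir() else []
    return missing, bins
```

Calling `check_output("Test/out_test")` after the demo run should return an empty list of missing files and at least one bin path if the run completed as described.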
Follow the instructions in this section to process the raw shotgun and Hi-C data and generate the input for ViralCC:
Adaptor sequences are removed by bbduk from the BBTools suite with parameters ktrim=r k=23 mink=11 hdist=1 minlen=50 tpe tbo, and reads are quality-trimmed using bbduk with parameters trimq=10 qtrim=r ftm=5 minlen=50. Additionally, the first 10 nucleotides of Hi-C reads are trimmed by bbduk with parameter ftl=10. Identical PCR optical and tile-edge duplicates of Hi-C reads are removed by the script clumpify.sh from the BBTools suite.
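The three bbduk steps above can be assembled into command lines as in the sketch below. The parameter strings come from the text; everything else is an assumption made for illustration: the `bbduk.sh` launcher name, the input and output file names, the adapter reference `adapters.fa`, and the helper `bbduk_command`. The clumpify.sh step is left out because its parameters are not given here.

```python
import shlex

# Parameter sets quoted from the preprocessing description above.
ADAPTER_TRIM = "ktrim=r k=23 mink=11 hdist=1 minlen=50 tpe tbo".split()
QUALITY_TRIM = "trimq=10 qtrim=r ftm=5 minlen=50".split()
HIC_HEAD_TRIM = ["ftl=10"]  # applied to Hi-C reads only


def bbduk_command(in1, in2, out1, out2, params, ref=None):
    """Build one paired-end bbduk command line as a list of arguments.

    Hypothetical helper; file names and the adapter reference are placeholders.
    """
    cmd = ["bbduk.sh", f"in1={in1}", f"in2={in2}", f"out1={out1}", f"out2={out2}"]
    if ref is not None:
        cmd.append(f"ref={ref}")  # adapter sequences used for k-mer trimming
    return cmd + list(params)


# Placeholder file names for a Hi-C library (assumptions, not from ViralCC).
steps = [
    bbduk_command("HIC1.fastq.gz", "HIC2.fastq.gz", "HIC1_AQ.fastq.gz",
                  "HIC2_AQ.fastq.gz", ADAPTER_TRIM, ref="adapters.fa"),
    bbduk_command("HIC1_AQ.fastq.gz", "HIC2_AQ.fastq.gz", "HIC1_CL.fastq.gz",
                  "HIC2_CL.fastq.gz", QUALITY_TRIM),
    bbduk_command("HIC1_CL.fastq.gz", "HIC2_CL.fastq.gz", "HIC1_trim.fastq.gz",
                  "HIC2_trim.fastq.gz", HIC_HEAD_TRIM),
]

for cmd in steps:
    print(shlex.join(cmd))  # print the commands for review rather than running them
```

Printing the commands first makes it easy to review them before passing each list to `subprocess.run`.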
For the shotgun library, de novo metagenome assembly is performed with assembly software such as MEGAHIT.
megahit -1 SG1.fastq.gz -2 SG2.fastq.gz -o ASSEMBLY --min-contig-len 1000 --k-min 21 --k-max 141 --k-step 12 --merge-level 20,0.95
Hi-C paired-end reads are aligned to the assembled contigs using DNA mapping software such as BWA MEM. Then, samtools with parameters 'view -F 0x904' is applied to remove unmapped reads, supplementary alignments, and secondary alignments. The resulting BAM file must be sorted by name using 'samtools sort -n'.
bwa index final.contigs.fa
bwa mem -5SP final.contigs.fa hic_read1.fastq.gz hic_read2.fastq.gz > MAP.sam
samtools view -F 0x904 -bS MAP.sam > MAP_UNSORTED.bam
samtools sort -n MAP_UNSORTED.bam -o MAP_SORTED.bam
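The value 0x904 passed to `samtools view -F` is the sum of three SAM flag bits: 0x4 (read unmapped), 0x100 (secondary alignment), and 0x800 (supplementary alignment). The sketch below reproduces that arithmetic and mimics the exclusion test on a few example flag values; the function name `passes_filter` is hypothetical.

```python
# Flag bits as defined by the SAM format specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800

# Combined exclusion mask used in the samtools command above.
EXCLUDE_MASK = UNMAPPED | SECONDARY | SUPPLEMENTARY


def passes_filter(flag, exclude_mask=EXCLUDE_MASK):
    """Mimic `samtools view -F`: keep a record only if no masked bit is set."""
    return (flag & exclude_mask) == 0


# Example flags: 99 is a mapped primary read in a proper pair, 77 is unmapped,
# 355 adds the secondary bit to 99, and 2147 adds the supplementary bit to 99.
for flag in (99, 77, 355, 2147):
    print(flag, hex(flag), "kept" if passes_filter(flag) else "removed")
```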
Assembled contigs are screened by viral sequence detection software, such as VirSorter, to identify viral contigs.
wrapper_phage_contigs_sorter_iPlant.pl -f final.contigs.fa --db 1 --wdir virsorter_output --data-dir virsorter-data
python ./viralcc.py pipeline [Parameters] FASTA_file BAM_file VIRAL_file OUTPUT_directory
--min-len: Minimum acceptable contig length (default 1000)
--min-mapq: Minimum acceptable alignment quality (default 30)
--min-match: Accepted alignments must be at least N matches (default 30)
--min-k: Lower bound of k for determining the host proximity graph (default 4)
--random-seed: Random seed for the Leiden clustering (default 42)
--cover (optional): Overwrite existing files. Otherwise, an error is returned if the output file already exists.
-v (optional): Verbose output with more specific details of the ViralCC procedure.
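As a rough illustration of what the --min-mapq and --min-match thresholds mean, the sketch below accepts an alignment only if its mapping quality and its number of matched bases both reach the defaults listed above. This is an assumption about the intent of the parameters, not ViralCC's actual implementation: here "matches" is taken to be the total length of the M and = operations in the CIGAR string, and both function names are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def matched_bases(cigar):
    """Sum the lengths of alignment-match operations (M and =) in a CIGAR string."""
    return sum(int(length) for length, op in CIGAR_OP.findall(cigar) if op in "M=")


def accept_alignment(mapq, cigar, min_mapq=30, min_match=30):
    """Assumed filter: keep alignments meeting both thresholds (defaults from the list above)."""
    return mapq >= min_mapq and matched_bases(cigar) >= min_match


# A confident full-length alignment passes; a low-MAPQ or mostly clipped one does not.
print(accept_alignment(60, "150M"))
print(accept_alignment(10, "150M"))
print(accept_alignment(60, "25M125S"))
```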
python ./viralcc.py pipeline -v final.contigs.fa MAP_SORTED.bam viral_contigs.txt out_directory
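When ViralCC is run from a script, the command line can be assembled from the parameters listed above. The following is a minimal sketch: the flags and defaults are taken from the parameter list, while the helper name `viralcc_command` and its keyword arguments are hypothetical.

```python
def viralcc_command(fasta, bam, viral, out_dir, min_len=1000, min_mapq=30,
                    min_match=30, min_k=4, random_seed=42,
                    cover=False, verbose=False):
    """Build the argument list for `viralcc.py pipeline` (illustrative helper)."""
    cmd = [
        "python", "./viralcc.py", "pipeline",
        "--min-len", str(min_len),
        "--min-mapq", str(min_mapq),
        "--min-match", str(min_match),
        "--min-k", str(min_k),
        "--random-seed", str(random_seed),
    ]
    if cover:
        cmd.append("--cover")  # overwrite existing output files
    if verbose:
        cmd.append("-v")       # verbose logging
    # Positional arguments: FASTA_file BAM_file VIRAL_file OUTPUT_directory
    cmd += [fasta, bam, viral, out_dir]
    return cmd


cmd = viralcc_command("final.contigs.fa", "MAP_SORTED.bam",
                      "viral_contigs.txt", "out_directory", verbose=True)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside the repository folder with the conda environment activated.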
If you have any questions or suggestions, please contact Yuxuan Du (yuxuandu@usc.edu).