VENAS: a Viral genome Evolution Network Analysis System

Introduction

Comprehensive analyses of viral genomes can provide a global picture of SARS-CoV-2 transmission and help to predict upcoming trends of the pandemic. However, the rapid accumulation of SARS-CoV-2 genomes has produced data of a size and complexity that exceed the capacity of existing methods for constructing evolution networks through virus genotyping. VENAS applies reliable computational algorithms to build an integrative genomic analysis system that enables researchers to trace viral mutations along transmission routes using the daily updated SARS-CoV-2 genomes.

VENAS can construct the network from an alignment file containing 10k sequences in about 10 minutes.

Pre-requisites

VENAS requires Python 3 with PyPy and the following libraries installed: argparse, pandas, numpy (http://www.numpy.org/), networkx (version 2.5), CDlib, matplotlib, biopython (http://biopython.org/wiki/Main_Page), click, and tqdm. If you want to provide a FASTA file as input, VENAS also needs MAFFT (https://mafft.cbrc.jp/alignment/software/) in the executable path. You can then use “multi_mafft.py” to perform a multi-threaded multiple sequence alignment.
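
Before running the scripts, it can be useful to confirm that the libraries listed above are importable and that MAFFT is on the path. The following is a minimal sketch, not part of VENAS: the function names are hypothetical, and the module names are the usual import names of the listed packages (biopython is imported as `Bio`, CDlib as `cdlib`).

```python
import importlib.util
import shutil

# Usual import names for the libraries listed above.
# Note: biopython is imported as "Bio" and CDlib as "cdlib".
REQUIRED_MODULES = [
    "argparse", "pandas", "numpy", "networkx", "cdlib",
    "matplotlib", "Bio", "click", "tqdm",
]


def missing_modules(names):
    """Return the module names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def mafft_available():
    """Return True if an executable named 'mafft' is found on PATH."""
    return shutil.which("mafft") is not None


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing Python modules:", ", ".join(missing) or "none")
    print("MAFFT on PATH:", mafft_available())
```

This check only reports whether each module can be found; it does not verify versions such as the networkx 2.5 requirement.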

You will also need make and gcc with C++17 support to compile the parallel implementation used by haplotype_network.py (see Part 2).

Installation

Clone the repository with the following command

$ git clone https://github.com/qianjiaqiang/VENAS.git

Build the shared library

$ cd parham && make && cd ..

Basic Usage

This section presents the basic usage of VENAS. It assumes that all the scripts are in the system path.

Part 1: Effective parsimony-informative site (ePIS) finding and minor allele frequency (MAF) calculation

Note: The -i parameter is the directory where the input file is located. The -f parameter is the ID of the reference genome sequence in the .ma file.

#!bash
python -u parsimony-informative.py -i example_data -m variation_graph_taxonid_2697049_outgroupid_none.ma -b none -r 0 -f "OEAV139851"
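
To illustrate the two quantities named in this part, the sketch below uses the textbook definitions only: a site is parsimony-informative when at least two different bases each occur in at least two sequences, and the minor allele frequency is the frequency of the second most common base. It is not the VENAS implementation and does not reproduce the additional criteria that make a site "effective" (ePIS). The function names and the toy alignment are assumptions for illustration.

```python
from collections import Counter

# Characters ignored when counting bases (gaps and ambiguous calls).
GAP_OR_AMBIGUOUS = set("-NnXx?")


def column_counts(alignment, pos):
    """Count the unambiguous bases found at column `pos` (0-based)."""
    return Counter(
        seq[pos].upper() for seq in alignment if seq[pos] not in GAP_OR_AMBIGUOUS
    )


def is_parsimony_informative(counts):
    """True if at least two different bases each occur in two or more sequences."""
    return sum(1 for n in counts.values() if n >= 2) >= 2


def minor_allele_frequency(counts):
    """Frequency of the second most common base; 0.0 for an invariant site."""
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0
    return counts.most_common(2)[1][1] / total


def scan(alignment):
    """Return (position, MAF) for every parsimony-informative column."""
    results = []
    for pos in range(len(alignment[0])):
        counts = column_counts(alignment, pos)
        if is_parsimony_informative(counts):
            results.append((pos, minor_allele_frequency(counts)))
    return results


if __name__ == "__main__":
    # Toy alignment of five equal-length sequences (hypothetical data).
    toy = ["ACGTA", "ACGTA", "ATGTC", "ATGAC", "ACGTA"]
    for pos, maf in scan(toy):
        print(pos, maf)
```

In the toy alignment, column 3 has a variant seen in only one sequence, so it is variable but not parsimony-informative and is skipped.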

Parameter Description:

Results Description:

Part 2: Viral genome evolution network construction

#!bash
python -u haplotype_network.py example_data

Results Description:

Part 3: Topological classification and major path recognition

Note: Part 3 needs only the first two columns of the “net_all.txt” file produced in Part 2. They can be extracted as shown below.

#!bash
awk -F'\t' '{print $1","$2}' example_data/net_all.txt > net.csv
sed -i '1i\Source,Target' net.csv
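
The `sed -i '1i\...'` form above relies on GNU sed. Where that is unavailable, the same conversion can be sketched in Python. This is an illustrative equivalent rather than a VENAS script: the function name is hypothetical, and it assumes, as the awk command does, that net_all.txt is tab-separated with no header line.

```python
import csv


def edges_to_csv(net_all_path, out_path):
    """Copy the first two tab-separated columns into a Source,Target CSV.

    Returns the number of edges written. Unlike the awk command, lines
    with fewer than two fields are skipped instead of being copied.
    """
    written = 0
    with open(net_all_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        # QUOTE_NONE: treat quote characters in the input as ordinary text.
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        writer = csv.writer(dst, lineterminator="\n")
        writer.writerow(["Source", "Target"])
        for row in reader:
            if len(row) < 2:
                continue  # blank or malformed line
            writer.writerow(row[:2])
            written += 1
    return written


if __name__ == "__main__":
    import os
    if os.path.exists("example_data/net_all.txt"):
        print(edges_to_csv("example_data/net_all.txt", "net.csv"), "edges written")
```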

Example input net.csv:

Source,Target
1,57
5,6
1,210
10,23
10,59
10,69
3,191
10,91

Once the net.csv file has been prepared, you are ready for Part 3.

#!bash
python main_path_example.py

Results Description:

The resulting net.csv and nodeTable.csv files are written to the current working directory. You can visualize the resulting viral genome evolution network with general relationship graph or force-directed graph tools, such as the web-based Apache ECharts (https://echarts.apache.org/) and d3.js (https://d3js.org/), or the desktop application Gephi (recommended).

Publication

Y. Ling, R. Cao, J. Qian et al. An interactive viral genome evolution network analysis system enabling rapid large-scale molecular tracing of SARS-CoV-2, Science Bulletin 2022, 67(7):665-669. [https://doi.org/10.1016/j.scib.2022.01.001]

About Us

Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences.