Discovery of Novel Endogenous Viruses
This project strives to implement the BUD algorithm from ViruSpy in Python. In short, reads from a Short Read Archive (SRA) are screened for putative virus motifs/domains using known virus nucleotide an protein sequences. Sequences containing such motifs or domains serve as queries in subsequent searches using the same SRA to extend the initial sequence until non-virus sequences/domains are encountered. This indicates that either en exogenous virus has been identified or an endogenous virus within a the host genome.
Setup analysis enviroment:
git clone https://github.com/NCBI-Hackathons/EndoVir.git
cd Endovir
./setup.sh
All external tools have to be currently in $PATH
. Please see the corresponding
README files for installation instructions.
cd $ENDOVIR
(should be in EndoVir/)mkdir tools
cd tools
MagicBLAST 1.3.0 [check for updates]
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/magicblast/LATEST/ncbi-magicblast-1.3.0-x64-linux.tar.gz
tar -xvzf ncbi-magicblast-1.3.0-x64-linux.tar.gz
export PATH=$PATH:$ENDOVIR/tools/ncbi-magicblast-1.3.0/bin/
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.8.2-1/sratoolkit.2.8.2-1-centos_linux64.tar.gz
tar -xvzf sratoolkit.2.8.2-1-centos_linux64.tar.gz
export PATH=$PATH:$ENDOVIR/tools/sratoolkit.2.8.2-1-centos_linux64/bin/
git clone https://github.com/voutcn/megahit.git
cd megahit
make -j $(cat /proc/cpuinfo | grep processor | wc -l)
(-j n, where n is the # of cores you want to use)export PATH=$PATH:$ENDOVIR/tools/megahit
*- [ABYSS2]
wget http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases/2.0.2/abyss-2.0.2.tar.gzecho "PATH=$PATH:$ENDOVIRPATH" >> ~/.bashrc
# only if you switch the console/login after installsource ~/.bashrc
# just in caseChange into your working directory work
cd $ENDOVIR/work
python3.6 ../src/endovir.py
The underlying design of endovir
will facilitate the use of external tools,
e.g. assemblers or parser, without changing the BUD routine itself. Further, the
results of the intermediate steps can be parsed and used to set the parameters
for each subsequent step in the analysis pipeline.
The use of STDIN and STDOUT is used were possible to communicate between the external tools, thereby reducing the usage of intermediate files as much as possible. In addition, only the Python standard libraries should be used.
The pipline has three major steps (in src
):
endovir.Endovir()
: creates the analysis environment and prepares the
screening.
screener.Screener()
: Initiates the screen and identifies the initial,
putative virus contigs.
viruscontig.VirusContig()
: Each putative virus contig is expanded and
analyzed independently.
external screening tools, e.g MagicBLAST, have wrappers and parser in their
corresponding namespace in lib
.
Only the Python standard libraries are used. However, the pipeline depends on
several external tools which are called using the subprocess
module: