NCBI-Hackathons / EndoVir

Discovery of Novel Endogenous Viruses
MIT License

EndoVir

Abstract

This project strives to implement the BUD algorithm from ViruSpy in Python. In short, reads from a Sequence Read Archive (SRA) data set are screened for putative virus motifs/domains using known virus nucleotide and protein sequences. Sequences containing such motifs or domains serve as queries in subsequent searches against the same SRA data set to extend the initial sequence until non-virus sequences/domains are encountered. This indicates that either an exogenous virus or an endogenous virus within the host genome has been identified.
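The extension loop described above can be illustrated with a toy, in-memory sketch. It is not the EndoVir or ViruSpy implementation: the real pipeline delegates read searching and assembly to external tools, whereas this sketch uses plain string overlaps. The function names, the `looks_viral` callback, and the `min_overlap` and `max_rounds` parameters are all assumptions made for illustration.

```python
def extend_right(contig, reads, min_overlap=5):
    """Return the longest extension any read offers beyond the contig's 3' end.

    Toy stand-in for "search the same read set with the current contig".
    Returns an empty string if no read overlaps by at least `min_overlap`.
    """
    best = ""
    for read in reads:
        # Try the longest suffix/prefix overlap first; the read must
        # contribute at least one new base, hence len(read) - 1.
        for k in range(min(len(contig), len(read) - 1), min_overlap - 1, -1):
            if contig.endswith(read[:k]):
                if len(read) - k > len(best):
                    best = read[k:]
                break
    return best


def bud_like_extension(seed, reads, looks_viral, min_overlap=5, max_rounds=100):
    """Extend `seed` until the reads run out or the new flank is not viral.

    `looks_viral` is a hypothetical callback standing in for the
    motif/domain screen. Returns (contig, reason_for_stopping).
    """
    contig = seed
    for _ in range(max_rounds):
        extension = extend_right(contig, reads, min_overlap)
        if not extension:
            return contig, "no further reads"
        if not looks_viral(extension):
            return contig, "non-viral flank reached"
        contig += extension
    return contig, "round limit reached"
```

In this simplified picture, stopping at a non-viral flank loosely corresponds to the endogenous case (virus sequence bordered by host sequence), while running out of overlapping reads corresponds to the end of an isolated contig.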

Setup

Set up the analysis environment:

Install external tools:

Currently, all external tools have to be in $PATH. Please see the corresponding README files for installation instructions.
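A quick way to verify the $PATH requirement before starting a run is sketched below, using only the standard library. The helper name `missing_tools` is hypothetical, and the tool list in the usage part is a placeholder: `magicblast` is the only executable name taken from this README, so extend the list to match your installation.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found in $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Placeholder list: add the other external tools you installed.
    missing = missing_tools(["magicblast"])
    if missing:
        print("Not found in $PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found.")
```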

Run

Change into your working directory work

Design

The underlying design of endovir facilitates the use of external tools, e.g. assemblers or parsers, without changing the BUD routine itself. Further, the results of intermediate steps can be parsed and used to set the parameters for each subsequent step in the analysis pipeline.

STDIN and STDOUT are used where possible to communicate between the external tools, thereby keeping the number of intermediate files to a minimum. In addition, only the Python standard libraries should be used.
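The following sketch shows one way to connect two external programs through pipes with the subprocess module, so that no intermediate file is written. The function name `run_piped` is hypothetical, and the two commands are stand-ins that call the Python interpreter itself so the example runs anywhere; in practice they would be the command lines of the actual tools.

```python
import subprocess
import sys


def run_piped(producer_cmd, consumer_cmd):
    """Feed the producer's STDOUT into the consumer's STDIN.

    Returns the consumer's output as a list of lines.
    """
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        consumer_cmd, stdin=producer.stdout, stdout=subprocess.PIPE, text=True
    )
    # Close our copy of the pipe so the producer notices a broken pipe
    # if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer_status = producer.wait()
    if producer_status != 0 or consumer.returncode != 0:
        raise RuntimeError("a pipeline stage exited with a non-zero status")
    return output.splitlines()


# Stand-in commands (assumptions for illustration only).
PRODUCER = [sys.executable, "-c", "print('read1'); print('read2')"]
CONSUMER = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.writelines(line.upper() for line in sys.stdin)",
]

if __name__ == "__main__":
    for line in run_piped(PRODUCER, CONSUMER):
        print(line)
```

Because `communicate()` collects the whole output in memory, a very large result would be better consumed line by line from `consumer.stdout`.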

Workflow diagram

Endovir diagram

Pipeline approach

The pipeline has three major steps (in src):

External screening tools, e.g. MagicBLAST, have wrappers and parsers in their corresponding namespace in lib.
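As an illustration of the wrapper/parser split, a minimal sketch follows. The class names are hypothetical and do not reflect the actual contents of lib. The command line assumes the `-sra`, `-db` and `-outfmt tabular` options of Magic-BLAST, and the parser assumes tab-separated output whose first three columns are read identifier, reference identifier and percent identity, with comment lines starting with `#`. Please check both assumptions against the Magic-BLAST documentation for your version.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Hit(NamedTuple):
    """One alignment line, reduced to the columns used here."""
    read_id: str
    reference_id: str
    identity: float


class MagicblastWrapper:
    """Hypothetical wrapper: it only assembles the command line."""

    def __init__(self, srr: str, database: str, executable: str = "magicblast"):
        self.srr = srr
        self.database = database
        self.executable = executable

    def command(self) -> List[str]:
        # Assumed options; verify them for your Magic-BLAST version.
        return [self.executable, "-sra", self.srr,
                "-db", self.database, "-outfmt", "tabular"]


class TabularParser:
    """Hypothetical parser for tab-separated hit lines."""

    def parse(self, lines: Iterable[str]) -> Iterator[Hit]:
        for line in lines:
            # Skip blank lines and comment/header lines.
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            yield Hit(fields[0], fields[1], float(fields[2]))
```

Keeping command construction and output parsing in separate classes means a different screening tool can be swapped in by providing its own pair, without touching the code that drives the extension loop. The parser accepts any iterable of lines, so it can read directly from a process's STDOUT.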

Dependencies

Only the Python standard libraries are used. However, the pipeline depends on several external tools which are called using the subprocess module:

References: