multiPhATE v.2.0.2

/MultiPhATE/ - multiPhATE2 - https://github.com/carolzhou/multiPhATE2.git

This code was developed by Carol L. Ecale Zhou and Jeffrey Kimbrel at Lawrence Livermore National Laboratory.

THIS CODE IS COVERED BY THE GPL-3 LICENSE. SEE INCLUDED FILE GPL-3.pdf FOR DETAILS.

Index

ABOUT THE MULTI-PHATE PIPELINE DRIVER
ABOUT THE PHATE PIPELINE
ABOUT COMPARE-GENE-PROFILES and the GENOMICS MODULE
Quick Start Guide (TL;DR)
HOW TO SET UP MULTI-PHATE ON YOUR LOCAL MACHINE
HOW TO WRITE A CONFIGURATION FILE
PIPELINE EXECUTION
HOW TO USE CHECKPOINTING
SUPPORTING DATABASES
SUPPORTING 3rd PARTY CODES
CONDA INSTALLATION
MultiPHATE OUTPUT FILES
INSTALLATION AND SET-UP CHECKLIST
TROUBLESHOOTING
RUNNING PHATE AS AN "EMBARRASSINGLY PARALLEL" CODE
FURTHER RECOMMENDATIONS
CAUTIONS
PUBLICATION
WHAT'S NEW?

ABOUT THE MULTI-PHATE PIPELINE DRIVER

MultiPhATE is a command-line program that runs gene finding and the PhATE annotation code over user-specified phage genomes, then performs gene-by-gene comparisons among the genomes. The multiPhate.py code takes a single argument, a configuration file (hereafter referred to as multiPhate.config; use the file sample.multiPhate.config as a starting point), and uses it to specify annotation parameters. multiPhate.py then invokes the PhATE pipeline for each genome. See below for the types of annotations that PhATE performs. If the user specifies two or more genomes, multiPhATE also runs the CompareGeneProfiles code to identify corresponding genes among the genomes.
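The control flow described above can be pictured with a small sketch. This is only an illustration of the ordering, not the actual multiPhate.py code; the function name plan_run and the step labels are hypothetical.

```python
def plan_run(genomes):
    """Return the ordered (step, target) pairs for a list of genome names.

    Hypothetical helper: it mirrors the description above, not the real driver.
    """
    # One PhATE annotation run per genome.
    steps = [("phate", genome) for genome in genomes]
    # Genome comparison only makes sense with two or more genomes.
    if len(genomes) >= 2:
        steps.append(("compare_gene_profiles", tuple(genomes)))
    return steps


single = plan_run(["genome1"])
pair = plan_run(["genome1", "genome2"])
print(single)
print(pair)
```

With a single genome the plan contains only the annotation step; the comparison step appears once a second genome is listed.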

ABOUT THE PHATE PIPELINE

PhATE is a fully automated computational pipeline for identifying and annotating phage genes in genome sequence. PhATE is written in Python 3.7 and runs on Linux and Mac operating systems. Code execution is controlled by a configuration file, which can be tailored to run specific gene finders and to blast sequences against specific phage- and virus-centric data sets, in addition to more generic (genome, protein) data sets. See below for the specific databases that are accommodated. PhATE runs at least one gene-finding algorithm, then annotates the genome, gene, and protein sequences using nucleotide and protein blast flavors against a set of fasta sequence databases, and runs hmm searches (phmmer, jackhmmer) against these same fasta databases. It also runs hmmscan against the pVOG and VOG hmm profile databases. If more than one gene finder is run, PhATE provides a side-by-side comparison of the genes called by each gene caller. The user specifies the preferred gene caller, and the genes and proteins predicted by that caller are annotated using blast against the supporting databases (alternatively, the user may specify one of the comparison gene sets, superset, consensus, or commoncore, for functional annotation). Classification of each protein sequence into a pVOG or VOG group is followed by generation of an alignment-ready fasta file. By convention, genome sequence files end with the extension ".fasta", gene nucleotide fasta files end with ".fnt", and cds amino-acid fasta files end with ".faa".
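To make the three comparison gene sets concrete, here is a minimal sketch. It assumes that superset means every call made by any gene caller, consensus means calls made by at least two callers, and commoncore means calls made by all callers; those definitions are assumptions for illustration and are not stated in the text above. It also assumes calls match only when start, end, and strand are identical. The function name and the sample coordinates are hypothetical, and this is not PhATE's own comparison code.

```python
from collections import Counter


def compare_gene_calls(calls_by_caller):
    """Split gene calls into (superset, consensus, commoncore).

    calls_by_caller maps a caller name to a set of (start, end, strand) tuples.
    The set definitions used here are assumptions, see the text above.
    """
    counts = Counter()
    for calls in calls_by_caller.values():
        counts.update(set(calls))  # count each call once per caller
    n_callers = len(calls_by_caller)
    superset = set(counts)
    consensus = {gene for gene, n in counts.items() if n >= 2}
    commoncore = {gene for gene, n in counts.items() if n == n_callers}
    return superset, consensus, commoncore


# Made-up coordinates, for illustration only.
calls = {
    "phanotate": {(100, 400, "+"), (500, 900, "-"), (1000, 1300, "+")},
    "prodigal": {(100, 400, "+"), (500, 900, "-")},
    "glimmer": {(100, 400, "+"), (1500, 1800, "+")},
}
superset, consensus, commoncore = compare_gene_calls(calls)
print(len(superset), len(consensus), len(commoncore))  # prints: 4 2 1
```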

ABOUT COMPARE-GENE-PROFILES and the GENOMICS MODULE

CompareGeneProfiles performs binary blast (NxN) of the genes from each genome against the genes from every other genome provided by the user. For each gene, the code then identifies its mutual and non-mutual (singular) best hits against corresponding genes from each of the other genomes, and reports when no corresponding hit is found. For each binary genome-to-genome comparison, hits are ordered with respect to the query (reference, or first) genome. The Genomics module takes as input the binary blast result files from CompareGeneProfiles and computes the genes and proteins that correspond across all the input genomes with respect to the reference genome. Ultimately, homology groups are computed, each comprising a reference gene (or protein) and its corresponding genes, plus its homologs and their corresponding genes. Homology groups are output as fasta files and annotation files.
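The distinction between mutual and singular best hits can be sketched as follows. This is a generic reciprocal-best-hit illustration under simple assumptions (one score per gene pair, higher is better, ties go to the first pair seen); it is not the CompareGeneProfiles implementation, and the function names, gene names, and scores are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its best-scoring subject gene.

    scores maps (query_gene, subject_gene) to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def classify_hits(a_vs_b, b_vs_a, genes_a):
    """Label each gene of genome A as 'mutual', 'singular', or 'no hit'."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    result = {}
    for gene in genes_a:
        hit = a_best.get(gene)
        if hit is None:
            result[gene] = ("no hit", None)
        elif b_best.get(hit) == gene:
            result[gene] = ("mutual", hit)  # each is the other's best hit
        else:
            result[gene] = ("singular", hit)  # best hit is not reciprocated
    return result


# Made-up scores for two genomes, A (reference) and B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50, ("a2", "b1"): 120}
b_vs_a = {("b1", "a1"): 198, ("b2", "a1"): 48}
hits = classify_hits(a_vs_b, b_vs_a, ["a1", "a2", "a3"])
for gene, (kind, partner) in hits.items():
    print(gene, kind, partner)
```

Here a1 and b1 pick each other, so the hit is mutual; a2 also picks b1, but b1 prefers a1, so that hit is singular; a3 has no hit at all.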

Quick Start Guide (TL;DR)

This guide is for the lazy or impatient and is intended to get you up and running with multiPhATE2 quickly. You should still read the detailed instructions below to understand what is happening.

Install multiPhate in your home directory

cd ~
git clone https://github.com/carolzhou/multiPhATE2

Install the other dependencies using conda

conda create -n multiphate2 -c conda-forge -c bioconda -c hcc biopython emboss blast glimmer phanotate prodigal hmmer trnascan-se wget clustalo 
conda activate multiphate2

Format the databases

cd ~/multiPhATE2/Databases/Phantome
makeblastdb -in Phantome_Phage_genes.faa -dbtype prot
cd ~/multiPhATE2

Copy your genomes into PipelineInput

cp ~/genome1.fasta PipelineInput

Make a copy of sample.multiPhate.config and configure it

cp sample.multiPhate.config genome1.multiPhate.config
vi genome1.multiPhate.config

In particular:

i. Edit the genome information in Genome List

ii. Set the following gene callers (these are the ones installed from conda):

phanotate_calls='true'
genemarks_calls='false'
prodigal_calls='true'
glimmer_calls='true'

iii. Enable blastp against the default databases

blastp='true'
pvogs_blast='true'
phantome_blast='true'

iv. Set the paths to the databases

phantome_database_path='$HOME/multiphate2/multiPhATE2/Databases/Phantome'
pvogs_database_path='$HOME/multiphate2/multiPhATE2/Databases/pVOGs'

NOTE: $HOME may not be expanded and so you should probably use the full path here. cd to the directory and use pwd to get the path

v. Optionally set the parallelization options
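As an alternative to cd and pwd, a few lines of Python can print fully expanded paths ready to paste into the configuration file. This is a convenience sketch, not part of multiPhATE; the helper name database_path is hypothetical, and the repository location ~/multiPhATE2 is an assumption based on the clone step above, so adjust it to wherever you cloned the code.

```python
import os


def database_path(repo_dir, name):
    """Return the absolute path of Databases/<name> under repo_dir.

    Expands a leading ~ and any $VARS so nothing is left for the
    pipeline to interpret.
    """
    expanded = os.path.expandvars(os.path.expanduser(repo_dir))
    return os.path.abspath(os.path.join(expanded, "Databases", name))


repo = "~/multiPhATE2"  # assumption: the clone location from the first step
for key, name in (
    ("phantome_database_path", "Phantome"),
    ("pvogs_database_path", "pVOGs"),
):
    print(f"{key}='{database_path(repo, name)}'")
```

The printed lines use the same key names as the configuration entries shown in step iv.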

Run multiphate

python3 multiPhate.py genome1.multiPhate.config
