AnantharamanLab / PropagAtE

Prophage Activity Estimator
GNU General Public License v3.0

PropagAtE

Prophage Activity Estimator

December 2021
Kristopher Kieft
kieft@wisc.edu
Anantharaman Lab
University of Wisconsin-Madison

Current Version

v1.1.0

Citation

If you find PropagAtE useful please consider citing our manuscript in mSystems:
Kieft, K., and Anantharaman, K. (2022). Deciphering active prophages from metagenomes. mSystems, 7 (2), e00084-22.


Table of Contents:

  1. Updates
    • v1.1.0
    • v1.0.0
  2. Program Description
  3. Requirements
    • Program Dependencies
    • Python3 Dependencies
  4. Running PropagAtE
    • Quick Start
    • Testing PropagAtE
  5. Flag Descriptions
    • Required
    • Pick one
    • Common
    • Additional
  6. Output Explanations
  7. Contact

Updates for v1.1.0:

Updates for v1.0.0:


Program Description

PropagAtE (Prophage Activity Estimator) uses the genomic coordinates of integrated prophage sequences and short sequencing reads to estimate whether a given prophage was in the lysogenic (dormant) or lytic (active) stage of infection. Prophages are designated according to a genome/scaffold coordinate file, either generated manually by the user or taken directly from VIBRANT (at least v1.2.1) output. The prophage:host read coverage ratio and the corresponding effect size are used to estimate whether the prophage was actively replicating its genome (significantly more prophage genome copies than host copies). PropagAtE accepts complete genomes or metagenomic scaffolds along with raw Illumina (short) reads, or alternatively pre-aligned data files (SAM or BAM format). Threshold values are customizable, but PropagAtE outputs clear “active” versus “dormant” estimations for the given prophages with associated statistics.
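
The ratio-plus-effect-size idea described above can be illustrated with a minimal sketch. It assumes per-base read depths are already available as two lists, uses Cohen's d as the effect-size measure (an assumption; the description above does not name one), and takes placeholder thresholds that are not PropagAtE's defaults. The function names are hypothetical and this is not PropagAtE's implementation.

```python
from math import copysign, inf, sqrt
from statistics import mean, stdev


def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    diff = mean(a) - mean(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    if pooled_var == 0:
        # No spread in either sample: identical means give 0, otherwise +/- infinity.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / sqrt(pooled_var)


def classify(prophage_cov, host_cov, min_ratio, min_effect):
    """Return (ratio, effect size, label) for one prophage.

    prophage_cov / host_cov: per-base read depths inside and outside the
    prophage coordinates. min_ratio / min_effect: user-chosen thresholds
    (placeholders in this sketch).
    """
    host_mean = mean(host_cov)
    if host_mean == 0:
        raise ValueError("host coverage is zero; the ratio is undefined")
    ratio = mean(prophage_cov) / host_mean
    effect = cohens_d(prophage_cov, host_cov)
    label = "active" if ratio >= min_ratio and effect >= min_effect else "dormant"
    return ratio, effect, label


# Toy depths: the prophage region is covered about four times deeper than the host.
ratio, effect, label = classify([40, 42, 38, 41, 39], [10, 11, 9, 10, 10], 2.0, 0.7)
print(ratio, label)
```

Requiring both a minimum ratio and a minimum effect size means that a high ratio produced by very noisy coverage is not enough on its own to call a prophage active.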

Utility
Cautions

Requirements

System Requirements: PropagAtE has been tested and successfully run on macOS and Linux (including Ubuntu) systems.
Program Dependencies: Python3, Bowtie2, Samtools (see section below)
Python Dependencies: PySam, Numpy, Numba

Program Dependencies: Installation

Please ensure the following programs are installed and in your machine's PATH. Note: most downloads will automatically place these programs in your PATH.

Programs:
  1. Python3 (version >= 3.5)
  2. Bowtie2 (optional)
  3. Samtools
Example Installations
  1. Python3: see Python webpage.
  2. Bowtie2: conda install -c bioconda bowtie2, GitHub or follow instructions in the Bowtie2 manual.
  3. Samtools: GitHub or follow instructions on the Samtools webpage
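
A quick way to confirm that the programs are in your PATH is sketched below. The executable names (python3, samtools, bowtie2) are assumptions about a typical installation and may differ on your machine; the helper name is hypothetical.

```python
import shutil


def missing_programs(names):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumed executable names; adjust them if your installation uses others.
required = ["python3", "samtools"]
optional = ["bowtie2"]  # only needed when PropagAtE has to align reads itself

print("missing required programs:", missing_programs(required))
print("missing optional programs:", missing_programs(optional))
```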

Python3 Dependencies: Installation

The three Python3 packages listed below may not already be installed. The remaining dependencies should already be installed.

Packages
  1. PySam (version >= 0.15.0)
  2. Numpy (version >= 1.17.0)
  3. Numba (version >= 0.50.0)
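
The minimum versions above can be checked with a short script. This is a sketch that assumes Python 3.8 or newer (for importlib.metadata) and that the packages are registered under the distribution names pysam, numpy and numba; the helper names are hypothetical.

```python
import re
from importlib import metadata  # available from Python 3.8

# Minimum versions taken from the list above.
MINIMUM = {"pysam": (0, 15, 0), "numpy": (1, 17, 0), "numba": (0, 50, 0)}


def parse_version(text):
    """Turn '1.17.0' or '0.50.0rc1' into a tuple of at least three integers."""
    parts = []
    for field in text.split("."):
        match = re.match(r"\d+", field)
        if not match:
            break  # stop at the first non-numeric field
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)  # pad so that '2.1' compares like '2.1.0'
    return tuple(parts)


def check(minimum=MINIMUM):
    """Map each package name to 'ok', 'too old' or 'missing'."""
    report = {}
    for name, needed in minimum.items():
        try:
            found = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if found >= needed else "too old"
    return report


print(check())
```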

Other

VIBRANT is not a dependency but is useful for identifying prophages and can be used to easily input prophage coordinates to PropagAtE. Documentation for VIBRANT can be found on GitHub. VIBRANT and PropagAtE were developed by the same author.


Running PropagAtE

PropagAtE is built to run efficiently on metagenomes, individual isolate genomes or genome scaffold fragments. Each prophage per genome/scaffold is considered individually, so results will not vary whether the scaffold is run as part of a metagenome or by itself.

Installation/Download

  1. git clone https://github.com/AnantharamanLab/PropagAtE
    (optional) create a conda environment and activate it before proceeding to pip install
  2. cd PropagAtE
  3. pip install . ← NOTE: don't forget the dot (pip install [dot])

Installing with pip is optional but suggested. Using pip will collect dependencies and add PropagAtE (the Propagate executable) to your system PATH. Without pip, PropagAtE can still be executed directly from the git clone; just ensure executable permissions (chmod +x Propagate/* from within the main PropagAtE directory). Note that a new folder (PropagAtE.egg-info) should appear after installing with pip.

Testing PropagAtE

Test out a small dataset of mixed active and dormant prophages. These examples assume the command is being called from the example_output/active or example_output/dormant folders.

Note: PropagAtE does not write to standard out (command prompt screen) while running or when it finishes (i.e., not verbose). However, PropagAtE will write to standard out in the event that it encounters an error, such as incorrect use of optional arguments, incorrect input file format, missing dependencies or incorrect dependency versions.

Note: The ways to run PropagAtE (i.e., set up flags) are not limited to these test examples.

5) Dormant prophage test: The inputs are scaffold sequences, short reads, and a VIBRANT prophage coordinates file. The reads may be unzipped or in gzip format depending on preference. Here they are gzipped for easier upload/download on GitHub. You may need to specify python3 at the beginning of the command.

  1. cd example_output/dormant
  2. Propagate -f example_sequence.fasta -r sample_forward_reads.fastq.gz sample_reverse_reads.fastq.gz -v VIBRANT_integrated_prophage_coordinates_example.tsv -o PropagAtE_example_results_dormant --clean -t 2

6) Active prophage test: The inputs are a sorted BAM format alignment file and a manually generated prophage coordinates file.

  1. cd example_output/active
  2. Propagate -f AE017333_partial_genome.fasta -b AE017333_partial_genome.sorted.bam -v manual_prophage_coordinates_AE017333.tsv -o PropagAtE_example_results_active

Due to large file sizes the full data (i.e., full alignment and read sets) for the active prophage example could not be uploaded to GitHub. Please see the read set SRR1137233 from Hertel et al. 2015 and the genome AE017333.1 for the full data.
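
If you prefer to launch the dormant test from a script, the command can be wrapped as sketched below. The flags and file names are copied from the dormant test command; the wrapper functions are hypothetical, subprocess.run with capture_output needs Python 3.7 or newer, and whether PropagAtE returns a non-zero exit code on error is an assumption, so anything written to standard out is returned as well.

```python
import subprocess


def dormant_test_command(threads=2):
    """Build the dormant-prophage test command as an argument list."""
    return [
        "Propagate",
        "-f", "example_sequence.fasta",
        "-r", "sample_forward_reads.fastq.gz", "sample_reverse_reads.fastq.gz",
        "-v", "VIBRANT_integrated_prophage_coordinates_example.tsv",
        "-o", "PropagAtE_example_results_dormant",
        "--clean",
        "-t", str(threads),
    ]


def run(cmd, cwd):
    """Run the command in `cwd` and return (exit code, captured standard out)."""
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    # PropagAtE is silent on success, so any standard-out text suggests an error.
    return result.returncode, result.stdout


# Example call (requires PropagAtE to be installed):
# code, out = run(dormant_test_command(), cwd="example_output/dormant")
print(" ".join(dormant_test_command()))
```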

Flag Descriptions

Input Data

Quick Guide

  1. Specify -f and -v
  2. Pick a coverage input (-b, -s, -r, -i, -u)
  3. (optional) Provide -o and -t
  4. (optional) Modify the methods and outputs with additional flags

Required

Both -f and -v are required for every run.

Pick one

For every run, pick one of the following options as input for coverage information. Only one file is given with the exception of -r in which forward and reverse read files are given. PropagAtE only functions on a single sample to identify prophage activity rather than multi-sample coverages.
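
The "required plus pick one" rule can be pictured with a small argparse sketch. This is not PropagAtE's own argument parser; it only mirrors the short flags named in this section, and the help strings are limited to what the test examples show.

```python
import argparse


def build_parser():
    """Sketch of a parser that enforces -f, -v and exactly one coverage input."""
    parser = argparse.ArgumentParser(prog="coverage-input-sketch")
    parser.add_argument("-f", required=True, help="genome/scaffold sequences")
    parser.add_argument("-v", required=True, help="prophage coordinates file")

    # Exactly one of these must be given; argparse rejects zero or several.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-b", help="alignment file (BAM in the active test)")
    group.add_argument("-s", help="coverage input (single file)")
    group.add_argument("-r", nargs=2, help="forward and reverse read files")
    group.add_argument("-i", help="coverage input (single file)")
    group.add_argument("-u", help="coverage input (single file)")
    return parser


args = build_parser().parse_args(
    ["-f", "genome.fasta", "-v", "coords.tsv", "-r", "fwd.fastq.gz", "rev.fastq.gz"]
)
print(args.r)
```

Passing two coverage inputs at once (for example both -b and -s), or none at all, makes this sketch exit with a usage error.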

Common

Here you can specify an output file and number of threads to use. Number of threads will mainly affect the runtime of Bowtie2.

Additional flags

These flags are often not used. However, they can be used to modify the method of coverage calculation or how active versus dormant status is determined.

Output Explanations

PropagAtE will always generate two files: the results tab-separated spreadsheet (.tsv) and a log file (.log). The presence or absence of generated SAM, BAM and Bowtie2 index files will depend on the data inputs and user-set flags.
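
The results spreadsheet can be loaded with Python's standard csv module. The sketch below makes no assumption about column names: it returns whatever header the file contains. The function name is hypothetical and the file path is whatever your run produced.

```python
import csv


def read_results(path):
    """Read a tab-separated results file into (column names, list of row dicts)."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = list(reader)
        return reader.fieldnames, rows


# Example call, with the path replaced by your own results file:
# columns, rows = read_results("path/to/results.tsv")
# print(columns)
# print(len(rows), "prophages reported")
```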

Contact

Please contact Kristopher Kieft (kieft@wisc.edu or GitHub Issues) with any questions, concerns or comments.

Thank you for using PropagAtE!


                                                               ##
                                                             ##  ##
                                                           ##      ##
######   ##  ##     ##     #######   ######    #####       ##      ##
##  ##   ##  ##   ##  ##   ##        ##       ##             ##  ##
######   ######   ######   ##  ###   ######    ###             ##
##       ##  ##   ##  ##   ##   ##   ##           ##           ##
##       ##  ##   ##  ##   #######   ######   #####            ##
                                                            #  ##  #
                                                           # # ## # #
                                                         #            #

Copyright

PropagAtE: Prophage Activity Estimator Copyright (C) 2021 Kristopher Kieft

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.