December 2021
Kristopher Kieft
kieft@wisc.edu
Anantharaman Lab
University of Wisconsin-Madison
v1.1.0
If you find PropagAtE useful please consider citing our manuscript in mSystems:
Kieft, K., and Anantharaman, K. (2022). Deciphering active prophages from metagenomes. mSystems, 7 (2), e00084-22.
Note: -c was changed from 1.75 to 1.65. No new version. No significant effect on results.
PropagAtE (Prophage Activity Estimator) uses genomic coordinates of integrated prophage sequences and short sequencing reads to estimate whether a given prophage was in the lysogenic (dormant) or lytic (active) stage of infection. Prophages are designated according to a genomic/scaffold coordinate file, either manually generated by the user or taken directly from a VIBRANT (at least v1.2.1) output. The prophage:host read coverage ratio and corresponding effect size are used to estimate whether the prophage was actively replicating its genome (significantly more prophage genome copies than host copies). PropagAtE can take complete genomes or metagenomic scaffolds along with raw Illumina (short) reads, or instead take pre-aligned data files (SAM or BAM format). Threshold values are customizable, but PropagAtE outputs clear “active” versus “dormant” estimations of given prophages with associated statistics.
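As a rough illustration of the two statistics mentioned above, the sketch below computes a prophage:host coverage ratio and a Cohen's d effect size from per-base coverage values. It is not PropagAtE's own code. The function names, the pooled-standard-deviation form of Cohen's d, and the 1-based inclusive coordinates are assumptions made for the example.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d with a pooled sample standard deviation (textbook form).

    Needs at least two values per group and some variation in the data.
    """
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def prophage_host_stats(coverage, start, stop):
    """Return (coverage ratio, effect size) for one prophage on one scaffold.

    coverage: per-base read depth for the whole scaffold.
    start, stop: prophage coordinates, assumed 1-based and inclusive.
    """
    prophage = coverage[start - 1:stop]
    host = coverage[:start - 1] + coverage[stop:]
    ratio = mean(prophage) / mean(host)
    return ratio, cohens_d(prophage, host)


# Toy scaffold: 1000 host bases, a 2000 bp prophage at ~3x depth, 1000 host bases.
coverage = [10, 9, 11, 10] * 250 + [30, 28, 32, 30] * 500 + [10, 9, 11, 10] * 250
ratio, d = prophage_host_stats(coverage, 1001, 3000)
print(f"ratio={ratio:.2f} effect_size={d:.2f}")
```

In this toy case both values are far above the kind of cutoffs set with -c and -e, so the prophage would look active under these assumptions.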
System Requirements: PropagAtE has been tested and successfully run on Mac, Linux and Ubuntu systems.
Program Dependencies: Python3, Bowtie2, Samtools (see section below)
Python Dependencies: PySam, Numpy, Numba
Please ensure the following programs are installed and in your machine's PATH. Note: most downloads will automatically place these programs in your PATH.
Bowtie2: conda install -c bioconda bowtie2, GitHub or follow instructions in the Bowtie2 manual.
There are two Python3 dependencies that may not be installed. The remaining dependencies should already be installed.
VIBRANT is not a dependency but is useful for identifying prophages and can be used to easily input prophage coordinates to PropagAtE. Documentation for VIBRANT can be found on GitHub. VIBRANT and PropagAtE were developed by the same author.
PropagAtE is built to run efficiently on metagenomes, individual isolate genomes or genome scaffold fragments. Each prophage per genome/scaffold is considered individually, so results will not vary whether the scaffold is run as part of a metagenome or by itself.
git clone https://github.com/AnantharamanLab/PropagAtE
cd PropagAtE
pip install .
NOTE: don't forget the dot (pip install [dot]).
Installing with pip is optional but suggested. Using pip will collect dependencies and add PropagAtE (the Propagate executable) to your system PATH. Without pip, PropagAtE can still be executed directly from the git clone; just ensure executable permissions (chmod +x Propagate/* from within the main PropagAtE directory). Note that a new folder (PropagAtE.egg-info) should appear after installing with pip.
Test out a small dataset of mixed active and dormant prophages. These examples assume the command is being called from the example_output/active or example_output/dormant folders.
Note: PropagAtE does not write to standard out (command prompt screen) while running or when it finishes (i.e., not verbose). However, PropagAtE will write to standard out in the event that it encounters an error, such as incorrect use of optional arguments, incorrect input file format, missing dependencies or incorrect dependency versions.
Note: The ways to run PropagAtE (i.e., set up flags) are not limited to these test examples.
5) Dormant prophage test: The inputs are scaffold sequences, short reads, and a VIBRANT prophage coordinates file. The reads may be unzipped or in gzip format depending on preference. Here they are gzipped for easier upload/download on GitHub. You may need to specify python3 at the beginning of the command.
cd example_output/dormant
Propagate -f example_sequence.fasta -r sample_forward_reads.fastq.gz sample_reverse_reads.fastq.gz -v VIBRANT_integrated_prophage_coordinates_example.tsv -o PropagAtE_example_results_dormant --clean -t 2
6) Active prophage test: The inputs are a sorted BAM format alignment file and a manually generated prophage coordinates file.
cd example_output/active
Propagate -f AE017333_partial_genome.fasta -b AE017333_partial_genome.sorted.bam -v manual_prophage_coordinates_AE017333.tsv -o PropagAtE_example_results_active
Due to large file sizes the full data (i.e., full alignment and read sets) for the active prophage example could not be uploaded to GitHub. Please see the read set SRR1137233 from Hertel et al. 2015 and the genome AE017333.1 for the full data.
Quick Guide
Required: -f and -v
Coverage input, pick one: (-b, -s, -r, -i, -u)
Output and threads: -o and -t
Both -f and -v are required for every run.
-f: input genomes/scaffolds (fasta file). This file should contain sequences that include prophage and host regions, not strictly prophages themselves. The definition lines cannot have special characters, namely quotations, pipe symbol or commas. PropagAtE requires at a minimum 1000bp of prophage to run analyses. Only the scaffolds indicated by -v will be considered, but this -f file can contain extra sequences that will be ignored.
-v: prophage coordinates input in the format of either (1) VIBRANT results coordinate file or (2) manually generated coordinate file. See the next two bullet points.
(1) VIBRANT coordinate file: this file is named with the prefix VIBRANT_integrated_prophage_coordinates and can be found in the VIBRANT_results output folder. No modification needs to be done for this file to be used as an input for PropagAtE. The columns used are scaffold, fragment, nucleotide start and nucleotide stop.
(2) Manually generated coordinate file: the columns are scaffold, fragment, start and stop. Note that the terms fragment and prophage are interchangeable in this format.
scaffold is the name of the entire host sequence that contains the prophage(s). Example: scaffold_999
fragment is the name of the prophage fragment. Example: scaffold_999_fragment_1 or prophage_
start is the nucleotide number where the prophage starts. Example: 2500
stop is the nucleotide number where the prophage stops. Example: 58000
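A manual coordinate file can be prepared by hand or with a few lines of code. The sketch below writes the four columns described above and checks two constraints stated in this README: no special characters in scaffold names and at least 1000 bp of prophage. The helper names are hypothetical. The tab-separated layout with a header row, the inclusive coordinates, and treating both quote styles as "quotations" are assumptions; compare against the example file manual_prophage_coordinates_AE017333.tsv before relying on it.

```python
import csv
import io

# Characters the README says cannot appear in FASTA definition lines.
# Treating both quote styles as "quotations" is an assumption.
FORBIDDEN_CHARS = set("\"'|,")
MIN_PROPHAGE_BP = 1000  # minimum prophage length stated in the README


def find_problems(rows):
    """Return readable problems for (scaffold, fragment, start, stop) rows."""
    problems = []
    for scaffold, fragment, start, stop in rows:
        if FORBIDDEN_CHARS & set(scaffold):
            problems.append(f"{scaffold}: special character in scaffold name")
        # Assumes start and stop are 1-based and inclusive.
        if stop - start + 1 < MIN_PROPHAGE_BP:
            problems.append(f"{fragment}: shorter than {MIN_PROPHAGE_BP} bp")
    return problems


def write_coordinates(rows, handle):
    """Write rows as tab-separated text with the column names as a header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["scaffold", "fragment", "start", "stop"])
    writer.writerows(rows)


# Values taken from the examples in this README.
rows = [("scaffold_999", "scaffold_999_fragment_1", 2500, 58000)]
assert not find_problems(rows)

buffer = io.StringIO()  # swap for open("my_coordinates.tsv", "w", newline="")
write_coordinates(rows, buffer)
print(buffer.getvalue(), end="")
```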
For every run, pick one of the following options as input for coverage information. Only one file is given, with the exception of -r, in which forward and reverse read files are given. PropagAtE only functions on a single sample to identify prophage activity rather than multi-sample coverages.
-b: input BAM sequence alignment file (sorted or unsorted). This will be sorted (if necessary) and indexed (if necessary). This format is used for analysis.
-s: input SAM sequence alignment file. This will be directly converted to BAM format for analysis.
-r: input paired short reads separated by a space. Example: -r forward_1.fastq reverse_2.fastq.
-i: input interleaved paired short reads.
-u: input unpaired short reads.
NOTE: For reads input (-r, -i, -u) Bowtie2 (--no-discordant with -r, -i) will be used to generate a SAM file, which is then converted to BAM format. Reads can be in gzip format.
Here you can specify an output file and number of threads to use. Number of threads will mainly affect the runtime of Bowtie2.
-o: name of an output folder to deposit results. If not specified, the default is 'PropagAtE_results' followed by the basename of -v.
-t: number of threads to use for Bowtie2 mapping as well as Samtools converting/sorting/indexing.
These flags are often not used. However, they can be used to modify the method of coverage calculation or how active versus dormant is considered.
-p: minimum percent identity per aligned read for calculating coverage. The default is 0.97 (97%). This option pertains to any coverage input (reads or SAM/BAM). Lowering this value will make the alignment filtering less strict.
-e: minimum effect size for significance by Cohen's d test. The default is 0.70 and the minimum is 0.60. Values greater than 0.70 will represent a more significant difference in a prophage:host coverage ratio. Setting values below 0.75 may introduce false identifications (i.e., dormant prophages identified as active) whereas setting the value too high (e.g., 1.5) may reduce identification of active prophages.
-c: minimum prophage:host coverage ratio for significance. The default is 2.0 and the minimum is 1.5. Setting values below 2.0 may introduce false identifications (i.e., dormant prophages identified as active) whereas setting the value too high (e.g., 10) may reduce identification of active prophages.
--mask: mask coverage values
--min: minimum average coverage to consider a prophage present and for --breadth. The default is 1.0x coverage. Prophage with an average coverage below this threshold will not be active regardless of host coverage. See --breadth for another function of this flag.
--breadth: minimum breadth of coverage as fraction of bases >= minimum coverage (--min). The default is 0.50 (50%). Example: with --min 1.0 --breadth 0.50, if > 50% of a prophage genome has a coverage < 1.0, regardless of the average coverage, it will not be considered as active. This helps to ensure uneven and misleading alignments do not result in false positives.
--clean: use this setting to remove any generated SAM, unsorted BAM and/or Bowtie2 index files. All user input data files (regardless of format) and sorted BAM files will always be retained. SAM/BAM files can be very large, and Bowtie2 index files are typically temporary. Off by default.
PropagAtE will always generate two files: the results tab-separated spreadsheet (.tsv) and a log file (.log). The presence or absence of generated SAM, BAM and Bowtie2 index files will depend on the data inputs and user set flags.
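To make the interaction between --min and --breadth described in the options above more concrete, here is a small sketch of that check applied to per-base prophage coverage. It only mirrors the rule as worded in this README and is not PropagAtE's implementation. The function name and the plain-list input are assumptions for the example.

```python
def passes_min_and_breadth(coverage, min_cov=1.0, min_breadth=0.50):
    """Check average coverage and breadth of coverage for one prophage.

    coverage: per-base read depth across the prophage region.
    min_cov: stands in for --min (default 1.0).
    min_breadth: stands in for --breadth (default 0.50).
    """
    if not coverage:
        return False
    average = sum(coverage) / len(coverage)
    # Breadth: fraction of bases with coverage >= min_cov.
    breadth = sum(1 for depth in coverage if depth >= min_cov) / len(coverage)
    return average >= min_cov and breadth >= min_breadth


# Uneven alignment: 40% of bases at 50x, 60% with no coverage.
# The average (20x) is high, but more than half of the bases are below 1.0x.
uneven = [50.0] * 400 + [0.0] * 600
# Even alignment: every base at 5x.
even = [5.0] * 1000

print(passes_min_and_breadth(uneven))
print(passes_min_and_breadth(even))
```

The uneven case fails on breadth alone, which is the situation the --breadth example describes: a high average driven by a small, deeply covered region.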
Log file: At the top it will contain information about the input command as well as run info (date, time, version). The next section includes an overview of processes that were run, the time post-start and general info for some of the processes. For example, this will include the number of hosts and prophages detected. Finally, the number of prophages identified as active will be listed. The log file may contain error messages when applicable.
Results file: The results spreadsheet contains the finalized active versus dormant results as well as all relevant metrics and statistics. The following are column names and explanations of the results file:
-e and -c cutoffs but not the --min or --breadth cutoffs, generally considering it as "not active" but lacking evidence to call it "dormant". "not present" indicates the prophage had no coverage.
Please contact Kristopher Kieft (kieft@wisc.edu or GitHub Issues) with any questions, concerns or comments.
Thank you for using PropagAtE!
PropagAtE: Prophage Activity Estimator Copyright (C) 2021 Kristopher Kieft
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.