gbouras13 / pharokka

fast phage annotation program
MIT License

pharokka

Extra special thanks to Ghais Houtak for making Pharokka's logo.

Fast Phage Annotation Tool

pharokka is a rapid standardised annotation tool for bacteriophage genomes and metagenomes.

If you are looking for rapid standardised annotation of bacterial genomes, please use Bakta. Prokka, which inspired the creation and naming of pharokka, is another good option, but Bakta is Prokka's worthy successor.

phold

If you like pharokka, you will probably love phold. phold uses structural homology to improve phage annotation. Benchmarking is ongoing but phold strongly outperforms pharokka in terms of annotation, particularly for less characterised phages such as those from metagenomic datasets.

pharokka still has features phold lacks for now (identifying tRNA, tmRNA, CRISPR repeats, and INPHARED taxonomy search), so it is recommended to run phold after running pharokka.

phold takes the Genbank output of Pharokka as input. Therefore, if you have already annotated your phage(s) with Pharokka, you can easily update the annotation with more functional predictions with phold.

Google Colab Notebooks

If you don't want to install pharokka or phold locally, you can run pharokka and phold (and phynteny), or only pharokka, without any code using the Google Colab notebook.

Quick Start

The easiest way to install pharokka is via conda:

conda install -c bioconda pharokka

Followed by database download and installation:

install_databases.py -o <path/to/database_dir>

And finally annotation:

pharokka.py -i <phage fasta file> -o <output directory> -d <path/to/database_dir> -t <threads>

As of pharokka v1.4.0, if you want extremely rapid PHROG annotations, use --fast:

pharokka.py -i <phage fasta file> -o <output directory> -d <path/to/database_dir> -t <threads> --fast
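
The quick-start steps above can be chained in a small wrapper. This is only a sketch, assuming bash: the function names `run` and `annotate_phage` and the `DRY_RUN` variable are hypothetical and not part of pharokka, and treating an existing database directory as "already installed" is a simplifying assumption. Only the `-i`, `-o`, `-d` and `-t` flags shown above are used.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the quick-start commands; not part of pharokka.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

annotate_phage() {
  local fasta="$1" outdir="$2" db_dir="$3" threads="${4:-1}"
  # Assumption: an existing directory means the databases were already installed.
  if [ ! -d "$db_dir" ]; then
    run install_databases.py -o "$db_dir" || return 1
  fi
  run pharokka.py -i "$fasta" -o "$outdir" -d "$db_dir" -t "$threads"
}

# Example (prints the commands only):
#   DRY_RUN=1 annotate_phage phage.fasta phage_out pharokka_db 8
```

Running with `DRY_RUN=1` first is a cheap way to check the paths before starting the database download.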

Documentation

Check out the full documentation at https://pharokka.readthedocs.io.

Paper

pharokka has been published in Bioinformatics:

George Bouras, Roshan Nepal, Ghais Houtak, Alkis James Psaltis, Peter-John Wormald, Sarah Vreugde, Pharokka: a fast scalable bacteriophage annotation tool, Bioinformatics, Volume 39, Issue 1, January 2023, btac776, https://doi.org/10.1093/bioinformatics/btac776.

If you use pharokka, please see the full Citation section for a list of all programs pharokka uses, in order to fully recognise the creators of these tools for their work.

Pharokka with Galaxy Europe Webserver

Thanks to some amazing assistance from Paul Zierep, you can run pharokka using the Galaxy Europe webserver. There is no plotting functionality at the moment.

So if you can't get pharokka to install on your machine for whatever reason or want a GUI to annotate your phage(s), please give it a go there.

Brief Overview

pharokka uses PHANOTATE, the only gene prediction program tailored to bacteriophages, as the default program for gene prediction. Prodigal (implemented with pyrodigal) and Prodigal-gv (implemented with pyrodigal-gv) are also available as alternatives. Following this, functional annotations are assigned by matching each predicted coding sequence (CDS) to the PHROGs, CARD and VFDB databases using MMseqs2. As of v1.4.0, pharokka also matches each CDS to the PHROGs database using more sensitive Hidden Markov Models with PyHMMER. pharokka's main output is a GFF file suitable for use in downstream pangenomic pipelines like Roary. pharokka also generates a cds_functions.tsv file, which includes counts of CDSs, tRNAs, tmRNAs, CRISPRs and functions assigned to CDSs according to the PHROGs database. See the full usage and check out the full documentation for more details.

Pharokka v 1.7.0 Update

You can run pharokka_multiplotter.py to plot as many phage(s) as you want.

It requires the pharokka output Genbank file (here, pharokka.gbk). It will save plots for each contig in the output directory (here pharokka_plots_output_directory).

For example:

pharokka_multiplotter.py -g pharokka.gbk -o pharokka_plots_output_directory

Pharokka v 1.6.0 Update (11 January 2024)

Pharokka v 1.5.0 Update (20 September 2023)

Pharokka v 1.4.0 Update (27 August 2023)

pharokka v1.4.0 is a large update implementing:

Pharokka v 1.3.0 Update

pharokka v1.3.0 implements pharokka_plotter.py, which creates a simple circular genome plot using the amazing pyCirclize package with output in PNG format. All CDS are coloured according to their PHROG functional group.

It is reasonably customisable and is designed for single input phage contigs. If an input FASTA with multiple contigs is entered, it will only plot the first contig.

It requires the input FASTA, pharokka output directory, and the -p or --prefix value used with pharokka if specified.

You can run pharokka_plotter.py in the following form:

pharokka_plotter.py -i input.fasta -n pharokka_plot -o pharokka_output_directory

This will create pharokka_plot.png as an output file plot of your phage.

An example plot is included below made with the following command (assuming Pharokka has been run with SAOMS1_pharokka_output_directory as the output directory).

pharokka_plotter.py -i test_data/SAOMS1.fasta -n SAOMS1_plot -o SAOMS1_pharokka_output_directory --interval 8000 --annotations 0.5 --plot_title '${Staphylococcus}$ Phage SAOMS1'

SAOMS1 example

SAOMS1 phage (GenBank: MW460250.1) was isolated and sequenced by: Yerushalmy, O., Alkalay-Oren, S., Coppenhagen-Glazer, S. and Hazan, R. from the Institute of Dental Sciences and School of Dental Medicine, Hebrew University, Israel.

Please see plotting for details on all plotting parameter options.

Installation

Conda Installation

The easiest way to install pharokka is via conda. For inexperienced command line users, this method is highly recommended.

conda install -c bioconda pharokka

This will install all the dependencies along with pharokka. The dependencies are listed in environment.yml.

If conda is taking a long time to solve the environment, try using mamba:

conda install mamba
mamba install -c bioconda pharokka

Pip

As of v1.4.0, you can also install the Python components of pharokka with pip.

pip install pharokka

You will still need to install the non-Python dependencies manually.

Container

If you have Docker/Singularity/Apptainer installed, you can use the biocontainers container (yes, every bioconda package has one!).

You might find this useful if you have trouble with conda environments.

For example, to install pharokka v1.7.3 with Singularity:

IMAGE_DIR="<the directory you want the .sif file to be in >"
# e.g. to pull into the working directory
IMAGE_DIR=$PWD
singularity pull --dir $IMAGE_DIR docker://quay.io/biocontainers/pharokka:1.7.3--pyhdfd78af_0
containerImage="$IMAGE_DIR/pharokka_1.7.3--pyhdfd78af_0.sif"
singularity exec $containerImage pharokka.py -h

Source

Alternatively, the development version of pharokka (which may include new, untested features) can be installed manually from GitHub.

git clone https://github.com/gbouras13/pharokka.git
cd pharokka
pip install -e .
pharokka.py --help

The dependencies found in environment.yml will then need to be installed manually.

For example, using conda to install the required dependencies:

conda env create -f environment.yml
conda activate pharokka_env
# assuming you are in the pharokka directory 
# installs pharokka from source
pip install -e .
pharokka.py --help

Database Installation

To install the pharokka database to the default directory:

install_databases.py -d

If you would like to specify a different database directory (recommended), that can be achieved as follows:

install_databases.py -o <path/to/database_dir>

If this does not work, you can alternatively download the databases from Zenodo at https://zenodo.org/record/8276347/files/pharokka_v1.4.0_databases.tar.gz and untar the directory in a location of your choice.

If you prefer to use the command line:

wget "https://zenodo.org/record/8276347/files/pharokka_v1.4.0_databases.tar.gz"
tar -xzf pharokka_v1.4.0_databases.tar.gz

which will create a directory called "pharokka_v1.4.0_databases" containing the databases.

Beginner Conda Installation

If you are new to using the command-line, please install conda using the following instructions.

  1. Install Anaconda. I would recommend miniconda.
  2. Download the miniconda installer. The command below assumes a Linux x86_64 machine (for other architectures, please replace the URL with the appropriate one on the miniconda website).

curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

For Mac (Intel, will also work with M1):

curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh

  3. Install miniconda and follow the prompts.

sh Miniconda3-latest-Linux-x86_64.sh

  4. After installation is complete, you should add the following channels to your conda configuration:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
  5. After this, conda should be installed (you may need to restart your terminal). It is recommended that mamba is also installed, as it will solve the environment more quickly than conda:

conda install mamba

  6. Finally, I would recommend installing pharokka into a fresh environment. For example, to create an environment called pharokkaENV with pharokka installed:
mamba create -n pharokkaENV pharokka
conda activate pharokkaENV
install_databases.py -h
pharokka.py -h

Usage

Once the databases have finished downloading, to run pharokka:

pharokka.py -i <fasta file> -o <output directory> -t <threads>

To specify a different database directory (recommended):

pharokka.py -i <fasta file> -o <output directory> -d <path/to/database_dir> -t <threads> -p <prefix>

For a full explanation of all arguments, please see usage.

pharokka defaults to 1 thread.

usage: pharokka.py [-h] [-i INFILE] [-o OUTDIR] [-d DATABASE] [-t THREADS] [-f] [-p PREFIX] [-l LOCUSTAG] [-g GENE_PREDICTOR] [-m] [-s]
                   [-c CODING_TABLE] [-e EVALUE] [--fast] [--mmseqs2_only] [--meta_hmm] [--dnaapler] [--custom_hmm CUSTOM_HMM] [--genbank]
                   [--terminase] [--terminase_strand TERMINASE_STRAND] [--terminase_start TERMINASE_START] [--skip_extra_annotations]
                   [--skip_mash] [--minced_args MINCED_ARGS] [--mash_distance MASH_DISTANCE] [-V] [--citation]

pharokka: fast phage annotation program

options:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        Input genome file in fasta format.
  -o OUTDIR, --outdir OUTDIR
                        Directory to write the output to.
  -d DATABASE, --database DATABASE
                        Database directory. If the databases have been installed in the default directory, this is not required. Otherwise specify the path.
  -t THREADS, --threads THREADS
                        Number of threads. Defaults to 1.
  -f, --force           Overwrites the output directory.
  -p PREFIX, --prefix PREFIX
                        Prefix for output files. This is not required.
  -l LOCUSTAG, --locustag LOCUSTAG
                        User specified locus tag for the gff/gbk files. This is not required. A random locus tag will be generated instead.
  -g GENE_PREDICTOR, --gene_predictor GENE_PREDICTOR
                        User specified gene predictor. Use "-g phanotate" or "-g prodigal" or "-g prodigal-gv" or "-g genbank". 
                        Defaults to phanotate (not required unless prodigal is desired).
  -m, --meta            meta mode for metavirome input samples
  -s, --split           split mode for metavirome samples. -m must also be specified. 
                        Will output separate split FASTA, gff and genbank files for each input contig.
  -c CODING_TABLE, --coding_table CODING_TABLE
                        translation table for prodigal. Defaults to 11.
  -e EVALUE, --evalue EVALUE
                        E-value threshold for MMseqs2 database PHROGs, VFDB and CARD and PyHMMER PHROGs database search. Defaults to 1E-05.
  --fast, --hmm_only    Runs PyHMMER (HMMs) with PHROGs only, not MMseqs2 with PHROGs, CARD or VFDB. 
                        Designed for phage isolates, will not likely be faster for large metagenomes.
  --mmseqs2_only        Runs MMseqs2 with PHROGs, CARD and VFDB only (same as Pharokka v1.3.2 and prior). Default in meta mode.
  --meta_hmm            Overrides --mmseqs2_only in meta mode. Will run both MMseqs2 and PyHMMER.
  --dnaapler            Runs dnaapler to automatically re-orient all contigs to begin with terminase large subunit if found. 
                        Recommended over using '--terminase'.
  --custom_hmm CUSTOM_HMM
                        Run pharokka with a custom HMM profile database suffixed .h3m. 
                        Please use create this with the create_custom_hmm.py script.
  --genbank             Flag denoting that -i/--input is a genbank file instead of the usual FASTA file. 
                         The CDS calls in this file will be preserved and re-annotated.
  --terminase           Runs terminase large subunit re-orientation mode. 
                        Single genome input only and requires --terminase_strand and --terminase_start to be specified.
  --terminase_strand TERMINASE_STRAND
                        Strand of terminase large subunit. Must be "pos" or "neg".
  --terminase_start TERMINASE_START
                        Start coordinate of the terminase large subunit.
  --skip_extra_annotations
                        Skips tRNAscan-se, MINced and Aragorn.
  --skip_mash           Skips running mash to find the closest match for each contig in INPHARED.
  --minced_args MINCED_ARGS
                        extra commands to pass to MINced (please omit the leading hyphen for the first argument). You will need to use quotation marks e.g. --minced_args "minNR 2 -minRL 21"
  --mash_distance MASH_DISTANCE
                        mash distance for the search against INPHARED. Defaults to 0.2.
  -V, --version         Print pharokka Version
  --citation            Print pharokka Citation
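
To annotate several phage isolates with the options listed above, one approach is to loop over the input files and derive the `-p` prefix from each file name. The following is a minimal sketch, assuming bash and inputs ending in `.fasta`; the function names `prefix_from_fasta` and `batch_annotate` and the `PHAROKKA_CMD` variable are hypothetical helpers, not part of pharokka.

```shell
#!/usr/bin/env bash
# Hypothetical batch helper; only the pharokka.py flags come from the help text.
# PHAROKKA_CMD defaults to pharokka.py. Set PHAROKKA_CMD='echo pharokka.py'
# to print the commands instead of running them.

# Derive an output prefix from a FASTA path: drop the directory and the last extension.
prefix_from_fasta() {
  local name
  name="$(basename "$1")"
  printf '%s\n' "${name%.*}"
}

# Run pharokka once per *.fasta file, writing each result to its own directory.
batch_annotate() {
  local fasta_dir="$1" out_root="$2" db_dir="$3" threads="${4:-1}"
  local fasta prefix
  mkdir -p "$out_root"
  for fasta in "$fasta_dir"/*.fasta; do
    [ -e "$fasta" ] || continue   # skip when no file matches the pattern
    prefix="$(prefix_from_fasta "$fasta")"
    # PHAROKKA_CMD is left unquoted on purpose so 'echo pharokka.py' splits into two words.
    ${PHAROKKA_CMD:-pharokka.py} -i "$fasta" -o "$out_root/$prefix" \
      -d "$db_dir" -t "$threads" -p "$prefix" \
      || echo "pharokka failed for $fasta" >&2
  done
}

# Example (prints the commands only):
#   PHAROKKA_CMD='echo pharokka.py' batch_annotate genomes results pharokka_db 8
```

Each genome gets its own output directory, so a failure on one input does not affect the others.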

Version Log

A brief description of what is new in each update of pharokka can be found in the HISTORY.md file.

System

pharokka has been tested on Linux and macOS (M1 and Intel).

Time

On a standard 16GB RAM laptop specifying 8 threads, pharokka should take between 3 and 10 minutes to run for a single phage, depending on the genome size.

In --fast mode, it should take 45-75 seconds.

Benchmarking v1.5.0

pharokka v1.5.0 was run on the 673 crAss phage dataset to showcase the improved CDS prediction of -g prodigal-gv for metagenomic datasets where some phages likely have alternative genetic codes (i.e. not 11).

All benchmarking was conducted on an Intel® Core™ i7-10700K CPU @ 3.80GHz on a machine running Ubuntu 20.04.6 LTS with 8 threads (-t 8). pyrodigal-gv v0.1.0 and pyrodigal v3.0.0 were used respectively.

| 673 crAss-like genomes | pharokka v1.5.0 -g prodigal-gv | pharokka v1.5.0 -g prodigal |
| --- | --- | --- |
| Total CDS | 81730 | 91999 |
| Annotated Function CDS | 20344 | 17458 |
| Unknown Function CDS | 61386 | 74541 |
| Contigs with genetic code 15 | 229 | NA |
| Contigs with genetic code 4 | 38 | NA |
| Contigs with genetic code 11 | 406 | 673 |

Fewer (larger) CDS were predicted more accurately, leading to an increase in the number of coding sequences with annotated functions. Approximately 40% of contigs in this dataset were predicted to use non-standard genetic codes according to pyrodigal-gv.
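
The summary figures can be recomputed from the v1.5.0 table. The sketch below assumes a POSIX shell with awk available; the helper name `pct` is hypothetical, and the numbers are copied from the table.

```shell
# pct NUMERATOR DENOMINATOR -> percentage with one decimal place
pct() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", 100 * n / d }'
}

# Contigs assigned genetic code 15 or 4 by pyrodigal-gv, out of 673 contigs
pct $((229 + 38)) 673

# Share of CDS with an annotated function, -g prodigal-gv
pct 20344 81730

# Share of CDS with an annotated function, -g prodigal
pct 17458 91999
```

This prints 39.7, 24.9 and 19.0: about 40% of contigs use genetic code 15 or 4, and the share of CDS with an annotated function rises from 19.0% with -g prodigal to 24.9% with -g prodigal-gv.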

Benchmarking v1.4.0

pharokka v1.4.0 has also been run on phage SAOMS1 and on the same 673 crAss phage dataset to showcase:

  1. The improved sensitivity of gene annotation with PyHMMER and a demonstration of how --fast is slower for metagenomes.
    • If you can deal with the compute cost (especially for large metagenomes), I highly recommend --fast or --meta_hmm for metagenomes given how much more sensitive HMM search is.
  2. The large speed-up over v1.3.2 with --fast for phage isolates - with the proviso that no virulence factors or AMR genes will be detected.
  3. The slight speed-up over v1.3.2 with --mmseqs2_only.

All benchmarking was conducted on an Intel® Core™ i7-10700K CPU @ 3.80GHz on a machine running Ubuntu 20.04.6 LTS with 16 threads (-t 16).

SAOMS1 was run with Phanotate.

| Phage SAOMS1 | pharokka v1.4.0 --fast | pharokka v1.4.0 | pharokka v1.3.2 |
| --- | --- | --- | --- |
| Time (min) | 0.70 | 3.73 | 5.08 |
| CDS | 246 | 246 | 246 |
| Annotated Function CDS | 93 | 93 | 92 |
| Unknown Function CDS | 153 | 153 | 154 |

The 673 crAss-like genomes were run with -m (defaults to --mmseqs2_only in v 1.4.0) and with -g prodigal (pyrodigal v2.1.0).

| 673 crAss-like genomes | pharokka v1.4.0 --fast | pharokka v1.4.0 --mmseqs2_only | pharokka v1.3.2 |
| --- | --- | --- | --- |
| Time (min) | 35.62 | 11.05 | 13.27 |
| CDS | 91999 | 91999 | 91999 |
| Annotated Function CDS | 16713 | 9150 | 9150 |
| Unknown Function CDS | 75286 | 82849 | 82849 |
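
The three points listed at the start of this section can be expressed as ratios of the v1.4.0 benchmark values. A minimal sketch, assuming a POSIX shell with awk; the helper name `ratio` is hypothetical and the inputs are the timings and CDS counts reported for SAOMS1 and the crAss-like genomes.

```shell
# ratio A B -> A divided by B, two decimal places
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# SAOMS1 isolate: v1.3.2 runtime relative to v1.4.0 --fast
ratio 5.08 0.70

# crAss-like genomes: v1.3.2 runtime relative to v1.4.0 --mmseqs2_only
ratio 13.27 11.05

# crAss-like genomes: --fast runtime relative to --mmseqs2_only
ratio 35.62 11.05

# crAss-like genomes: annotated CDS with --fast relative to --mmseqs2_only
ratio 16713 9150
```

This prints 7.26, 1.20, 3.22 and 1.83. On the isolate, --fast is about 7 times faster than v1.3.2, while --mmseqs2_only is about 1.2 times faster than v1.3.2 on the metagenome. On the metagenome, --fast takes about 3.2 times longer than --mmseqs2_only and assigns a function to about 1.8 times as many CDS.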

Original Benchmarking (v1.1.0)

pharokka (v1.1.0) has been benchmarked on an Intel Xeon CPU E5-4610 v2 @ 2.30 GHz specifying 16 threads. Below is benchmarking comparing pharokka run with PHANOTATE and Prodigal against Prokka v1.14.6 run with PHROGs HMM profiles, as modified by Andrew Millard (https://millardlab.org/2021/11/21/phage-annotation-with-phrogs/).

Benchmarking was conducted on Enterobacteria Phage Lambda (Genbank accession J02459), Staphylococcus Phage SAOMS1 (Genbank accession MW460250) and 673 crAss-like phage genomes in one multiFASTA input taken from Yutin, N., Benler, S., Shmakov, S.A. et al. Analysis of metagenome-assembled viral genomes from the human gut reveals diverse putative CrAss-like phages with unique genomic features. Nat Commun 12, 1044 (2021) https://doi.org/10.1038/s41467-021-21350-w.

For the crAss-like phage genomes, pharokka meta mode -m was enabled.

| Phage Lambda | pharokka PHANOTATE | pharokka Prodigal | Prokka with PHROGs |
| --- | --- | --- | --- |
| Time (min) | 4.19 | 3.88 | 0.27 |
| CDS | 88 | 61 | 62 |
| Coding Density (%) | 94.55 | 83.69 | 84.96 |
| Annotated Function CDS | 43 | 37 | 45 |
| Unknown Function CDS | 45 | 24 | 17 |

| Phage SAOMS1 | pharokka PHANOTATE | pharokka Prodigal | Prokka with PHROGs |
| --- | --- | --- | --- |
| Time (min) | 4.26 | 3.89 | 0.93 |
| CDS | 246 | 212 | 212 |
| Coding Density (%) | 92.27 | 89.69 | 89.31 |
| Annotated Function CDS | 92 | 93 | 92 |
| Unknown Function CDS | 154 | 119 | 120 |

| 673 crAss-like genomes from Yutin et al., 2021 | pharokka PHANOTATE Meta Mode | pharokka Prodigal Meta Mode | Prokka with PHROGs |
| --- | --- | --- | --- |
| Time (min) | 106.55 | 11.88 | 252.33 |
| Time Gene Prediction (min) | 96.21 | 3.4 | 5.12 |
| Time tRNA Prediction (min) | 1.25 | 1.08 | 0.3 |
| Time Database Searches (min) | 6.75 | 5.58 | 238.77 |
| CDS | 138628 | 90497 | 89802 |
| Contig Min Coding Density (%) | 66.01 | 46.18 | 46.13 |
| Contig Max Coding Density (%) | 98.86 | 97.85 | 97.07 |
| Annotated Function CDS | 9341 | 9228 | 14461 |
| Unknown Function CDS | 129287 | 81269 | 75341 |

pharokka scales well for large metavirome datasets due to the speed of MMseqs2. As the size of the input file increases, most of the extra time is spent on gene prediction (particularly PHANOTATE) and tRNAscan-SE 2. The time taken to conduct MMseqs2 searches remains small due to its many vs many approach.

If you require fast annotations of extremely large datasets (i.e. thousands of input contigs), running pharokka with Prodigal (-g prodigal) is recommended.

Bugs and Suggestions

If you come across bugs with pharokka, or would like to make any suggestions to improve the program, please open an issue or email george.bouras@adelaide.edu.au.

Citation

George Bouras, Roshan Nepal, Ghais Houtak, Alkis James Psaltis, Peter-John Wormald, Sarah Vreugde, Pharokka: a fast scalable bacteriophage annotation tool, Bioinformatics, Volume 39, Issue 1, January 2023, btac776, https://doi.org/10.1093/bioinformatics/btac776

If you use pharokka, I would recommend a citation in your manuscript along the lines of:

With the following full citations for the constituent tools below where relevant: