phold is a sensitive annotation tool for bacteriophage genomes and metagenomes using protein structural homology.
phold uses the ProstT5 protein language model to rapidly translate protein amino acid sequences to the 3Di token alphabet used by Foldseek. Foldseek is then used to search these against a database of over 1 million phage protein structures, mostly predicted using Colabfold.
Alternatively, instead of using ProstT5, you can supply protein structures that you have pre-computed for your phage(s) using the parameters --structures and --structure_dir with phold compare.
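As a rough illustration of the pre-computed structures route, the sketch below checks that a structure directory exists and is non-empty before calling phold compare. The directory name my_structures and the output directory name are hypothetical, and the sketch assumes that --structures is a switch while --structure_dir takes a path; please confirm both against phold compare --help. It does not cover how the structure files inside the directory should be named.

```shell
#!/bin/sh
set -eu

INPUT="tests/test_data/NC_043029.gbk"   # GenBank input used elsewhere in this README
STRUCT_DIR="my_structures"              # hypothetical directory of pre-computed structures
OUTDIR="output_phold_structures"        # hypothetical output directory

# Assemble the arguments as positional parameters so paths with spaces survive.
# Assumption: --structures is a boolean switch and --structure_dir takes a path.
set -- compare -i "$INPUT" --structures --structure_dir "$STRUCT_DIR" -o "$OUTDIR" -t 8

if [ ! -d "$STRUCT_DIR" ] || [ -z "$(ls -A "$STRUCT_DIR")" ]; then
    echo "No structures found in $STRUCT_DIR; not running phold compare" >&2
elif command -v phold >/dev/null 2>&1; then
    phold "$@"
else
    echo "phold not on PATH; would run: phold $*" >&2
fi
```

The check on the directory is only a convenience so that a typo in the path fails early rather than partway through a run.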
Benchmarking is ongoing, but phold strongly outperforms Pharokka, particularly for less characterised phages such as those from metagenomic datasets.
The plot below shows the percentage of annotated coding sequences (CDS) for 179 metagenomic phage genomes assembled with phables. Phold v0.2.0, run both with default settings (with ProstT5) and with pre-computed protein structures (predicted with Colabfold), was compared against Pharokka v1.7.0.
If you have already annotated your phage(s) with Pharokka, phold takes the Genbank output of Pharokka as an input option, so you can easily update the annotation with more functional predictions!
Check out the phold tutorial at https://phold.readthedocs.io/en/latest/tutorial/.
If you don't want to install phold locally, you can run it without any code using one of the following Google Colab notebooks:
pharokka + phold + phynteny: use this link
phold
If your phage(s) are too big, just don't run the Phynteny step!
Check out the full documentation at https://phold.readthedocs.io.
For more details (particularly if you are using a non-NVIDIA GPU), check out the installation documentation.
The best way to install phold is using mamba, as this will install Foldseek (the only non-Python dependency) along with the Python dependencies.
To install phold using mamba:
mamba create -n pholdENV -c conda-forge -c bioconda phold
To use phold with a GPU, a GPU-compatible version of pytorch must be installed. By default, conda/mamba will install a CPU-only version.
If you have an NVIDIA GPU, please try:
mamba create -n pholdENV -c conda-forge -c bioconda phold pytorch=*=cuda*
If you have a Mac running an Apple Silicon chip (M1/M2/M3), phold should be able to use the GPU. Please try:
mamba create -n pholdENV python==3.11
conda activate pholdENV
mamba install pytorch::pytorch torchvision torchaudio -c pytorch
mamba install -c conda-forge -c bioconda phold
If you are having trouble with pytorch, see this link for more instructions. If you have an older version of CUDA installed, then you might find this link useful.
Once phold is installed, to download and install the database run:
phold install
The phold databases, including ProstT5, are just over 8GB uncompressed.
phold takes a GenBank format file output from pharokka or from NCBI Genbank as its input by default.
If you are running phold on a local work station with GPU available, using phold run is recommended. It runs both phold predict and phold compare:
phold run -i tests/test_data/NC_043029.gbk -o test_output_phold -t 8
If you do not have a GPU available, add --cpu.
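The choice between GPU and CPU can be scripted. The sketch below adds --cpu only when no GPU seems to be present. The detection heuristic is an assumption made for illustration and is not something phold does itself: a working nvidia-smi is taken as evidence of an NVIDIA GPU, and Darwin on arm64 is taken to mean Apple Silicon. Other GPU types are not detected.

```shell
#!/bin/sh
set -eu

INPUT="tests/test_data/NC_043029.gbk"
OUTDIR="test_output_phold"

# Start from the command shown in this README.
set -- run -i "$INPUT" -o "$OUTDIR" -t 8

# Heuristic GPU detection (an assumption, not phold behaviour).
HAS_GPU=no
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    HAS_GPU=yes
elif [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
    HAS_GPU=yes
fi

# Fall back to CPU-only mode when no GPU was detected.
if [ "$HAS_GPU" = "no" ]; then
    set -- "$@" --cpu
fi

echo "phold $*"
# Only run when phold is installed and the input file is present.
if command -v phold >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    phold "$@"
fi
```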
phold run will run in a reasonable time for small datasets with CPU only (e.g. <5 minutes for a 50kbp phage).
However, phold predict will complete much faster if a GPU is available, and a GPU is necessary for large metagenomic datasets to run in a reasonable time.
In a cluster environment, it is most efficient to run phold in 2 steps:
1. Predict the 3Di sequences with ProstT5 using phold predict. This is massively accelerated if a GPU is available.
phold predict -i tests/test_data/NC_043029.gbk -o test_predictions
2. Compare the 3Di sequences to the phold structure database with Foldseek using phold compare. This does not utilise a GPU.
phold compare -i tests/test_data/NC_043029.gbk --predictions_dir test_predictions -o test_output_phold -t 8
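For several genomes, the two steps can be wrapped in a small helper so that all predictions run in one (GPU) job and all comparisons in another (CPU) job. This is a minimal sketch rather than a recommended pipeline: the run_stage function, the PHOLD override variable, the genbank_inputs directory and the output naming scheme are all hypothetical. Only the phold flags themselves come from the commands above.

```shell
#!/bin/sh
set -eu

PHOLD="${PHOLD:-phold}"   # hypothetical override: PHOLD=echo prints arguments only

# Run one stage ("predict" or "compare") for a single GenBank file.
# Output directories are named after the input file (hypothetical scheme).
run_stage() {
    stage=$1
    gbk=$2
    name=$(basename "$gbk" .gbk)
    case "$stage" in
        predict)
            "$PHOLD" predict -i "$gbk" -o "${name}_predictions" ;;
        compare)
            "$PHOLD" compare -i "$gbk" --predictions_dir "${name}_predictions" \
                -o "${name}_phold" -t 8 ;;
        *)
            echo "unknown stage: $stage" >&2
            return 1 ;;
    esac
}

# Example usage (genbank_inputs is a hypothetical directory):
#   GPU job:  for gbk in genbank_inputs/*.gbk; do run_stage predict "$gbk"; done
#   CPU job:  for gbk in genbank_inputs/*.gbk; do run_stage compare "$gbk"; done
```

Keeping the stage as an argument means the same script can be submitted twice with different resource requests.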
Output files include:
phold_3di.fasta, containing the 3Di sequences for each CDS
phold_per_cds_predictions.tsv, containing detailed annotation information on every CDS
phold_all_cds_functions.tsv, containing counts per contig of CDS in each PHROGs category, VFDB, CARD, ACRDB and Defensefinder databases (similar to the pharokka_cds_functions.tsv from Pharokka)
phold.gbk, a GenBank format file that includes these annotations and keeps any other genomic features (tRNA, CRISPR repeats, tmRNAs) from the pharokka Genbank input file, if provided
Usage: phold [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help     Show this message and exit.
  -V, --version  Show the version and exit.

Commands:
  citation          Print the citation(s) for this tool
  compare           Runs Foldseek vs phold db
  createdb          Creates foldseek DB from AA FASTA and 3Di FASTA input...
  install           Installs ProstT5 model and phold database
  plot              Creates Phold Circular Genome Plots
  predict           Uses ProstT5 to predict 3Di tokens - GPU recommended
  proteins-compare  Runs Foldseek vs phold db on proteins input
  proteins-predict  Runs ProstT5 on a multiFASTA input - GPU recommended
  remote            Uses Foldseek API to run ProstT5 then Foldseek locally
  run               phold predict then compare all in one - GPU recommended

Usage: phold run [OPTIONS]

  phold predict then compare all in one - GPU recommended

Options:
  -h, --help                     Show this message and exit.
  -V, --version                  Show the version and exit.
  -i, --input PATH               Path to input file in Genbank format or
                                 nucleotide FASTA format [required]
  -o, --output PATH              Output directory [default: output_phold]
  -t, --threads INTEGER          Number of threads [default: 1]
  -p, --prefix TEXT              Prefix for output files [default: phold]
  -d, --database TEXT            Specific path to installed phold database
  -f, --force                    Force overwrites the output directory
  --batch_size INTEGER           batch size for ProstT5. 1 is usually fastest.
                                 [default: 1]
  --cpu                          Use cpus only.
  --omit_probs                   Do not output 3Di probabilities from ProstT5
  --finetune                     Use finetuned ProstT5 model (PhrostT5).
                                 Experimental and not recommended for now
  --finetune_path TEXT           Path to finetuned model weights
  --save_per_residue_embeddings  Save the ProstT5 embeddings per residue in a
                                 h5 file
  --save_per_protein_embeddings  Save the ProstT5 embeddings as means per
                                 protein in a h5 file
  -e, --evalue FLOAT             Evalue threshold for Foldseek [default:
                                 1e-3]
  -s, --sensitivity FLOAT        Sensitivity parameter for foldseek [default:
                                 9.5]
  --keep_tmp_files               Keep temporary intermediate files,
                                 particularly the large foldseek_results.tsv
                                 of all Foldseek hits
  --card_vfdb_evalue FLOAT       Stricter Evalue threshold for Foldseek CARD
                                 and VFDB hits [default: 1e-10]
  --separate                     Output separate GenBank files for each contig
  --max_seqs INTEGER             Maximum results per query sequence allowed to
                                 pass the prefilter. You may want to reduce
                                 this to save disk space for enormous datasets
                                 [default: 10000]
  --only_representatives         Foldseek search only against the cluster
                                 representatives (i.e. turn off --cluster-
                                 search 1 Foldseek parameter)
  --ultra_sensitive              Runs phold with maximum sensitivity by
                                 skipping Foldseek prefilter. Not recommended
                                 for large datasets.

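To show how several of the phold run options listed above combine, here is a small sketch that tightens the Foldseek E-value, writes one GenBank file per contig, sets a custom output prefix and overwrites any earlier output. The run_strict function, the PHOLD override variable, the prefix strict and the threshold 1e-5 are illustrative choices, not recommendations from the phold authors.

```shell
#!/bin/sh
set -eu

PHOLD="${PHOLD:-phold}"   # hypothetical override: PHOLD=echo prints arguments only

# run_strict INPUT OUTDIR
# All flags come from the "phold run" help text; the values are illustrative.
run_strict() {
    "$PHOLD" run \
        -i "$1" \
        -o "$2" \
        -t 8 \
        -p strict \
        -e 1e-5 \
        --separate \
        --force
}

# Example usage:
#   run_strict tests/test_data/NC_043029.gbk strict_output
```

Add --cpu to the flag list if no GPU is available.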
phold plot will allow you to create Circos plots with pyCirclize for all your phage(s). For example:
phold plot -i tests/test_data/NC_043029_phold_output.gbk -o NC_043029_phold_plots -t '${Stenotrophomonas}$ Phage SMA6'
phold is a work in progress and a preprint will be coming soon. If you use it, please cite the GitHub repository https://github.com/gbouras13/phold for now.
Please be sure to cite the following core dependencies and the PHROGs database:
Please also consider citing these supplementary databases where relevant:
Harutyun Sahakyan, Kira S. Makarova, and Eugene V. Koonin. Search for Origins of Anti-CRISPR Proteins by Structure Comparison. The CRISPR Journal (2023)