
peptigate


Purpose

Peptigate is a workflow that predicts bioactive peptides from transcriptome assemblies and annotates those predictions. Its name is an abbreviated portmanteau of "peptide" and "investigate" -> peptigate. The workflow predicts peptides (small open reading frames, cleavage peptides, and ribosomally synthesized and post-translationally modified peptides) and then annotates them.

For more information, see the pub, "Predicting bioactive peptides from transcriptome assemblies with the peptigate workflow."

Installation and Usage

Currently, the peptigate repository needs to be cloned to use the workflow.

git clone https://github.com/Arcadia-Science/peptigate.git
cd peptigate

This repository uses Snakemake to run the pipeline and conda to manage software environments and installations. You can find operating system-specific instructions for installing miniconda here. After installing conda and mamba, run the following commands to create and activate the pipeline run environment.

mamba env create -n peptigate --file envs/dev.yml
conda activate peptigate

Snakemake manages rule-specific environments via the conda directive using environment files in the envs/ directory. Snakemake itself is installed in the main development conda environment as specified in the dev.yml file.
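
As a minimal sketch of this pattern (the rule name, input, output, and shell command below are hypothetical placeholders, not taken from the actual Snakefile):

rule example_step:
    input:
        "outputs/example_input.fa"
    output:
        "outputs/example_output.fa"
    conda:
        "envs/dev.yml"  # rule-specific environment file from the envs/ directory
    shell:
        "cp {input} {output}"  # stand-in for the real command

When invoked with --software-deployment-method conda, Snakemake builds the environment described by the envs/ file and activates it for just that rule.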

To start the pipeline on the demo data set, run:

snakemake --software-deployment-method conda -j 1 --configfile demo/config.yml

Input data

The peptigate pipeline requires three input files as well as paths to three directories.

These inputs are provided to the peptigate pipeline by a config file. The demo config.yml is an example config file, while the demo directory contains example input files. We also provide a blank config.yml file as a template.

Many of the steps in the workflow require models or databases from external sources. These data are either included in the inputs folder of this repository or are downloaded by the Snakemake pipeline itself.
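
As an illustration of the download pattern, a rule of roughly the following shape could fetch one of these files (the rule name, output path, and URL are placeholders, not the pipeline's actual sources):

rule download_model:
    output:
        "inputs/models/example_model.pt"
    params:
        url="https://example.com/example_model.pt"  # placeholder URL
    shell:
        "curl -L {params.url} -o {output}"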

Protein-only input mode

We also include a workflow that takes a single file of protein sequences as input. The demo config_protein_as_input.yml is an example config file for this workflow.

Overview

Peptigate is a Snakemake-based pipeline that integrates tools to identify small open reading frames (sORFs) and cleavage peptides. Each peptide prediction is then annotated to provide clues as to the bioactivity or function of the peptide.

Description of how the tool works

The peptigate pipeline is organized into three sections: small open reading frame (sORF) prediction, cleavage peptide prediction, and predicted peptide annotation.

Small open reading frame (sORF) prediction

Background. Small open reading frames (sORFs) encode peptides that are "born small" -- they are fewer than 100 amino acids long but are otherwise like longer proteins in that they are synthesized via DNA transcription and ribosomal translation. Many tools that predict open reading frames (ORFs) in transcripts have decreased accuracy at shorter lengths and by default do not output predictions shorter than 100 amino acids. Yet some sORFs produce functional proteins, leading to systematic under-detection of these proteins and underappreciation of their biological roles. Techniques like ribosome profiling and peptidomics mass spectrometry have highlighted the ubiquity of these proteins as well as some of their biological roles. Many sORFs are located upstream or downstream of canonical long ORFs and play a regulatory role by influencing translation. Other sORFs encode functional peptides.

What the pipeline does. The peptigate pipeline targets stand-alone sORFs with the goal of identifying functional peptides. Peptigate begins sORF prediction by removing transcripts that already have predicted ORFs. It then uses the plm-utils tool to predict open reading frames in the remaining transcripts. plm-utils pairs canonical and non-canonical start codons (TTG, CTG, ATG, GTG, ACG) with traditional stop codons for ORF prediction, as sORFs frequently use non-canonical start sites. If a predicted ORF is shorter than 301 nucleotides, plm-utils then predicts whether the ORF is coding using a model trained on ESM embeddings. We think these embeddings capture information about the secondary structure of proteins; large protein language models have previously been shown to make highly accurate structural predictions for some peptides. Peptigate returns the sequences of predicted sORFs in amino acid and nucleotide formats.
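
To make the search criteria concrete, below is a simplified sketch of that scanning logic. It is illustrative only -- not plm-utils's actual implementation -- and scans a single strand, keeping the first start codon per ORF:

START_CODONS = {"TTG", "CTG", "ATG", "GTG", "ACG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def candidate_sorfs(transcript, max_len=301):
    """Yield ORFs shorter than max_len nucleotides (under ~100 amino acids)."""
    for frame in range(3):
        start = None
        for i in range(frame, len(transcript) - 2, 3):
            codon = transcript[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # open an ORF at the first start codon seen
            elif start is not None and codon in STOP_CODONS:
                orf = transcript[start:i + 3]  # include the stop codon
                if len(orf) < max_len:
                    yield orf
                start = None  # close the ORF and keep scanning

An ORF passing this length filter would then be classified as coding or non-coding by the ESM-embedding model.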

Observations from running the pipeline. Because many valid sORFs are co-encoded on transcripts with longer ORFs, peptigate will over-predict functional sORFs when fragmented transcripts are supplied to the pipeline. This is because peptigate will detect sORFs encoded in the 5' or 3' UTR of a longer canonical ORF if the canonical ORF was not annotated because it is part of a fragmented contig. We implemented a filtering step to try to catch some of these cases, but this is a tricky problem for which we don't have a great solution. If this is your situation, two tricks that might prove useful to filter predictions down to the most likely functional peptides (sketched after this list) are to:

  1. Search for sORF predictions with both chain and signal peptides annotated. If a peptide has both, it may be more likely to be targeted to a specific location, potentially indicating that it is more likely to be functional.
  2. Filter to sORF predictions that have hits against the Peptipedia database. This limits the predicted sORFs to those that are homologous to previously discovered peptides.
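
A sketch of both filters, assuming the predictions land in a combined annotations table (the file path and column names here are hypothetical -- match them to the actual files in your output folder):

import pandas as pd

# Hypothetical combined annotations table.
preds = pd.read_csv("predictions/peptide_annotations.tsv", sep="\t")

# Trick 1: keep sORFs annotated with both a chain and a signal peptide.
targeted = preds[preds["has_chain"] & preds["has_signal_peptide"]]

# Trick 2: keep sORFs with a hit against the Peptipedia database.
known_like = preds[preds["peptipedia_hit"].notna()]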

Cleavage peptide prediction

Background. Cleavage peptides are generated by enzymatic cleavage (proteolysis) of precursor proteins. These peptides are initially ribosomally translated while embedded in the precursor protein and are then cleaved to become biologically active. One example is preproglucagon, a precursor protein cleaved into peptides that regulate insulin secretion in pancreatic beta cells.

What the pipeline does. The peptigate pipeline predicts two categories of cleavage peptides: traditional cleavage peptides and ribosomally synthesized and post-translationally modified peptides (RiPPs). For traditional cleavage peptides, peptigate runs the DeepPeptide tool. DeepPeptide predicts the presence of both cleavage peptides and propeptides, where peptides are thought to be biologically active once cleaved and propeptides are not. For RiPPs, peptigate runs the NLPPrecursor tool. NLPPrecursor predicts the cleavage site and the RiPP type (for example, "lasso" peptides). For both classes, peptides are predicted from input protein sequences. Peptigate returns the sequences of predicted peptides and their precursor (parent) proteins in amino acid and nucleotide formats.

Observations from running the pipeline. The NLPPrecursor models that predict RiPPs were trained exclusively on bacterial data. While eukaryotes have RiPPs, it's not clear how similar in structure these peptides are to those used to train the NLPPrecursor model. Even so, we found good support for these peptides in orthogonal experimental evidence. We think it's possible that the RiPPs we detected were at some point horizontally transferred from bacteria to eukaryotes; however, we have not followed up on this hypothesis.

Predicted peptide annotation

What the pipeline does. The peptigate pipeline uses multiple approaches to get clues about the function of the predicted peptides. For all predicted peptides, it predicts whether the peptide contains a signal peptide using DeepSig, compares the peptide against known peptides in the Peptipedia database, and performs functional annotation with 16 pre-trained AutoPeptideML models.

Description of the folder structure and files

Folders and files in this repository

Folders and files output by the workflow

All predicted peptide sequences and annotation information are reported in the predictions/ subfolder of the output folder specified in config.yml. Other folders record intermediate files needed to make the final prediction files. See below for a description of each folder.
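
As a rough sketch of the layout (all names other than predictions/ are illustrative):

output_folder/
├── predictions/   # final peptide sequences and annotation tables
└── ...            # intermediate folders, described below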

Intermediate folders and files

Compute Specifications

We ran the pipeline on an AWS EC2 g4dn.2xlarge instance running the AMI Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04) 20240122 (AMI ID ami-07eb000b3340966b0). The tools plm-utils, DeepPeptide, NLPPrecursor, and AutoPeptideML can use GPUs, so compute times will be substantially faster (hours vs. days) on a GPU than on a CPU. We did not test the pipeline on a CPU, so there is a chance it will only work on a GPU.

Contributing

See how we recognize feedback and contributions to our code.