Proteogenomics analysis based on Pangenome references
The aim of this project is to search normal tissue proteomics datasets to identify novel proteins using the latest genome assemblies published via the PanGenome project.
- Develop a workflow based on quantms to reanalyze public proteomics datasets with custom proteogenomics databases.
- Develop a workflow that enables systematic validation of novel (non-canonical) peptides using multiple existing tools.
- Perform a comprehensive analysis of multiple normal tissue datasets from the public domain using databases generated from the latest Pangenome assemblies.
- Provide a FASTA database with all the novel proteins observed.
- Draft the manuscript layout and sections.
Workflow components:
For spectrum identification, the Python script ms2pip_novel.py is used. It contains a series of functions that together create an MGF file from peptide data, run MS2PIP predictions, and compute additional metrics for each spectrum, such as the signal-to-noise ratio, the number of peaks, and the difference between the highest and lowest peak intensities.
Here's a brief overview of the main components of the code:
- create_mgf: Creates an MGF file from a peptide file and mzML data. It reads mzML files (either locally or from an FTP server), parses the spectra with the read_spectra_from_mzml function, and writes them to an MGF file.
- run_ms2pip: Runs MS2PIP predictions on a given peptide and MGF file, merges the predictions with the original data, and saves the results to an output file.
- filter-ms2pip: Removes low-quality peptides based on configurable thresholds. It first filters out peptides whose sequence length falls below a specified threshold and then dynamically sets cutoffs based on percentiles of the signal-to-noise ratio.

Together, these commands facilitate working with peptide and MGF data files, running MS2PIP predictions, and filtering and computing metrics for the resulting spectra.
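The metric computation and percentile-based filtering described above can be sketched as follows. This is a minimal illustration under assumed names: the functions, the record fields, and the crude median-based noise estimate are not taken from the actual ms2pip_novel.py implementation.

```python
import numpy as np

def spectrum_metrics(intensities):
    """Compute simple quality metrics for one spectrum (illustrative)."""
    arr = np.asarray(intensities, dtype=float)
    noise = np.median(arr)  # crude noise estimate for the sketch
    snr = arr.max() / noise if noise > 0 else float("inf")
    return {
        "snr": snr,
        "num_peaks": int(arr.size),
        "intensity_range": float(arr.max() - arr.min()),
    }

def filter_peptides(records, min_length=7, snr_percentile=25):
    """Keep peptides above a minimum sequence length, then apply a dynamic
    SNR cutoff derived from a percentile of the surviving records."""
    long_enough = [r for r in records if len(r["sequence"]) >= min_length]
    snr_cutoff = np.percentile([r["snr"] for r in long_enough], snr_percentile)
    return [r for r in long_enough if r["snr"] >= snr_cutoff]
```

The percentile-based cutoff adapts to each dataset instead of relying on a fixed absolute threshold, which mirrors the dynamic thresholding idea described above.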
The SpectrumAI algorithm was originally published in Nature Communications by Zhu et al. and was first implemented in R. We reimplemented the algorithm in Python in the pypgatk toolbox, enabling faster execution and integration into other Python workflows. The original algorithm works as follows:
Assume a 12-amino-acid peptide is identified with a single substitution at the 8th residue. To pass SpectrumAI, it must first have matched MS2 peaks (within the fragment ion mass tolerance) from at least one of the following pairs: b7 & b8, y4 & y5, y4 & b7, or y5 & b8. Second, the summed intensity of the supporting flanking MS2 ions must be larger than the median intensity of all fragment ions. An exception to these criteria is made when the substituted amino acid has a proline residue on its N-terminal side: because CID/HCD fragmentation at the C-terminal side of a proline residue is thermodynamically unfavored, SpectrumAI only demands the presence of any b or y fragment ion containing the substituted amino acid, in this case b8 to b11 or y5 to y11.
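The flanking-ion rule above can be expressed as a simple check over matched fragment-ion labels. This is a simplified sketch, not the pypgatk implementation: the function name, the `matched_ions` / `ion_intensities` inputs, and the boolean proline flag are all assumptions made for illustration.

```python
import statistics

def spectrum_ai_pass(matched_ions, ion_intensities, sub_pos, pep_len,
                     n_term_proline=False):
    """Simplified SpectrumAI check for a substitution at 1-based position
    sub_pos in a peptide of length pep_len.

    matched_ions: set of matched fragment labels, e.g. {"b7", "y5"}.
    ion_intensities: dict mapping label -> intensity for matched ions.
    """
    b_left, b_right = f"b{sub_pos - 1}", f"b{sub_pos}"
    y_left = f"y{pep_len - sub_pos}"
    y_right = f"y{pep_len - sub_pos + 1}"

    if n_term_proline:
        # Fragmentation C-terminal to proline is unfavoured, so only require
        # any b or y ion that contains the substituted residue.
        b_ok = any(f"b{i}" in matched_ions for i in range(sub_pos, pep_len))
        y_ok = any(f"y{i}" in matched_ions
                   for i in range(pep_len - sub_pos + 1, pep_len))
        return b_ok or y_ok

    # For a 12-mer with a substitution at residue 8 these pairs are
    # (b7, b8), (y4, y5), (y4, b7), (y5, b8), matching the text above.
    pairs = [(b_left, b_right), (y_left, y_right),
             (y_left, b_left), (y_right, b_right)]
    supported = [p for p in pairs
                 if p[0] in matched_ions and p[1] in matched_ions]
    if not supported:
        return False

    # Summed intensity of the supporting flanking ions must exceed the
    # median intensity of all matched fragment ions.
    median_all = statistics.median(ion_intensities.values())
    best = max(sum(ion_intensities.get(i, 0.0) for i in pair)
               for pair in supported)
    return best > median_all
```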
Using DeepLC, the script deeplc_novel.py evaluates retention-time prediction performance on the novel peptides. It uses canonical peptides (e.g., GRCh38 peptides) to train DeepLC, then uses the novel peptides to evaluate its performance and filter them.
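The filtering idea can be sketched independently of the DeepLC API: derive an error tolerance from the canonical peptides' prediction errors, then discard novel peptides whose observed-versus-predicted retention-time error exceeds it. The function names and the 95th-percentile tolerance are illustrative assumptions, not the deeplc_novel.py implementation.

```python
import numpy as np

def rt_error_tolerance(canonical_observed, canonical_predicted, percentile=95):
    """Tolerance = percentile of |observed - predicted| retention time
    on canonical (training/calibration) peptides."""
    errors = np.abs(np.asarray(canonical_observed, dtype=float)
                    - np.asarray(canonical_predicted, dtype=float))
    return float(np.percentile(errors, percentile))

def filter_novel_by_rt(novel, tolerance):
    """Keep novel peptides whose RT prediction error is within tolerance.

    novel: iterable of (sequence, observed_rt, predicted_rt) tuples.
    """
    return [seq for seq, obs, pred in novel if abs(obs - pred) <= tolerance]
```

Because the tolerance is learned from canonical peptides, a novel peptide is only retained when its retention behaviour is as predictable as that of well-established identifications.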
We used two large normal tissue datasets to detect novel peptides from pangenomes and to validate the results. The datasets are:
The py-pgatk package is used to generate protein sequences for the samples provided by Liao et al. in the Human Pangenome Reference Consortium. The steps and detailed information about the database generation can be accessed through the db_generation notebook. The cdna and ncrna files from Ensembl release-110 were also included in the database. The original PSMs are stored in quantms.io format.