bigbio / pgt-pangenome

Proteogenomics analysis based on Pangenome references
BSD 2-Clause "Simplified" License

Pangenome Proteogenomics

The aim of this project is to search normal tissue proteomics datasets to identify novel proteins using the latest genome assemblies published via the PanGenome project.

Project Aims

Proteogenomics workflow

Workflow components:

Spectrum identification validation

Spectrum identifications are validated with the Python script ms2pip_novel.py.

ms2pip_novel.py contains a series of functions that together help create an MGF file from peptide data, run MS2PIP predictions, and compute additional metrics for each spectrum such as signal-to-noise ratio, number of peaks, and difference between the highest and lowest peaks.

Here's a brief overview of the main components of the code:

These functions and command-line commands together facilitate the process of working with peptide and MGF data files, running predictions using MS2PIP, and filtering and computing metrics for the resulting spectra.
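
As an illustration of the per-spectrum metrics mentioned above, here is a minimal sketch in Python. It is not taken from ms2pip_novel.py: the function name `spectrum_metrics` is hypothetical, and the signal-to-noise definition (highest peak divided by the median peak intensity) is an assumption, since the README does not state which definition the script uses.

```python
from statistics import median


def spectrum_metrics(intensities):
    """Compute simple quality metrics from a spectrum's peak intensities.

    Assumption: signal-to-noise is the highest intensity divided by the
    median intensity. The real script may define it differently.
    """
    if not intensities:
        raise ValueError("spectrum has no peaks")
    highest = max(intensities)
    lowest = min(intensities)
    noise = median(intensities)
    return {
        # Number of peaks in the spectrum
        "num_peaks": len(intensities),
        # Difference between the highest and lowest peak intensities
        "peak_range": highest - lowest,
        # Guard against a zero median to avoid division by zero
        "signal_to_noise": highest / noise if noise > 0 else float("inf"),
    }
```

Such metrics could then be used as filters, for example by discarding spectra with very few peaks or a low signal-to-noise ratio; any cutoffs would need to be chosen per dataset.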

Variant annotation

The SpectrumAI algorithm was originally published in Nature Communications by Yafeng Zhu et al. and was first implemented in R. We reimplemented it in Python in the pypgatk toolbox, which makes it faster to run and easier to integrate into other Python workflows. The original algorithm is described as follows:

Assume a 12-amino-acid peptide is identified with single substitution at 8th residue, in order to pass SpectrumAI, it must have matched MS2 peaks (within fragment ion mass tolerance) from at least one of the following groups: b7&b8, y4&y5, y4&b7 or y5&b8. Second, the sum intensity of the supporting flanking MS2 ions must be larger than the median intensity of all fragmentation ions. An exception to these criteria is made when the substituted amino acid has a proline residue to its N-terminal side. Because CID/HCD fragmentation at the C-terminal side of a proline residue is thermodynamically unfavored, SpectrumAI only demands the presence of any b or y fragment ions containing substituted amino acids, in this case, b8 to b11, y5 to y11.
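
The rule quoted above can be sketched in Python as follows. This is a minimal illustration, not the pypgatk implementation. The function names, the `matched` dictionary (ion label to intensity), and the use of a caller-supplied list of fragment intensities for the median are assumptions made for the example. Substitutions at the first or last residue get no special handling here.

```python
from statistics import median


def flanking_groups(length, pos):
    """Ion pairs that flank a substitution at 1-based position `pos`."""
    b_before, b_at = f"b{pos - 1}", f"b{pos}"
    y_after, y_at = f"y{length - pos}", f"y{length - pos + 1}"
    # For length=12, pos=8: b7&b8, y4&y5, y4&b7, y5&b8
    return [(b_before, b_at), (y_after, y_at), (y_after, b_before), (y_at, b_at)]


def containing_ions(length, pos):
    """All b and y ions that contain the substituted residue."""
    # For length=12, pos=8: b8..b11 and y5..y11
    b_ions = [f"b{i}" for i in range(pos, length)]
    y_ions = [f"y{i}" for i in range(length - pos + 1, length)]
    return b_ions + y_ions


def passes_spectrumai(peptide, pos, matched, all_intensities):
    """Apply the SpectrumAI criteria to one identified peptide.

    peptide: amino-acid sequence; pos: 1-based substitution position.
    matched: dict of matched ion label -> intensity (assumed input shape).
    all_intensities: intensities used for the median (assumed input).
    """
    n = len(peptide)
    # Proline exception: residue on the N-terminal side of the substitution
    if pos >= 2 and peptide[pos - 2] == "P":
        return any(ion in matched for ion in containing_ions(n, pos))
    groups = flanking_groups(n, pos)
    # First criterion: at least one complete flanking pair is matched
    if not any(a in matched and b in matched for a, b in groups):
        return False
    # Second criterion: summed intensity of the matched flanking ions
    # (assumption: all matched flanking ions count) exceeds the median
    flank = {ion for group in groups for ion in group}
    support = sum(matched[ion] for ion in flank if ion in matched)
    return support > median(all_intensities)
```

For the 12-residue example in the text, `flanking_groups(12, 8)` yields the pairs b7&b8, y4&y5, y4&b7 and y5&b8, and `containing_ions(12, 8)` yields b8 to b11 and y5 to y11.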

Retention time prediction

The script deeplc_novel.py uses DeepLC to evaluate retention time predictions for the novel peptides. It trains DeepLC on canonical peptides (e.g. GRCh38 peptides), then uses the novel peptides to evaluate its performance and to filter them.
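
One way such a filter could work is sketched below. This is not the logic of deeplc_novel.py, whose filtering criterion is not described here. The sketch assumes that predicted retention times are already available (for example from DeepLC, which is not called in this code), derives an error threshold from the canonical peptides, and keeps only novel peptides within that threshold. The function names and the 0.95 quantile are hypothetical choices.

```python
def rt_error_threshold(observed, predicted, quantile=0.95):
    """Absolute-error cutoff taken from canonical peptides.

    Assumption: the cutoff is an empirical quantile of |observed - predicted|
    over the canonical peptides.
    """
    errors = sorted(abs(o - p) for o, p in zip(observed, predicted))
    if not errors:
        raise ValueError("no canonical peptides given")
    # Clamp the index so quantile=1.0 selects the largest error
    index = min(len(errors) - 1, int(quantile * len(errors)))
    return errors[index]


def filter_novel(novel, threshold):
    """Keep novel peptides whose retention time error is within the cutoff.

    novel: iterable of (peptide, observed_rt, predicted_rt) tuples.
    """
    return [pep for pep, obs, pred in novel if abs(obs - pred) <= threshold]
```

With this design, a novel peptide is retained only if its retention time is predicted about as well as those of the canonical peptides; a stricter or looser quantile trades sensitivity against specificity.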

Pangenome reanalysis of normal tissue datasets

Datasets of normal tissues

We used two large normal-tissue datasets to detect novel peptides from pangenomes and to validate the results. The datasets are:

Database information

Results from analysis

The original PSMs are stored in quantms.io format.

Structure of the repository

Filtering scripts and notebooks

Other notebooks during the analysis

Other files generated during the analysis

Authors