Need help on learning to run MS2PIP

compomics / ms2pip

MS²PIP: Fast and accurate peptide spectrum prediction for multiple fragmentation methods, instruments, and labeling techniques.

Apache License 2.0


Hi,

I am new to mass spec analysis and would like to use MS2PIP to improve protein predictions. My main goal is to identify small proteins in a mass spec dataset.

So far, I have searched MGF files against a database of UniProt-annotated proteins (H. sapiens). I then searched the resulting unmatched spectra against a database of small proteins, which yields a number of small-protein predictions. All of the searches were performed with PeptideShaker. As the MS experiment was not optimised for small proteins, the small-peptide predictions are naturally of low or doubtful confidence. Hence, I would like to see whether MS2PIP could improve the prediction quality.

I am a little confused as to where to start, however. I understand MS2PIP requires a PEPREC file to run. I generated a PEPREC file of the small proteins that I am interested in (~40,000 small proteins), using the fasta2PEPREC.py script in the conversion_tools folder. I am not sure if I did this correctly, as the resulting PEPREC file does not contain any amino acid modifications (e.g. oxidation of M, carbamidomethylation of C). How do I generate a PEPREC file that properly contains amino acid modifications?

Having generated the PEPREC file, I then ran ms2pip, which output an HCD_predictions.csv file. I am stuck here, as I don't know how to proceed to get improved protein predictions. Am I using the right workflow? That is, should I be starting from the protein database in the first place, or should I start from the output predictions from PeptideShaker?

Hi! You've come to the right place! MS2 spectrum predictions can give a boost in sensitivity to challenging identification workflows (see https://doi.org/10.1093/bioinformatics/btz383, and https://doi.org/10.1002/pmic.201900351). The easiest and most versatile way to make use of MS²PIP to improve your identification workflow is with MS²ReScore. I noticed your issue over there (https://github.com/compomics/ms2rescore/issues/11), so I'll help you out in that issue thread.

To clarify the use cases for our MS²PIP-related tools:

- MS²PIP only predicts spectra: for a given list of (modified) peptide sequences and charge states, it outputs a predicted spectrum for each. These spectra can be used directly for a number of use cases, for instance to manually compare with and inspect important peptide spectrum identifications (e.g. https://doi.org/10.1038/s41586-019-1555-y). This is, of course, not a very high-throughput way of working.
- Fasta2SpecLib is a sort of wrapper around MS²PIP that takes a protein FASTA file as input instead of a peptide list. It digests those proteins in silico and generates an MS²PIP-predicted spectral library to be used in spectral library searching, or as a reference library for DIA identifications (see https://doi.org/10.1002/pmic.201900306).
- MS²ReScore enables you to use MS²PIP, and recently also DeepLC, to rescore peptide identifications by adding additional information to the Percolator input. This gives a big boost in sensitivity, usually leading to more identifications at a more conservative false discovery rate threshold (https://doi.org/10.1093/bioinformatics/btz383).
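
On the question of getting modifications into a PEPREC file, here is a minimal sketch. It assumes the PEPREC layout as described in the MS²PIP README: space-separated columns `spec_id`, `modifications`, `peptide` and `charge`, with modifications written as `position|name` pairs joined by `|` (1-based residue positions) and `-` for an unmodified peptide. Please check this against the current documentation. The helper functions, the modification names and the charge states are illustrative assumptions and not part of MS²PIP. The names must match the ones declared in your MS²PIP configuration file.

```python
from itertools import combinations

# Assumed modification names; they must match the names declared in the
# MS²PIP configuration file that is passed alongside the PEPREC file.
FIXED_MODS = {"C": "Carbamidomethyl"}
VARIABLE_MODS = {"M": "Oxidation"}


def modification_strings(peptide, max_variable=1):
    """Yield PEPREC-style modification strings for one peptide.

    Fixed modifications are always applied; up to `max_variable` variable
    modifications are added in every combination. Positions are 1-based
    residue indices and "-" denotes an unmodified peptide.
    """
    fixed = [
        (pos, FIXED_MODS[aa])
        for pos, aa in enumerate(peptide, start=1)
        if aa in FIXED_MODS
    ]
    variable_sites = [
        (pos, VARIABLE_MODS[aa])
        for pos, aa in enumerate(peptide, start=1)
        if aa in VARIABLE_MODS
    ]
    for n in range(min(max_variable, len(variable_sites)) + 1):
        for chosen in combinations(variable_sites, n):
            mods = sorted(fixed + list(chosen))
            yield "|".join(f"{pos}|{name}" for pos, name in mods) or "-"


def build_peprec(peptides, charges=(2, 3)):
    """Return PEPREC text covering every peptide, modified form and charge."""
    lines = ["spec_id modifications peptide charge"]
    spec_id = 0
    for peptide in peptides:
        for mods in modification_strings(peptide):
            for charge in charges:
                spec_id += 1
                lines.append(f"pep{spec_id} {mods} {peptide} {charge}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_peprec(["ACMK", "PEPTIDEK"]), end="")
```

Note that enumerating every modified form of every peptide from ~40,000 proteins grows quickly. For rescoring, starting from the peptide identifications reported by the search engine keeps the list limited to the modified forms that were actually matched.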

Need help on learning to run MS2PIP #89