Open ypriverol opened 3 weeks ago
Currently, we have a workflow that can perform peptide identification using: -> ms2rescore -> SNR + spectrum properties -> percolator
Each of these combinations can be turned off. We used the dataset PXD014415
to benchmark the peptide identifications with some of the combinations:
The results can be found here: https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD014415-id-ms2rescore/
Combinations & PSMs counts:
Currently, the combination of ms2rescore (Comet + MSGF + SAGE) and SNR yields the most PSM identifications.
What is "non-sage"? Comet?
Sorry the non-sage is COMET+MSGF
Sage comes on top or as replacement?
How expensive is the snr/feature calculation? I think we could improve the pyopenms script if it is expensive.
Have you tried a more robust snr estimation? I.e. RMS of Top 10 / RMS of all? The max seems prone to outliers but seems to work well enough.
Have you planned any false positive evaluation (like ground truth, entrapment, cross-species)?
Sage comes on top or as replacement?
On top.
How expensive is the snr/feature calculation? I think we could improve the pyopenms script if it is expensive.
It is really fast, so there is no urgent need for improvements.
Have you tried a more robust snr estimation? I.e. RMS of Top 10 / RMS of all? The max seems prone to outliers but seems to work well enough.
We can add more metrics like RMS of the Top 10. Feel free to do a PR to quantms utils https://github.com/bigbio/quantms-utils/blob/main/quantmsutils/features/snr.py
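To make the suggestion concrete, here is a minimal sketch of the two estimates being compared. It does not reproduce the code in snr.py: the function names are hypothetical, and the max-based variant is assumed here to be the maximum intensity over the RMS of all peaks, which may differ from what quantms-utils actually computes.

```python
import numpy as np


def _rms(values):
    """Root mean square of a 1-D array of intensities."""
    return float(np.sqrt(np.mean(np.square(values))))


def snr_max(intensities):
    """Assumed max-based estimate: max intensity / RMS of all peaks."""
    x = np.asarray(intensities, dtype=float)
    if x.size == 0:
        return 0.0
    noise = _rms(x)
    return float(x.max() / noise) if noise > 0 else 0.0


def snr_top_n_rms(intensities, n=10):
    """Suggested estimate: RMS of the n most intense peaks / RMS of all peaks."""
    x = np.asarray(intensities, dtype=float)
    if x.size == 0:
        return 0.0
    noise = _rms(x)
    if noise == 0:
        return 0.0
    top = np.sort(x)[-n:]  # the n largest intensities (all peaks if fewer than n)
    return _rms(top) / noise


# A spectrum with one extreme peak among otherwise flat intensities:
spectrum = [1.0] * 99 + [1000.0]
print(snr_max(spectrum), snr_top_n_rms(spectrum))
```

With a single outlier peak, the Top 10 variant is pulled up less than the max-based one, which is the robustness argument made above.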
Have you planned any false positive evaluation (like ground truth, entrapment, cross-species)?
I'm open to suggestions. I would love to evaluate whether this 5% increase in PSMs affects the FDR in some way. I'm also open to suggestions on how to evaluate the difference between SNR+MS2rescore and MS2rescore alone. I have manually checked some IDs (in proteogenomics - https://www.biorxiv.org/content/10.1101/2024.05.24.595489v1), and I know that ms2rescore can rescue (identify) some low-quality spectra, which is the reason why we added the SNR. It would be nice to have a benchmark to prove it.
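One possible starting point for the entrapment idea is sketched below. It assumes the search database is extended with entrapment sequences (for example from another species) and that each accepted PSM can be labelled as original-target or entrapment. The function name is hypothetical, and the two formulas are the commonly described "lower bound" and "combined" entrapment estimates as I understand them, so please check them against the literature before relying on the numbers.

```python
def entrapment_fdp(n_target, n_entrapment, r=1.0):
    """Estimate the false discovery proportion among accepted PSMs.

    n_target:     accepted PSMs matching the original target database
    n_entrapment: accepted PSMs matching the entrapment sequences
    r:            size ratio of entrapment database to original database
                  (assumed known; r=1.0 means equal size)
    Returns (lower_bound, combined_estimate).
    """
    total = n_target + n_entrapment
    if total == 0:
        return 0.0, 0.0
    # Entrapment hits are certainly wrong, so this is a lower bound on the FDP.
    lower = n_entrapment / total
    # Scale up to account for false hits that landed in the original targets.
    combined = n_entrapment * (1.0 + 1.0 / r) / total
    return lower, combined


lower, combined = entrapment_fdp(n_target=9900, n_entrapment=100, r=1.0)
print(f"lower bound: {lower:.3f}, combined estimate: {combined:.3f}")
```

Running this for each combination (with and without SNR) at the same reported FDR would show whether the extra PSMs come with a higher estimated false discovery proportion.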
I was reading today about MSAmanda + ms2rescore, and the increase in PSMs there is 6%.
Any increase in #PSMs will depend on the type of dataset and search space. Generally, we see modest increases for simple searches (for instance the yeast UPS search above) and significant increases for difficult searches (46% for immunopeptidomics, 10.1016/j.mcpro.2022.100266).
Note that even with modest increases in sensitivity, the separation between true and false PSMs is expected to be better, which means that in most cases you could increase the specificity to 0.1% FDR without losing sensitivity (for instance shown in doi:10.1016/j.mcpro.2021.100076).
Thanks @RalfG for this response:
Any increase in #PSMs will depend on the type of dataset and search space. Generally, we see modest increases for simple searches (for instance the yeast UPS search above) and significant increases for difficult searches (46% for immunopeptidomics, 10.1016/j.mcpro.2022.100266).
Note that even with modest increases in sensitivity, the separation between true and false PSMs is expected to be better, which means that in most cases you could increase the specificity to 0.1% FDR without losing sensitivity (for instance shown in doi:10.1016/j.mcpro.2021.100076).
How do you test this? Distribution of the PEP scores or the original scores for targets and decoys?
Usually just by plotting the amount of confidently identified PSMs at each FDR threshold, as in figure 1 of doi:10.1016/j.mcpro.2021.100076.
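A minimal sketch of how such counts could be computed from a list of scored PSMs, assuming a plain target-decoy setup where a higher score is better. The function names and the simple decoys/targets FDR estimate are illustrative assumptions, not the method used in the cited paper or in Percolator.

```python
import numpy as np


def qvalues(scores, is_decoy):
    """Target-decoy q-values for PSMs, assuming higher score = better."""
    scores = np.asarray(scores, dtype=float)
    is_decoy = np.asarray(is_decoy, dtype=bool)
    order = np.argsort(-scores, kind="stable")      # best PSM first
    decoys = np.cumsum(is_decoy[order])
    targets = np.cumsum(~is_decoy[order])
    fdr = decoys / np.maximum(targets, 1)           # simple decoys/targets estimate
    # q-value = lowest FDR at which this PSM would still be accepted
    q_sorted = np.minimum.accumulate(fdr[::-1])[::-1]
    q = np.empty_like(q_sorted)
    q[order] = q_sorted                             # back to input order
    return q


def psms_at_thresholds(scores, is_decoy, thresholds=(0.001, 0.01, 0.05)):
    """Count target PSMs accepted at each FDR threshold."""
    q = qvalues(scores, is_decoy)
    is_target = ~np.asarray(is_decoy, dtype=bool)
    return {t: int(np.sum(is_target & (q <= t))) for t in thresholds}


# Toy example with six PSMs, two of them decoys:
scores = [10, 9, 8, 7, 6, 5]
is_decoy = [False, False, False, True, False, True]
print(psms_at_thresholds(scores, is_decoy, thresholds=(0.01, 0.25)))
```

Computing these counts for each workflow combination and plotting them against the threshold gives the kind of curve described above.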
I'm a bit curious about
We can add more metrics like RMS of the Top 10. Feel free to do a PR to quantms utils https://github.com/bigbio/quantms-utils/blob/main/quantmsutils/features/snr.py
Did you check the feature weights of percolator for this feature? I would guess that the Comet Xcorr implicitly penalizes for high SNR: https://willfondrie.com/2019/02/an-intuitive-look-at-the-xcorr-score-function-in-proteomics/
It would be great to see how high the weights of the search engine scores and the predicted features are in Percolator!
PXD001819 Analysis
Currently, we have a workflow that can perform peptide identification using:
-> ms2rescore -> SNR + spectrum properties -> percolator
The results can be found here: https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD001819-id-ms2rescore/
Total number of PSMs
Comet only + Percolator: 495306
Comet + MSGF + Percolator: 572496 (15.58% increase)
Comet + MSGF + ms2rescore: 589200 (18.95% increase)
Comet + MSGF + (SNR + ms2rescore): 587972 (18.71% increase)
Comet + MSGF + SAGE + (SNR + ms2rescore): 592918 (19.68% increase)
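The relative increases are taken against the Comet-only baseline. A small sketch for recomputing them from the counts listed above, which is a quick way to check the quoted percentages (the helper name is hypothetical):

```python
# PSM counts copied from the list above; Comet only + Percolator is the baseline.
BASELINE = 495306
COUNTS = {
    "Comet + MSGF + Percolator": 572496,
    "Comet + MSGF + ms2rescore": 589200,
    "Comet + MSGF + (SNR + ms2rescore)": 587972,
    "Comet + MSGF + SAGE + (SNR + ms2rescore)": 592918,
}


def percent_increase(count, baseline=BASELINE):
    """Relative increase of `count` over `baseline`, in percent."""
    return 100.0 * (count - baseline) / baseline


for name, count in COUNTS.items():
    print(f"{name}: {count} ({percent_increase(count):.2f}% increase)")
```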
Total number of PSMs by RAW file and combination
The following questions would be interesting to understand: