bigbio / quantms

Quantitative mass spectrometry workflow. Currently supports proteomics experiments with complex experimental designs for DDA-LFQ, DDA-Isobaric and DIA-LFQ quantification.
https://quantms.org
MIT License

Benchmarking of the ID DDA workflow (ms2rescore, Percolator, SNR) #410

Open ypriverol opened 3 weeks ago

ypriverol commented 3 weeks ago

PXD001819 Analysis

Currently, we have a workflow that can perform peptide identification using:

- ms2rescore
- SNR + spectrum properties
- Percolator

The results can be found here: https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD001819-id-ms2rescore/

Total number of PSMs

- Comet only + Percolator: 495,306
- Comet + MSGF + Percolator: 572,496 (15.58% increase)
- Comet + MSGF + ms2rescore: 589,200 (18.95% increase)
- Comet + MSGF + (SNR + ms2rescore): 587,972 (18.71% increase)
- Comet + MSGF + SAGE + (SNR + ms2rescore): 592,918 (19.68% increase)
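For reference, the increases above are relative to the Comet-only + Percolator baseline; a quick sketch to recompute them (small differences from the reported percentages may come from rounding):

```python
# Recompute the relative PSM increases over the Comet-only baseline.
baseline = 495306  # Comet only + Percolator

counts = {
    "Comet + MSGF + Percolator": 572496,
    "Comet + MSGF + ms2rescore": 589200,
    "Comet + MSGF + (SNR + ms2rescore)": 587972,
    "Comet + MSGF + SAGE + (SNR + ms2rescore)": 592918,
}

for name, n in counts.items():
    pct = 100 * (n - baseline) / baseline
    print(f"{name}: {pct:.2f}% increase")
```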

psm_tools_plot

Total number of PSMs by RAW file and combination

psms_by_file_and_tool

Currently, the ms2rescore-alone combination yields the most PSM identifications, followed by ms2rescore + SNR.

The following questions would be interesting to understand:

ypriverol commented 3 weeks ago

PXD014415

Currently, we have a workflow that can perform peptide identification using:

- ms2rescore
- SNR + spectrum properties
- Percolator

Each of these combinations can be turned off. We used the dataset PXD014415 to benchmark the peptide identifications with some of the combinations:

The results can be found here: https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD014415-id-ms2rescore/

Combinations & PSM counts:

psm_tools_plot

Total number of PSMs by RAW file and combination

psms_by_file_and_tool

Currently, the combination of ms2rescore (Comet + MSGF + SAGE) and SNR yields the most PSM identifications.

jpfeuffer commented 3 weeks ago

What is "non-sage"? Comet?

ypriverol commented 3 weeks ago

Sorry, "non-sage" is Comet + MSGF.

jpfeuffer commented 3 weeks ago

Does Sage come on top or as a replacement?

How expensive is the snr/feature calculation? I think we could improve the pyopenms script if it is expensive.

Have you tried a more robust snr estimation? I.e. RMS of Top 10 / RMS of all? The max seems prone to outliers but seems to work well enough.

Have you planned any false positive evaluation (like ground truth, entrapment, cross-species)?

ypriverol commented 3 weeks ago

Does Sage come on top or as a replacement?

On top.

How expensive is the snr/feature calculation? I think we could improve the pyopenms script if it is expensive.

It is really fast, so there is no urgent need for improvements.

Have you tried a more robust snr estimation? I.e. RMS of Top 10 / RMS of all? The max seems prone to outliers but seems to work well enough.

We can add more metrics like RMS of the Top 10. Feel free to open a PR to quantms-utils: https://github.com/bigbio/quantms-utils/blob/main/quantmsutils/features/snr.py
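A minimal numpy sketch of both metrics, assuming `intensities` is a spectrum's peak-intensity array; the actual `snr.py` implementation in quantms-utils may differ in its details:

```python
import numpy as np

def snr_max(intensities):
    """Max-based SNR: most intense peak over the RMS of all peaks.
    (Illustrative; the quantms-utils implementation may differ.)"""
    a = np.asarray(intensities, dtype=float)
    rms_all = np.sqrt(np.mean(a ** 2))
    return float(a.max() / rms_all)

def snr_rms_top10(intensities):
    """More outlier-robust variant suggested above:
    RMS of the 10 most intense peaks over the RMS of all peaks."""
    a = np.sort(np.asarray(intensities, dtype=float))[::-1]
    top = a[:10]  # all peaks if the spectrum has fewer than 10
    return float(np.sqrt(np.mean(top ** 2)) / np.sqrt(np.mean(a ** 2)))
```

Since the RMS of the top peaks is never larger than the single maximum, the Top-10 variant is bounded by the max-based one and is less sensitive to a single outlier peak.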

Have you planned any false positive evaluation (like ground truth, entrapment, cross-species)?

I'm open to suggestions. I would love to evaluate whether this ~5% increase in PSMs affects the FDR in some way. I'm also open to suggestions on how to evaluate the difference between SNR + ms2rescore and ms2rescore alone. I have manually checked some IDs (in proteogenomics - https://www.biorxiv.org/content/10.1101/2024.05.24.595489v1), and I know that ms2rescore can rescue (identify) some low-quality spectra, which is the reason we added the SNR feature. It would be nice to have a benchmark to prove it.

ypriverol commented 3 weeks ago

I was reading today about MS Amanda + ms2rescore, and the reported % increase in PSMs is 6%.

RalfG commented 3 weeks ago

Any increase in #PSMs will depend on the type of dataset and search space. Generally, we see modest increases for simple searches (for instance the yeast UPS search above) and significant increases for difficult searches (46% for immunopeptidomics, 10.1016/j.mcpro.2022.100266).

Note that even with modest increases in sensitivity, the separation between true and false PSMs is expected to be better, which means that in most cases you could increase the specificity to 0.1% FDR without losing sensitivity (for instance shown in doi:10.1016/j.mcpro.2021.100076).

ypriverol commented 2 weeks ago

Thanks @RalfG for this response:

Any increase in #PSMs will depend on the type of dataset and search space. Generally, we see modest increases for simple searches (for instance the yeast UPS search above) and significant increases for difficult searches (46% for immunopeptidomics, 10.1016/j.mcpro.2022.100266).

Note that even with modest increases in sensitivity, the separation between true and false PSMs is expected to be better, which means that in most cases you could increase the specificity to 0.1% FDR without losing sensitivity (for instance shown in doi:10.1016/j.mcpro.2021.100076).

How do you test this? Distribution of the PEP scores or the original scores for targets and decoys?

RalfG commented 2 weeks ago

Usually just by plotting the number of confidently identified PSMs at each FDR threshold, as in Figure 1 of doi:10.1016/j.mcpro.2021.100076.
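A hedged sketch of that kind of evaluation, using the simple target-decoy FDR estimate (#decoys / #targets above a score cutoff). In practice Percolator already reports q-values, so you would plot those directly:

```python
import numpy as np

def psms_at_fdr(scores, is_decoy, thresholds=(0.001, 0.01, 0.05)):
    """Count target PSMs accepted at each FDR threshold using a
    basic target-decoy estimate (illustrative only)."""
    order = np.argsort(scores)[::-1]            # best score first
    decoy = np.asarray(is_decoy)[order]
    n_decoys = np.cumsum(decoy)
    n_targets = np.cumsum(~decoy)
    fdr = n_decoys / np.maximum(n_targets, 1)
    # q-value: the minimum FDR at which each PSM would be accepted.
    qvals = np.minimum.accumulate(fdr[::-1])[::-1]
    return {t: int(np.sum((qvals <= t) & ~decoy)) for t in thresholds}
```

Running this for each workflow combination and sweeping the threshold gives exactly the "identified PSMs vs. FDR" curve mentioned above.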

jonasscheid commented 2 days ago

I'm a bit curious about

We can add more metrics like RMS of the Top 10. Feel free to do a PR to quantms utils https://github.com/bigbio/quantms-utils/blob/main/quantmsutils/features/snr.py

Did you check the Percolator feature weights for this feature? I would guess that the Comet XCorr implicitly accounts for high SNR: https://willfondrie.com/2019/02/an-intuitive-look-at-the-xcorr-score-function-in-proteomics/

It would be great to see how high the weights for the search-engine scores / predicted features are in Percolator!
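Percolator can write its learned feature weights to a file with `-w`/`--weights`. A hypothetical sketch for ranking features by mean absolute weight; the exact file layout varies by Percolator version, so check the header of your own output before relying on this:

```python
import pandas as pd

def top_features(weights_path, n=10):
    """Rank Percolator features by mean absolute learned weight.

    Assumes a tab-separated file with feature names in the header row
    and one weight row per cross-validation fold/iteration (layout may
    differ between Percolator versions -- verify against your output).
    """
    df = pd.read_csv(weights_path, sep="\t")
    mean_abs = df.abs().mean().sort_values(ascending=False)
    return mean_abs.head(n)
```

Comparing the weight of the SNR feature against XCorr (and the ms2rescore-predicted features) would show directly whether SNR carries information the search-engine scores do not.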