bigbio / quantms

Quantitative mass spectrometry workflow.
MIT License

Test the impact and how the parameter num_hits works #344

Open ypriverol opened 6 months ago

ypriverol commented 6 months ago

Description of feature

It would be good to test the impact of the num_hits parameter on multiple datasets. The idea is to see how this parameter affects the identification step and the quantification results.

daichengxin commented 6 months ago

LFQ PXD001819 and TMT PXD007683 were tested using different num_hits values (1, 2 and 3).

LFQ results: As num_hits increased, the number of PSMs reported by the search engines increased, but the distribution of search engine scores showed no obvious change. Both target and decoy PSMs increased significantly for Comet and MSGF. However, most of the additional PSMs have worse PEP scores, so the final results did not improve with higher num_hits. Performance even dropped a little.


TMT results: consistent with the LFQ results.

jpfeuffer commented 6 months ago

If you are using multiple hits, you probably want some more sophisticated consensus scoring. E.g. PEPMatrix that takes into account the similarities of the top_hits across SEs and allows some kind of reweighting based on the number of times a sequence "scaffold" was identified across multiple engines. No guarantees that it gets better though 😁
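To make the reweighting idea concrete, here is a minimal sketch. It is not PEPMatrix and not anything quantms or OpenMS provides: it only counts how many engines report the same sequence for a spectrum (exact match, no sequence similarity) and ranks candidates by that support, then by best PEP. The function name, the tuple layout, and all sequences and PEP values are made up for illustration.

```python
from collections import defaultdict

def rank_candidates(hits_by_engine, spectrum):
    """Rank candidate sequences for one spectrum.

    hits_by_engine maps an engine name to a list of
    (spectrum_id, sequence, pep) tuples (hypothetical layout).
    Candidates are ordered by number of supporting engines
    (descending), then by their best PEP (ascending).
    """
    support = defaultdict(int)
    best_pep = {}
    for hits in hits_by_engine.values():
        seen = set()  # count each engine at most once per sequence
        for spec, seq, pep in hits:
            if spec != spectrum:
                continue
            if seq not in seen:
                seen.add(seq)
                support[seq] += 1
            best_pep[seq] = min(pep, best_pep.get(seq, 1.0))
    return sorted(support, key=lambda s: (-support[s], best_pep[s]))

# Toy input with two hits per engine (num_hits = 2); values are invented.
hits = {
    "comet": [("s1", "AAAK", 0.10), ("s1", "BBBK", 0.30)],
    "msgf": [("s1", "BBBK", 0.12), ("s1", "CCCK", 0.40)],
}
ranking = rank_candidates(hits, "s1")
print(ranking)
```

In this toy case the sequence reported by both engines is ranked first even though it is only the second hit in one of them, which is the kind of information a top-1-only consensus cannot use.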

jpfeuffer commented 6 months ago

Could also be used during feature linking but we do not have an algorithm for that yet. So no short-term improvements possible there.

jpfeuffer commented 6 months ago

One thing that surprises me a bit is that it gets worse. If we only take the best PSM per spectrum, adding second-best hits should not change anything. So maybe we are using more than just the best hit somewhere. If you upload a very small experiment, I can check it when I find time.
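A small sketch of that expectation, under stated assumptions: the PSM record and the best_per_spectrum helper are hypothetical (not quantms or OpenMS code), the values are invented, and the PEP of each top hit is assumed to stay the same when extra hits are reported.

```python
from typing import NamedTuple

class PSM(NamedTuple):
    spectrum_id: str
    peptide: str
    pep: float  # posterior error probability, lower is better
    rank: int   # rank reported by the search engine

def best_per_spectrum(psms):
    """Keep only the PSM with the lowest PEP for each spectrum."""
    best = {}
    for psm in psms:
        current = best.get(psm.spectrum_id)
        if current is None or psm.pep < current.pep:
            best[psm.spectrum_id] = psm
    return best

# num_hits = 1: one PSM per spectrum (toy values).
top1 = [
    PSM("s1", "PEPTIDEK", 0.01, 1),
    PSM("s2", "ELVISK", 0.20, 1),
]
# num_hits = 2: the same top hits plus worse-scoring second hits.
top2 = top1 + [
    PSM("s1", "PEPTLDEK", 0.35, 2),
    PSM("s2", "LIVESK", 0.60, 2),
]

same = best_per_spectrum(top1) == best_per_spectrum(top2)
print(same)
```

If the filter works like this and the top-hit scores are unchanged, the selected PSMs are identical for both settings. A difference in the final results would then point either to a step that consumes lower-ranked hits or to scores that shift once the extra hits are included (for example, if PEPs are re-estimated from all reported PSMs).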