OpenMS / pyopenms-docs

pyOpenMS readthedocs documentation, additional utilities, addons, scripts, and examples.
https://pyopenms.readthedocs.io

Add information on hyperscore from one of our past lectures #364

Closed timosachsenberg closed 1 year ago

timosachsenberg commented 1 year ago

http://www.inf.fu-berlin.de/lehre/WS14/ProteomicsWS14/LUS/lu7b/427/index.html

timosachsenberg commented 1 year ago

X!Tandem Original paper: Craig,R. and Beavis,R.C. (2003) Rapid Commun. Mass Spectrom., 17, 2310–2316.

The program itself and more can be found here: http://www.thegpm.org/tandem/ . OpenMS uses the latest release: SLEDGEHAMMER (2013.09.01).

X!Tandem's dot product

To find overlapping masses, a maximal fragment mass tolerance window needs to be set (for ion traps this is usually 0.5 Da). X!Tandem reduces the experimental spectrum to only those peaks that match peaks in the theoretical spectrum and then calculates the dot product (dp) from the ion intensities and the number of matching ions:

$dp = \sum_{i=0}^{n} I_i P_i$

where $I_i$ are the fragment ion intensities from the experimental spectrum and $P_i \in \{0, 1\}$ indicates whether the corresponding fragment is predicted (1) or not present (0) in the theoretical spectrum.

Survival function and E-value

Let x represent the dot product for the experimental spectrum S and the theoretical spectrum T. The score distribution p(x) is calculated from the frequency histogram (counts of PSMs per score bin), where f(x) is the number of PSMs that are given the score x:

$p(x) = \frac{f(x)}{N}$

with N the total number of PSMs.

Then we can define the survival function in this context. The survival function s(x) for a discrete stochastic score probability distribution p(x) is defined as:

$s(x) = P(X > x) = \sum_{x' > x} p(x')$

where P(X > x) is the probability of obtaining a value greater than x by random matches in a database.

With the survival function s(x), we can calculate the E-value e(x), indicating the number of PSMs that are expected to have scores of x or better:

$e(x) = n \cdot s(x)$

where n is the number of sequences. Each PSM in the output can be ranked according to e(x).

X!Tandem Hyperscore

The scoring scheme in X!Tandem is the so-called hyperscore (HS). It is calculated by multiplying the dot product by the factorials of the numbers of assigned b and y ions, $N_b$ and $N_y$. The use of the factorials is based on the hypergeometric distribution that is assumed for matches of product ions.

$HS = \left(\sum_{i=0}^{n} I_i P_i\right) \cdot N_b! \cdot N_y!$

If p(x) is now plotted as a function of log(hyperscore), the valid PSM is much better separated from the bulk of incorrect assignments (as shown below).

image
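As a rough illustration of the dot product and hyperscore formulas above, here is a minimal pure-Python sketch. It is not the X!Tandem or OpenMS implementation: the function name `toy_hyperscore`, the peak lists, and the rule of taking the most intense observed peak inside the tolerance window are all assumptions made for this example.

```python
import math

def toy_hyperscore(observed, theoretical, tolerance_da=0.5):
    """Toy hyperscore following HS = (sum_i I_i * P_i) * N_b! * N_y!.

    observed:    list of (mz, intensity) tuples from the experimental spectrum
    theoretical: list of (mz, ion_type) tuples, ion_type is "b" or "y"
    Assumption: a theoretical fragment counts as matched (P_i = 1) if at least
    one observed peak lies within +/- tolerance_da; the most intense such peak
    contributes its intensity I_i.
    """
    dot_product = 0.0
    n_b = 0
    n_y = 0
    for theo_mz, ion_type in theoretical:
        candidates = [inten for mz, inten in observed
                      if abs(mz - theo_mz) <= tolerance_da]
        if not candidates:
            continue  # P_i = 0, fragment contributes nothing
        dot_product += max(candidates)
        if ion_type == "b":
            n_b += 1
        elif ion_type == "y":
            n_y += 1
    hs = dot_product * math.factorial(n_b) * math.factorial(n_y)
    return dot_product, n_b, n_y, hs

# Made-up peaks, for illustration only.
observed = [(175.1, 10.0), (227.2, 15.0), (276.2, 20.0),
            (300.0, 5.0), (389.3, 30.0)]
theoretical = [(175.12, "y"), (276.17, "y"), (389.25, "y"),
               (114.09, "b"), (227.18, "b")]

dp, n_b, n_y, hs = toy_hyperscore(observed, theoretical)
log_hs = math.log(hs)  # the log scale is what separates valid from random PSMs
print(dp, n_b, n_y, hs, log_hs)
```

In this toy case three y ions and one b ion match, so the dot product of 75 is multiplied by 3! · 1! = 6. The factorial terms grow quickly with the number of matched ions, which is why the scores are usually compared on a log scale.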

timosachsenberg commented 1 year ago

The HyperScore is a scoring scheme used in X!Tandem (Craig,R. and Beavis,R.C. (2003) Rapid Commun. Mass Spectrom., 17, 2310–2316.) to evaluate the quality of peptide-spectrum matches (PSMs).

The score is composed of a sum of the log factorials of matched b and y ions (under the assumption of a hypergeometric distribution, hence the name HyperScore, for random matches of fragment ions). The sum of matched intensities (= dot product of observed and theoretical intensities) is added to the log factorials as a tie-breaker to discriminate between potentially many PSMs with the same number of matched b and y ions.
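The link between this description and the product form $HS = \left(\sum_{i=0}^{n} I_i P_i\right) \cdot N_b! \cdot N_y!$ given earlier in the thread can be seen by taking the logarithm. This is only a derivation sketch; whether a given implementation applies a plain logarithm or a variant of it to the intensity term is not stated here.

$$\log HS = \log\left(\sum_{i=0}^{n} I_i P_i\right) + \log(N_b!) + \log(N_y!)$$

The two factorial terms depend only on the ion counts $N_b$ and $N_y$, so PSMs with identical counts differ only in the intensity term.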

In OpenMS, the hyperscore function expects a maximal fragment mass tolerance window, the error unit (m/z or ppm), the observed spectrum, and the theoretical spectrum (e.g., as generated by TheoreticalSpectrumGenerator). The function then calculates and returns the HyperScore.

Details: In the original publication, an E-value is calculated based on the score distribution p(x), which is derived from a frequency histogram of PSMs per score bin, denoted as f(x). The total number of PSMs is represented by N. The formula for calculating the score distribution is: p(x) = f(x) / N

For a discrete stochastic score probability distribution p(x), the so-called survival function represents the probability of having a greater value than x by random matches in a database. The formula for the survival function is:

$s(x) = P(X > x) = \sum_{x' > x} p(x')$

To estimate the number of PSMs expected to have scores of x or better, one can calculate an E-value e(x) = n * s(x)

Here, n represents the number of sequences.
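The three formulas above can be read directly as code. The following is a minimal sketch that builds the empirical p(x) from a list of already-binned scores and evaluates s(x) and e(x). The function names and the example scores are hypothetical, and this literal reading of the formulas is not claimed to match what X!Tandem does internally.

```python
from collections import Counter

def survival_function(scores):
    """Return {x: s(x)} with s(x) = sum of p(x') over all x' > x.

    scores: binned PSM scores (one entry per PSM), so that
    f(x) = number of PSMs with score x and p(x) = f(x) / N.
    """
    n_total = len(scores)          # N, total number of PSMs
    counts = Counter(scores)       # f(x), the frequency histogram
    survival = {}
    n_greater = 0                  # number of PSMs with a score above x
    for x in sorted(counts, reverse=True):
        survival[x] = n_greater / n_total
        n_greater += counts[x]
    return survival

def e_value(x, survival, n_sequences):
    """E-value e(x) = n * s(x), with n the number of sequences."""
    return n_sequences * survival[x]

# Made-up binned scores for ten PSMs, for illustration only.
scores = [1, 1, 1, 2, 2, 3, 3, 3, 3, 5]
s = survival_function(scores)
e3 = e_value(3, s, n_sequences=1000)
print(s, e3)
```

With these scores only one of ten PSMs scores above 3, so s(3) is 0.1 and the E-value for 1000 sequences is about 100. Note that the top score gets s(x) = 0 under the strict "greater than" definition.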

By ranking each PSM in the output according to its E-value, the significance of individual hits is taken into account. This functionality is currently not implemented in OpenMS.