MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

Spectral DAG model question #55

Closed mrForce closed 5 years ago

mrForce commented 5 years ago

This isn't a bug, but I have some questions about the inner workings of the MS-GF+ database search, and I figured this would be the best place to ask. If it's not, please direct me elsewhere.

Looking at one of the recent MSGF+ related publications (see: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5036525), it looks like MSGF+ computes a spectral DAG G for each spectrum, and then computes log Pr(G | P)/Pr(G | 0) for each spectrum-peptide pair (in which the spectrum and peptide have the same precursor mass), where Pr(G | P) is computed using parameters derived from the training data. Is this still correct? And is the MS-GF:RawScore field in the mzid output this value? Is there any way to recover Pr(G | P) by itself?

Thanks,

Jordan

sangtaekim commented 5 years ago

Hi @mrForce,

Is this still correct? Yes.

And is the MS-GF:RawScore field in the mzid output this value? Yes.

Is there any way to recover Pr(G | P) by itself? No. MS-GF+ doesn't calculate Pr(G|P) and Pr(G|O) separately because calculating Pr(G|P)/Pr(G|O) is simpler and faster.

Best,

Sangtae

mrForce commented 5 years ago

Thanks for getting back to me so quickly! Do you think it would be feasible to modify MS-GF+ to compute (and output) Pr(G | P)?

sangtaekim commented 5 years ago

Theoretically, you can do it, but it will be a lot of work. Thanks to Pr(G|O), MS-GF+ calculates Pr(Gi|Pi) only when Pi=1 (because Pr(Gi|Pi=0)/Pr(Gi|O) = 1), which reduces the number of calculations by about 100x. Even if you manage to calculate Pr(G|P), it will make MS-GF+ significantly slower.
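To illustrate the point about skipping the Pi=0 terms, here is a minimal toy sketch. It is not MS-GF+ code: the independent per-mass-bin model, the two probabilities, and all function names are assumptions made up for illustration. It only shows that when the noise model and the null model agree, the log-likelihood ratio needs a sum over the peptide's ion masses, while log Pr(G|P) alone needs a sum over every integer mass.

```python
import math

# Hypothetical toy parameters, NOT values trained by MS-GF+.
P_PEAK_GIVEN_ION = 0.7     # Pr(G_i = 1 | P_i = 1)
P_PEAK_GIVEN_NOISE = 0.05  # Pr(G_i = 1 | P_i = 0), assumed equal to Pr(G_i = 1 | O)


def log_prob(observed, p_peak):
    """Log-probability of seeing (or not seeing) a peak in one mass bin."""
    return math.log(p_peak if observed else 1.0 - p_peak)


def log_likelihood_ratio(peaks, ion_masses):
    """log Pr(G|P)/Pr(G|O): only bins with P_i = 1 contribute a nonzero term."""
    return sum(
        log_prob(m in peaks, P_PEAK_GIVEN_ION)
        - log_prob(m in peaks, P_PEAK_GIVEN_NOISE)
        for m in ion_masses
    )


def log_likelihood_full(peaks, ion_masses, max_mass):
    """log Pr(G|P): every integer mass bin has to be visited."""
    total = 0.0
    for m in range(max_mass):
        p = P_PEAK_GIVEN_ION if m in ion_masses else P_PEAK_GIVEN_NOISE
        total += log_prob(m in peaks, p)
    return total


def log_likelihood_null(peaks, max_mass):
    """log Pr(G|O) under the toy null model."""
    return sum(log_prob(m in peaks, P_PEAK_GIVEN_NOISE) for m in range(max_mass))


if __name__ == "__main__":
    peaks = {100, 200, 300, 450}   # integer masses with an observed peak
    ions = {100, 200, 250}         # integer masses predicted by the peptide
    ratio = log_likelihood_ratio(peaks, ions)
    full = log_likelihood_full(peaks, ions, 1000)
    null = log_likelihood_null(peaks, 1000)
    print(ratio, full - null)      # the two values agree up to rounding
```

In this toy setup the ratio touches 3 bins and the full likelihood touches 1000, which is the kind of gap the comment above describes.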

mrForce commented 5 years ago

Thanks!

mrForce commented 5 years ago

I closed this since my original question was answered, but I looked at the MSGF+ code, and I have some more questions. I ran a debugger, and it appears that the FastScorer class is used for computing the scores. However, the comment above the declaration of FastScorer says that it ignores edges. The DBScanScorer extends FastScorer, but does consider edges. It looks like MSGF+ uses FastScorer by default (when I ran it, the DBScanScorer constructor was never called); why is this?

It looks like all that would be needed to change MSGF+ to compute Pr(S = s | P = p) would be to change the getScore function in the FastScorer class to sum across all integer masses, instead of just the ions that are in the spectrum, and also change the getScore function in DBScanScorer. But, since you said it would be a lot of work, I'm assuming that I'm missing something here?

sangtaekim commented 5 years ago

The spectral DAG model was introduced to take advantage of high-precision data, and thus DBScanScorer is used to process high-precision spectra (e.g. -inst 1) only. For low-precision spectra, the simpler and faster FastScorer is still used.

Regarding the second paragraph, yes, you just need to sum across all integer masses. I wrote that code long ago, and my estimate of the amount of work could be very wrong.

mrForce commented 5 years ago

Ah, okay, that makes sense. Thanks. Alternatively, I could probably just compute log Pr(S | 0) for each spectrum, and then add it to the MSGF+ score to get approximately log Pr(S | P).

By the way, with respect to your scoring function log(Pr(S | P)/Pr(S | 0)), was the intention here to approximate the posterior probability? I ask, because if we treat Pr(S | 0) as Pr(S), and assume all peptides are equally likely, then Pr(S | P)/Pr(S | 0) is proportional to Pr(P | S) (which seems like the most sensible scoring function to me).
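Written out, the argument in this comment is a direct application of Bayes' rule. The block below is only a restatement under the two assumptions named in the comment (Pr(S) is approximated by Pr(S | 0), and every candidate peptide has the same prior Pr(P) = c); the constant c is a symbol introduced here for illustration.

```latex
\[
\Pr(P \mid S)
  = \frac{\Pr(S \mid P)\,\Pr(P)}{\Pr(S)}
  \approx c \cdot \frac{\Pr(S \mid P)}{\Pr(S \mid 0)}
\]
\[
\log \Pr(S \mid P)
  = \underbrace{\log \frac{\Pr(S \mid P)}{\Pr(S \mid 0)}}_{\text{raw score}}
  + \log \Pr(S \mid 0)
\]
```

The second line is why adding log Pr(S | 0), which depends only on the spectrum, to the raw score would give log Pr(S | P).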

sangtaekim commented 5 years ago

No, it's a likelihood ratio. Adding Pr(S|0) also makes the "raw" score better calibrated across peptide-spectrum matches from multiple spectra.

MS-GF+ reports the SpecEValue (sort of an E-value) instead of posterior probabilities. The raw score is just an internal score used to calculate SpecEValue.
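For readers who want to look at both values side by side, here is a small sketch that pulls the raw score and the SpecEValue out of mzid-style XML using only the Python standard library. The XML fragment is a made-up, heavily trimmed sample (real mzid files carry many more elements and attributes), and the function names are hypothetical; the cvParam names MS-GF:RawScore and MS-GF:SpecEValue are assumed to match what MS-GF+ writes.

```python
import xml.etree.ElementTree as ET

# Made-up, trimmed fragment for illustration only; not real MS-GF+ output.
SAMPLE = """<SpectrumIdentificationResult
    xmlns="http://psidev.info/psi/pi/mzIdentML/1.1" spectrumID="index=0">
  <SpectrumIdentificationItem id="SII_1_1" rank="1">
    <cvParam name="MS-GF:RawScore" value="42"/>
    <cvParam name="MS-GF:SpecEValue" value="1.5e-12"/>
  </SpectrumIdentificationItem>
</SpectrumIdentificationResult>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_scores(xml_text):
    """Return one dict per SpectrumIdentificationItem with both score values."""
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root.iter():
        if local_name(elem.tag) != "SpectrumIdentificationItem":
            continue
        # Collect the name/value pairs of the item's direct cvParam children.
        params = {
            child.get("name"): child.get("value")
            for child in elem
            if local_name(child.tag) == "cvParam"
        }
        rows.append(
            {
                "id": elem.get("id"),
                "raw_score": float(params["MS-GF:RawScore"]),
                "spec_evalue": float(params["MS-GF:SpecEValue"]),
            }
        )
    return rows


if __name__ == "__main__":
    for row in extract_scores(SAMPLE):
        print(row)
```

Matching on the local tag name avoids hard-coding the mzIdentML namespace version, which may differ between files.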

mrForce commented 5 years ago

Ah, okay. When using Percolator (which, in a nutshell, uses an SVM to re-score matches), I've been inputting both E-values and raw scores (among other values). Do you think I would get better performance if I switched to using just the E-values?

sangtaekim commented 5 years ago

No. Have you read https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3975676/?

mrForce commented 5 years ago

Yeah, I did, a long time ago, but I forgot about the Feature Analysis (and table #3) part.