MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

A question about the FDR and Q-Value of the MS-GF+ output #131

Open Sweetsour-crap opened 2 years ago

Sweetsour-crap commented 2 years ago

I have some questions about the “QValue” in the output report of the software.

In the documentation, you mentioned that “QValue is defined as the minimum false discovery rate (FDR) at which the test may be called significant”. But in the formula of the QValue, the documentation says:

I also found a plot of the relationship between FDR and QValue in another paper, attached below (from Käll, L., Storey, J. D., MacCoss, M. J., & Noble, W. S. (2008). Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. Journal of Proteome Research, 7(1), 29–34).

[Image: fdr and qvalue]

So I want to know whether my understanding of FDR and QValue is right. If so, could you tell me exactly which quantity is reported in the MS-GF+ output: FDR or QValue? If my understanding is wrong, I would also like to know the exact interpretation of QValue in the output report.

I searched the website documentation and two MS-GF+ papers but still cannot find an answer, so I decided to ask here. I am sorry for taking up your time.

FarmGeek4Life commented 2 years ago

I can't give a precise answer on the exact definition of "QValue" as used in MS-GF+. In practice, the MS-GF+ QValue behaves very similarly to the QValue shown in that image, and it is based on the FDR ratio also mentioned in that paper. I believe the MS-GF+ QValue may be calculated in a manner matching the "estimated QValue" mentioned there.

Specifically: to compute the MS-GF+ QValue, the target and decoy results are combined, and only the PSM with the best (lowest) SpecEValue for each spectrum is kept. These are then sorted from best to worst SpecEValue, and FDR is calculated for each of these PSMs (at that SpecEValue threshold). Then the FDR values for all PSMs are processed to create the QValue, which is the highest computed FDR value found among the assigned PSM and any PSM with a better SpecEValue.
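
To make those steps concrete, here is a minimal Python sketch. It follows the description in this comment and is not checked against the MS-GF+ source, so treat it as an illustration only. Assumptions: the `PSM` type and all function names are hypothetical, a lower SpecEValue is better, and FDR at a threshold is estimated as decoys / targets among the PSMs kept. For comparison, the last function applies the "minimum FDR" definition quoted in the question, which takes a running minimum from the worst-scoring end instead.

```python
from typing import Dict, List, NamedTuple


class PSM(NamedTuple):
    """Hypothetical record for one peptide-spectrum match."""
    spectrum_id: str
    spec_evalue: float  # assumption: lower is better
    is_decoy: bool


def best_psm_per_spectrum(psms: List[PSM]) -> List[PSM]:
    """Keep the best (lowest SpecEValue) PSM per spectrum, sorted best to worst."""
    best: Dict[str, PSM] = {}
    for psm in psms:
        current = best.get(psm.spectrum_id)
        if current is None or psm.spec_evalue < current.spec_evalue:
            best[psm.spectrum_id] = psm
    return sorted(best.values(), key=lambda p: p.spec_evalue)


def fdr_at_each_psm(ranked: List[PSM]) -> List[float]:
    """FDR estimate (decoys / targets) at the threshold set by each ranked PSM."""
    fdrs: List[float] = []
    targets = 0
    decoys = 0
    for psm in ranked:
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        # With no targets accepted yet, fall back to 1.0 (an assumption).
        fdrs.append(decoys / targets if targets else 1.0)
    return fdrs


def running_max_from_best(fdrs: List[float]) -> List[float]:
    """As described in this comment: highest FDR at this PSM or any better one."""
    out: List[float] = []
    highest = 0.0
    for fdr in fdrs:
        highest = max(highest, fdr)
        out.append(highest)
    return out


def running_min_from_worst(fdrs: List[float]) -> List[float]:
    """'Minimum FDR' definition: lowest FDR at this PSM or any worse one."""
    out = [0.0] * len(fdrs)
    lowest = float("inf")
    for i in range(len(fdrs) - 1, -1, -1):
        lowest = min(lowest, fdrs[i])
        out[i] = lowest
    return out
```

Calling `best_psm_per_spectrum` on the combined target and decoy PSMs, then `fdr_at_each_psm` on the result, gives the raw FDR curve. The two monotone versions differ only where the raw FDR goes down again after a decoy hit, which is the zig-zag visible in the plot from the paper.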

If you want to try to understand the calculations in the code yourself, the following files/lines are good places to start:

https://github.com/MSGFPlus/msgfplus/blob/master/src/main/java/edu/ucsd/msjava/fdr/ComputeFDR.java#L272

https://github.com/MSGFPlus/msgfplus/blob/master/src/main/java/edu/ucsd/msjava/fdr/TargetDecoyAnalysis.java#L160

alchemistmatt commented 2 years ago

I would treat Q Value as an estimate of FDR. There is no such thing as absolute FDR in proteomics. To see how Q Value is calculated, please download these Excel files:

The formulas in those files show how Q Value is computed.