fgcz / prolfqua

Differential Expression Analysis tool box R lang package for omics data
https://pubs.acs.org/doi/pdf/10.1021/acs.jproteome.2c00441
MIT License

add ropeca type protein level p-value calculation #7

Closed wolski closed 3 years ago

wolski commented 5 years ago

For protein-level inference of differential expression, the median of peptide-level p-values is used as a score for each protein, taking the direction of change into account. The protein-level significance of the detection is then calculated using the beta distribution. Under the null hypothesis, the p-values of the peptides follow the uniform distribution U(0,1). Furthermore, the order statistics from the U(0,1) distribution follow a beta distribution. More specifically, the i-th order statistic of a sample of size n has a beta distribution B(gamma,delta) with parameters gamma = i and delta = n − i + 1. The significance of the median p-value for a protein with n peptides is hence calculated using the cumulative distribution function of the beta distribution:

$$ p = \int_0^{P_m} f(x; \gamma, \delta) \, dx $$

where f(x; gamma, delta) is the probability density function of B(gamma,delta) and P_m is the observed median p-value of peptides belonging to the protein. Finally, the FDR is calculated using the Benjamini-Hochberg procedure.
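The calculation described above can be sketched in a few lines. This is a minimal illustration, not the prolfqua or ROPECA implementation: the function names are hypothetical, only odd n is handled (so that the median is a single order statistic), and the Beta(i, n − i + 1) CDF is evaluated through the equivalent binomial tail sum so that no external library is needed.

```python
from math import comb
from statistics import median


def order_statistic_cdf(x, i, n):
    """P(U_(i) <= x) for the i-th order statistic of n i.i.d. U(0,1) draws.

    This equals the Beta(i, n - i + 1) CDF at x; for integer parameters it
    reduces to a binomial tail sum.
    """
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(i, n + 1))


def protein_p_value(peptide_p_values):
    """Significance of the median peptide p-value (assumption: odd n only)."""
    n = len(peptide_p_values)
    if n % 2 == 0:
        raise ValueError("this sketch handles odd n only")
    p_m = median(peptide_p_values)
    i = (n + 1) // 2  # the median is the ((n + 1) / 2)-th order statistic
    return order_statistic_cdf(p_m, i, n)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Walk from the largest p-value down, keeping a running minimum.
    order = sorted(range(m), key=lambda k: p_values[k], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, k in enumerate(order):
        rank = m - rank_from_top  # 1-based rank in ascending order
        running_min = min(running_min, p_values[k] * m / rank)
        adjusted[k] = running_min
    return adjusted
```

For a protein with seven peptides whose median p-value is 0.5, `order_statistic_cdf(0.5, 4, 7)` returns 0.5, as expected from the symmetry of Beta(4, 4).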

wolski commented 5 years ago

Dear Laura,

Hope you had a great weekend.

I am trying to understand some details from the article PMC5517573, specifically:

"For protein-level inference of differential expression, the median of peptide-level p-values is used as a score for each protein taking the direction of change into account. "

How is the direction of change taken into account by taking the median of the p-values?

Are you referring to the median p-value, or to the p-value of the peptide with the median effect size? "Median of peptide-level p-values" clearly points to the former, but taking the direction of change into account suggests the latter (how else would the direction of change be taken into account?).

Furthermore, could you confirm that, when the i-th order statistic is the median, gamma = delta and gamma = N/2?

Thank you in advance for your help. Have a great evening.

Best regards,
Witek
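As a short worked note on the parameter question above (standard order-statistic facts, not taken from the reply in this thread): with gamma = i and delta = n − i + 1, the two parameters coincide exactly when i is the middle index, which exists only for odd n.

```latex
% Assumption: n is odd, so the median is a single order statistic.
\begin{align*}
  i &= \frac{n+1}{2}, \\
  \gamma &= i = \frac{n+1}{2}, \\
  \delta &= n - i + 1 = n - \frac{n+1}{2} + 1 = \frac{n+1}{2}.
\end{align*}
% Hence \gamma = \delta = (n+1)/2 rather than n/2.
% For even n the sample median averages two order statistics and is
% not itself Beta(i, n-i+1) distributed, so a convention is needed.
```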

wolski commented 5 years ago

I contacted you a few days ago (10) to ask about the following sentence.

"For protein-level inference of differential expression, the median of peptide-level p-values is used as a score for each protein taking the direction of change into account. " What do you mean by the direction of change? I interpret it as the peptide-level fold change. But how do you take it into account? Based on this sentence, I cannot work out which p-value is taken, given for instance this example:

peptide-level fold change: -2, -1, -1, -0.5, 1, 2, 3
peptide-level p-value:     0.01, 0.05, 0.1, 0.5, 0.1, 0.05, 0.01

And I do observe contradicting peptide fold changes for many proteins.

Sorry for bothering you again, but a brief answer would be most helpful.

Best regards Witek

wolski commented 5 years ago

Dear Witek,

This has been worded a bit poorly in the manuscript, but the idea is to order the peptide-level statistics from most significantly downregulated to most significantly upregulated. In practice, either the t-statistic is used as such or the p-values are scaled based on the sign of the fold change. Their median becomes the initial score. Thus, the more agreement there is at the peptide level, the higher the chances are that the score is better (i.e. the "median" of p-values is smaller).

To follow your example with p-values:

LogFC:    -2, -1, -1, -0.5, 1, 2, 3
p-values: 0.01, 0.05, 0.1, 0.5, 0.1, 0.05, 0.01
scaled p: -0.99, -0.95, -0.9, -0.5, 0.9, 0.95, 0.99

The median of the scaled values is -0.5, thus the p-value to be used as the initial score is 0.5.

I hope this clarifies.

Best, Tomi
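The scaling in the reply above can be reproduced with a small sketch. The rule used here, sign(logFC) × (1 − p), is inferred from the example numbers rather than stated explicitly in the reply, and the function names are hypothetical; treating a fold change of exactly zero as positive is likewise an assumption.

```python
from statistics import median


def signed_scale(p_value, log_fc):
    """Map a p-value to [-1, 1]: sign from the fold change, magnitude 1 - p.

    Assumption: a fold change of exactly 0 is treated as positive.
    """
    sign = 1.0 if log_fc >= 0 else -1.0
    return sign * (1.0 - p_value)


def median_score(log_fcs, p_values):
    """Return (median of scaled values, p-value used as the initial score)."""
    scaled = [signed_scale(p, fc) for fc, p in zip(log_fcs, p_values)]
    m = median(scaled)
    # Undo the scaling: the magnitude of the median maps back to a p-value.
    return m, 1.0 - abs(m)


log_fcs = [-2, -1, -1, -0.5, 1, 2, 3]
p_values = [0.01, 0.05, 0.1, 0.5, 0.1, 0.05, 0.01]
scaled_median, initial_score = median_score(log_fcs, p_values)
print(scaled_median, initial_score)  # -0.5 0.5
```

Because the peptides disagree on direction, the median lands on the weakest peptide and the initial score stays at 0.5; if all seven peptides pointed the same way, the median of the scaled values would be 0.95 in magnitude and the score 0.05.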

wolski commented 3 years ago

ROPECA implemented