francois-a / fastqtl

Enhanced version of the FastQTL QTL mapper
GNU General Public License v3.0

Deriving beta-adjusted p-values from nominal p-value and shape parameters #20

Open kwcurrin opened 2 years ago

kwcurrin commented 2 years ago

Hello,

Is it possible to obtain the beta-approximated p-values from the nominal p-value and the two shape parameters via the pbeta function in R? I tried this for my data, but the p-values don't match. For example, pbeta(0.00702837, 0.976548, 44.2766) gives 0.2788137, but the beta-adjusted p-value is listed as 0.342138. Do I need to set something differently in the R function to get it to match?
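For anyone reproducing this outside R: scipy's beta distribution gives the same CDF as R's pbeta, so the mismatch above can be checked with a short sketch (using the numbers from the example):

```python
from scipy.stats import beta

# Values from the example above: pbeta(p_nominal, shape1, shape2)
p_nominal = 0.00702837
shape1, shape2 = 0.976548, 44.2766

# beta.cdf is the scipy equivalent of R's pbeta
p_beta = beta.cdf(p_nominal, shape1, shape2)
print(p_beta)  # ~0.2788, not the 0.342138 reported by FastQTL
```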

I noticed this issue when I tried to identify all significant gene-variant pairs: using the pval_beta threshold corresponding to FDR < 5% and deriving per-gene nominal p-value thresholds with qbeta, one gene dropped off because its best nominal p-value was above the threshold derived with qbeta. I think this could be related to the difference in beta-adjusted p-values described above.

Do you have any suggestions for adjusting the pbeta or qbeta function parameters to get the results to match?

Thanks,

Kevin

kwcurrin commented 2 years ago

Hello,

A colleague found that this issue results from different df values being used in the calculation of pval_beta and pval_nominal in analysisPermutation.cpp. The pval_nominal reported in the FastQTL output is calculated using "df" (sample size - 2 - number of covariates). When pval_beta is calculated, however, pval_nominal is first recomputed using true_df, which is a learned value, and that recomputed p-value is then converted to pval_beta. We both confirmed that if we calculate a nominal p-value from the reported Pearson correlation using true_df instead of df, and then pass it to pbeta(pval_nominal, shape1, shape2), the result matches the pval_beta reported in the FastQTL output.
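The recomputation described above can be sketched in Python, using the standard t-statistic conversion for a Pearson correlation and scipy's beta CDF in place of pbeta. The correlation, true_df, and shape parameters below are hypothetical placeholders, not values from a real FastQTL run:

```python
import numpy as np
from scipy.stats import t as t_dist, beta

def pval_from_corr(r, dof):
    """Two-sided p-value for a Pearson correlation r at dof degrees of freedom
    (dof may be non-integer, as true_df is a learned, continuous value)."""
    tstat = abs(r) * np.sqrt(dof / (1.0 - r * r))
    return 2.0 * t_dist.sf(tstat, dof)

# Hypothetical values for illustration only
r = 0.45                       # reported Pearson correlation
true_df = 180.3                # learned degrees of freedom
shape1, shape2 = 1.02, 350.5   # fitted beta shape parameters

# Recompute the nominal p-value with true_df, then convert to the beta scale
p_nominal_true = pval_from_corr(r, true_df)
p_beta = beta.cdf(p_nominal_true, shape1, shape2)
```

The key point is that `p_nominal_true` (computed with true_df) differs from the reported pval_nominal (computed with df), and only the former reproduces the reported pval_beta.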

This difference in df values causes an issue when identifying all significant variant-gene pairs using qbeta-derived per-gene nominal thresholds. Although true_df is often smaller than df, it is sometimes sufficiently larger that the nominal p-value threshold calculated with qbeta falls below the best variant's nominal p-value for that gene, even though the gene met the FDR < 5% cutoff based on pval_beta calculated from true_df. In my case, this only happened for one gene.
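The per-gene threshold step described above can be sketched as follows; beta.ppf is the scipy equivalent of R's qbeta, and the FDR cutoff and shape parameters are hypothetical values for illustration:

```python
from scipy.stats import beta

# Hypothetical inputs for one gene
pval_beta_cutoff = 0.12        # largest pval_beta still passing FDR < 5%
shape1, shape2 = 1.05, 420.0   # fitted beta shape parameters for this gene

# Invert the beta CDF (qbeta) to map the FDR cutoff back to a nominal
# p-value threshold on the scale the beta distribution was fit to
nominal_threshold = beta.ppf(pval_beta_cutoff, shape1, shape2)

# Round trip: cdf(ppf(x)) recovers x
assert abs(beta.cdf(nominal_threshold, shape1, shape2) - pval_beta_cutoff) < 1e-10
```

Because this threshold lives on the true_df scale while the reported pval_nominal is on the df scale, comparing the two directly is exactly where the mismatch arises.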

Another potential issue is that when true_df is much smaller than df, some variants that don't actually meet the gene-level nominal threshold will be classified as significant, because the reported pval_nominal (calculated with df) is more significant than it would be with true_df. Conversely, when true_df is larger, fewer variants will pass the nominal threshold.

I also noticed that sometimes true_df is equal to or slightly higher than sample size. This is rare, but still happens.

Would you be able to advise on whether it is better to use df or true_df?

Thanks,

Kevin

francois-a commented 2 years ago

That's correct, you need to use true_df for this calculation (see here). But a gene missing from the significant pairs list is more likely due to how the threshold is calculated, discussed here.

kwcurrin commented 2 years ago

Hi Francois,

Thank you for your response. It appears that annotate_outputs.py compares the nominal p-values (calculated with df) to the nominal p-value threshold (calculated with true_df). Is this correct?

Would you be able to point me to a description of what "true_df" means conceptually? I looked through the C++ code for learnDF but was unable to understand exactly what the function is doing. What is the benefit of using true_df vs. df?

Thanks!

Kevin