Closed xiyasong closed 2 years ago
Hello,
I had a similar issue and a colleague helped me figure it out. The nominal p-value reported by fastQTL was calculated using a different value for degrees of freedom (df) than the beta-derived p-value. The reported nominal p-value was calculated using df = sample size - 2 - number of covariates, whereas for the beta-derived p-value, the nominal p-value was first calculated using true_df (which usually doesn't match df and is different for each gene), and then this nominal p-value was adjusted using the beta distribution. If true_df is sufficiently larger than df for a given gene, then the nominal threshold for that gene, calculated from the beta-adjusted p-value using qbeta, may be more significant than the most significant nominal p-value reported for that gene.
Can you confirm if the genes that are significant in genes.txt but not allpairs.txt have a true_df value that is larger than sample size - 2 - number of covariates?
Thanks,
Kevin
How many genes are missing from your signif_variant_gene_pairs.txt file? In some cases, genes close to the FDR cutoff might get excluded due to how the threshold is computed here.
Hi, @kwcurrin and @francois-a Thank you all for helping and explaining here.
@kwcurrin In my result, the genes that are significant in genes.txt (I filtered out qval <0.05) but not signif_pairs.txt (I noticed you mentioned allpairs.txt but I am using signif_pairs.txt )indeed all have a true_df value that is larger than sample size - 2 - number of covariates (In my case 73 ). But I found that some of those genes that appear in both files also have true_df value that is larger than sample size - 2 - number of covariates.
table(my_egenes_q0.05$true_df >=73)
FALSE TRUE
600 205
table(diff_table$true_df >=73)
TRUE
17
@francois-a Hi Francois, there are 17 genes missing from signif_variant_gene_pairs.txt(I put them in diff_table
, and I put the common genes shown both in two files as common_table
.
table(diff_table$pval_nominal >= diff_table$pval_nominal_threshold)
TRUE
17
table(common_table$pval_nominal >= common_table$pval_nominal_threshold)
FALSE
788
So that's pval_nominal >= pval_nominal_threshold is the reason for this situation. I think it is similar to what you mentioned about the computed threshold. I put this reply here in case someone has similar concerns.
Hi xiyasong,
Sorry, I should have clarified. It is possible for a gene in both files to also have true_df > than df because this is only an issue really when the nominal p-value is close to the threshold. When true_df is larger than df, the reported nominal p-value calculated with df will be larger than the p-value calculated with true_df. But if the p-value is very significant (much smaller than the threshold), then this difference in true_df and df won't cause an issue because the difference isn't large enough for the p-value to go above the threshold. However, if the p-value is close to the threshold, then the difference in true_df and df can matter, it just depends on the difference between the df values and how close the p-value is to the threshold.
I don't disagree that this could also be partly due to how the threshold is calculated, it is just worth noting that the df difference is a possibility also.
Hi! After struggling to search and carefully read all the materials online I still have this problem:
To simplify
Is that a normal situation that some of the eGenes selected with qval <=0.05 from
*.genes.txt
are not included in the*.signif_variant_gene_pairs.txt
? Or is there something wrong if that happened?Detailed description
I have used the gtex-v8 pipeline to call eQTLs on my own dataset. And I successfully got the final results:
*.genes.txt
(which I confirmed is the same as*.egenes.txt
file).Using this file, I selected the significant eGenes through filtering the qval <=0.05. I also used the
annotate_outputs.py
to generate the*.signif_variant_gene_pairs.txt
from my*.allpairs.txt
and*.genes.txt
My aim is to check all the significant variants for the eGenes that with qval <=0.05. (I understand that in the original
*.genes.txt
there was only the most significant variant shown.)The strange thing is some eGenes with qval <=0.05 did not appear in
*.signif_variant_gene_pairs.txt.
Then I have checked the code in
annotate_outputs.py
, and I found that those egenes that did not show in*.signif_variant_gene_pairs.txt
have a common feature: pval_nominal is larger than the pval_nominal_threshold(I checked them from the*.genes.txt
). That made them lost in the significant gene pairs file.In my understanding and reading, you have mentioned that
I also read Section_ 4.2 of the supplementary materials for the paper https://www.science.org/doi/full/10.1126/science.aaz1776 (from which you mentioned in the issue #2 ). It seems the assessment for eGenes(qval <=0.05) and significant gene-variants pairs are different:
Is that means different statistical ways were used in these two table? I have tried my best to understand the statistical model that was mentioned but I think I might be thinking wrong in some way. Thanks for your reading and hoping for your reply!