broadinstitute / gtex-v8

Notebooks and scripts for reproducing analyses and figures from the V8 GTEx Consortium paper
BSD 3-Clause "New" or "Revised" License
40 stars 13 forks source link

Would it be possible that some egenes that have qval <=0.05 that inside the *.egenes.txt file not appears in the *.signif_variant_gene_pairs.txt file? #4

Closed xiyasong closed 2 years ago

xiyasong commented 2 years ago

Hi! After struggling to search and carefully read all the materials online I still have this problem:

To simplify

Is that a normal situation that some of the eGenes selected with qval <=0.05 from *.genes.txt are not included in the *.signif_variant_gene_pairs.txt? Or is there something wrong if that happened?

Detailed description

I have used the gtex-v8 pipeline to call eQTLs on my own dataset. And I successfully got the final results: *.genes.txt (which I confirmed is the same as *.egenes.txt file).

Using this file, I selected the significant eGenes through filtering the qval <=0.05. I also used the annotate_outputs.py to generate the *.signif_variant_gene_pairs.txtfrom my *.allpairs.txtand*.genes.txt

My aim is to check all the significant variants for the eGenes that with qval <=0.05. (I understand that in the original *.genes.txt there was only the most significant variant shown.)

The strange thing is some eGenes with qval <=0.05 did not appear in *.signif_variant_gene_pairs.txt.

Then I have checked the code in annotate_outputs.py, and I found that those egenes that did not show in *.signif_variant_gene_pairs.txt have a common feature: pval_nominal is larger than the pval_nominal_threshold(I checked them from the *.genes.txt ). That made them lost in the significant gene pairs file.

In my understanding and reading, you have mentioned that

the *.signif_variant_gene_pairs.txt files contain all pairs that pass the significance threshold *for each eGene. So all the eGenes (which means the genes that have a qval <0.05 in the `.genes.txtfile? )should have at least one pair in the*.signif_variant_gene_pairs.txt `. But it did not.

I also read Section_ 4.2 of the supplementary materials for the paper https://www.science.org/doi/full/10.1126/science.aaz1776 (from which you mentioned in the issue #2 ). It seems the assessment for eGenes(qval <=0.05) and significant gene-variants pairs are different:

The beta distribution-extrapolated empirical P-values from FastQTL were used to calculate gene-level q-values [73] with a fixed P-value interval for the estimation of π0 (the ‘lambda’ parameter was set to 0.85).

For each gene, variants with a nominal P-value below the gene-level threshold (fig. S8) were considered significant and included in the final list of variant-gene pairs

Is that means different statistical ways were used in these two table? I have tried my best to understand the statistical model that was mentioned but I think I might be thinking wrong in some way. Thanks for your reading and hoping for your reply!

kwcurrin commented 2 years ago

Hello,

I had a similar issue and a colleague helped me figure it out. The nominal p-value reported by fastQTL was calculated using a different value for degrees of freedom (df) than the beta-derived p-value. The reported nominal p-value was calculated using df = sample size - 2 - number of covariates, whereas for the beta-derived p-value, the nominal p-value was first calculated using true_df (which usually doesn't match df and is different for each gene), and then this nominal p-value was adjusted using the beta distribution. If true_df is sufficiently larger than df for a given gene, then the nominal threshold for that gene, calculated from the beta-adjusted p-value using qbeta, may be more significant than the most significant nominal p-value reported for that gene.

Can you confirm if the genes that are significant in genes.txt but not allpairs.txt have a true_df value that is larger than sample size - 2 - number of covariates?

Thanks,

Kevin

francois-a commented 2 years ago

How many genes are missing from your signif_variant_gene_pairs.txt file? In some cases, genes close to the FDR cutoff might get excluded due to how the threshold is computed here.

xiyasong commented 2 years ago

Hi, @kwcurrin and @francois-a Thank you all for helping and explaining here.

@kwcurrin In my result, the genes that are significant in genes.txt (I filtered out qval <0.05) but not signif_pairs.txt (I noticed you mentioned allpairs.txt but I am using signif_pairs.txt )indeed all have a true_df value that is larger than sample size - 2 - number of covariates (In my case 73 ). But I found that some of those genes that appear in both files also have true_df value that is larger than sample size - 2 - number of covariates. table(my_egenes_q0.05$true_df >=73) FALSE TRUE 600 205 table(diff_table$true_df >=73) TRUE 17 @francois-a Hi Francois, there are 17 genes missing from signif_variant_gene_pairs.txt(I put them in diff_table , and I put the common genes shown both in two files as common_table.

table(diff_table$pval_nominal >= diff_table$pval_nominal_threshold) TRUE 17 table(common_table$pval_nominal >= common_table$pval_nominal_threshold) FALSE 788

So that's pval_nominal >= pval_nominal_threshold is the reason for this situation. I think it is similar to what you mentioned about the computed threshold. I put this reply here in case someone has similar concerns.

kwcurrin commented 2 years ago

Hi xiyasong,

Sorry, I should have clarified. It is possible for a gene in both files to also have true_df > than df because this is only an issue really when the nominal p-value is close to the threshold. When true_df is larger than df, the reported nominal p-value calculated with df will be larger than the p-value calculated with true_df. But if the p-value is very significant (much smaller than the threshold), then this difference in true_df and df won't cause an issue because the difference isn't large enough for the p-value to go above the threshold. However, if the p-value is close to the threshold, then the difference in true_df and df can matter, it just depends on the difference between the df values and how close the p-value is to the threshold.

I don't disagree that this could also be partly due to how the threshold is calculated, it is just worth noting that the df difference is a possibility also.