andreyshabalin / MatrixEQTL

Matrix eQTL: Ultra fast eQTL analysis via large matrix operations
58 stars 17 forks source link

FDR: how is it calculated by the package? #11

Closed complexgenome closed 4 years ago

complexgenome commented 4 years ago

hi there devs,

Thanks for this awesome package. I'm using ~120 genes and ~44K SNPs to perform cis eqtl analysis. I use 2Mb as cis eqtl boundary.

pvOutputThreshold_cis = 0.1;
pvOutputThreshold_tra = 0.1;
pvOutputThreshold = 0.1;

me = Matrix_eQTL_main(
    snps=snps,    gene=gene, 
    output_file_name = NULL,
    pvOutputThreshold=pvOutputThreshold,
    useModel=useModel,
    errorCovariance=errorCovariance,
    verbose=TRUE,pvalue.hist=FALSE,min.pv.by.genesnp = FALSE,noFDRsaveMemory = FALSE,
    snpspos=snpspos,cisDist=2e+06,genepos=genepos,pvOutputThreshold.cis=pvOutputThreshold_cis);

df_eqtl<-me$cis$eqtls

After analysis, I extract cis eqtls. However, I'd like to replicate the FDr generated by the matrix eqtl function. First, I use p.adjust method FDR in the output the p-value are little off (better).

df_eqtl_jointed$TEST_FDR<-p.adjust(p=df_eqtl$pvalue,n=nrow(df_eqtl),method="fdr")

snps gene chr pos statistic pvalue FDR beta gene_type TEST_FDR
16:64975370:G:A ENSG00000166592 16 64975370 8.694963 6.32E-15 7.87E-09 17.23002 protein_coding 8.17E-10
16:65320793:G:C ENSG00000166592 16 65320793 7.939682 4.80E-13 1.79E-08 10.07585 protein_coding 1.85E-09
16:65321249:C:T ENSG00000166592 16 65321249 7.939682 4.80E-13 1.79E-08 10.07585 protein_coding 1.85E-09
16:65321591:T:G ENSG00000166592 16 65321591 7.939682 4.80E-13 1.79E-08 10.07585 protein_coding 1.85E-09
16:65322806:T:C ENSG00000166592 16 65322806 7.939682 4.80E-13 1.79E-08 10.07585 protein_coding 1.85E-09

Number of rows in the cis eqtl output= 129196

If I use the number of tests when calculating FDR, head(me$param)

$ntests
[1] 30385434

p.adjust(p=df_eqtl_jointed$pvalue,n=30385434,method="fdr")

The output looks like: snps gene start_hg19 end_hg19 pos statistic pvalue FDR beta chrom TEST_FDR_NTEST
16:64975370:G:A ENSG00000166592 66955582 66959547 64975370 8.694963 6.32E-15 7.87E-09 17.23002 chr16 1.92E-07
16:65320793:G:C ENSG00000166592 66955582 66959547 65320793 7.939682 4.80E-13 1.79E-08 10.07585 chr16 4.36E-07
16:65321249:C:T ENSG00000166592 66955582 66959547 65321249 7.939682 4.80E-13 1.79E-08 10.07585 chr16 4.36E-07
16:65321591:T:G ENSG00000166592 66955582 66959547 65321591 7.939682 4.80E-13 1.79E-08 10.07585 chr16 4.36E-07

I'm interested to know on FDR because I'd subset genes from the analysis. So, knowing correct method would help me to calculate FDR for them separately.

Also, in the output, is FDR calculated for each gene individually? Just curious.

Thank you.

andreyshabalin commented 4 years ago

Dear Sariya,

You almost got there. You should be correcting for the actual number of tests in cis-eQTL analysis. It is reported in me$cis$eqtls$pvalue. Here is the code confirming the FDR you'd get is the same.

me$cis$eqtls$FDR_p.adjust = p.adjust(p = me$cis$eqtls$pvalue, n = me$cis$ntests, method = "fdr")

range( me$cis$eqtls$FDR_p.adjust / me$cis$eqtls$FDR )

# 1 1

Andrey