YuLab-SMU / clusterProfiler

:bar_chart: A universal enrichment tool for interpreting omics data
https://yulab-smu.top/biomedical-knowledge-mining-book/
1.03k stars 256 forks source link

enricher not working properly #739

Closed Gambrian closed 3 weeks ago

Gambrian commented 3 weeks ago

> go_res = enricher(gene = target_id,
                    TERM2GENE = go_df[,c("go_id","gene")],
                    TERM2NAME = go_df[,c("go_id","go_description")],
                    pvalueCutoff = 0.1,
                    pAdjustMethod = "fdr",
                    qvalueCutoff = 1,
                    minGSSize = 10,
                    maxGSSize = 500)

> go_res %>% as.data.frame()
             ID Description GeneRatio  BgRatio RichFactor FoldEnrichment   zScore       pvalue    p.adjust      qvalue
0005814 0005814   centriole      8/97  70/7786 0.11428571       9.173490 7.715085 2.169533e-06 0.001321246 0.001281166
0005929 0005929      cilium      7/97 101/7786 0.06930693       5.563132 5.184203 2.480146e-04 0.075520457 0.073229585
                                                         geneID Count
0005814 H7BZ55/Q9BZW7/Q96RK4/P41208/Q8TC44/Q4AC94/Q6UVJ0/Q8NA72     8
0005929        Q96RK4/Q6IPM2/Q8N271/Q9NWB7/P59190/Q6TDU7/Q8N4P2     7

> go_res@result %>% filter(pvalue <= 0.1) %>% nrow
[1] 78

> go_res@result %>% filter(pvalue <= 0.1) %>% head()
             ID                                   Description GeneRatio  BgRatio RichFactor FoldEnrichment   zScore       pvalue    p.adjust
0005814 0005814                                     centriole      8/97  70/7786 0.11428571       9.173490 7.715085 2.169533e-06 0.001321246
0005929 0005929                                        cilium      7/97 101/7786 0.06930693       5.563132 5.184203 2.480146e-04 0.075520457
0032391 0032391               photoreceptor connecting cilium      3/97  17/7786 0.17647059      14.164948 6.102971 1.122946e-03 0.181097783
0035925 0035925            mRNA 3'-UTR AU-rich region binding      3/97  18/7786 0.16666667      13.378007 5.904901 1.335418e-03 0.181097783
1990837 1990837 sequence-specific double-stranded DNA binding      6/97 100/7786 0.06000000       4.816082 4.313679 1.486846e-03 0.181097783
0036064 0036064                            ciliary basal body      5/97  81/7786 0.06172840       4.954817 4.018490 3.284834e-03 0.269016734
             qvalue                                                  geneID Count
0005814 0.001281166 H7BZ55/Q9BZW7/Q96RK4/P41208/Q8TC44/Q4AC94/Q6UVJ0/Q8NA72     8
0005929 0.073229585        Q96RK4/Q6IPM2/Q8N271/Q9NWB7/P59190/Q6TDU7/Q8N4P2     7
0032391 0.175604280                                    Q96RK4/P41208/Q9NWB7     3
0035925 0.175604280                                    Q13151/P31483/Q15717     3
1990837 0.175604280               O95644/Q00978/P00519/O14770/P15336/P13056     6
0036064 0.260856257                      Q96RK4/P41208/Q9NWB7/Q8TC44/Q4AC94     5

> packageVersion("clusterProfiler")
[1] ‘4.12.6’

> sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

I'm wondering how the enricher filters the results by these cutoffs, I also saw issue #737 but I think I'm encountering a bug rather than a feature

guidohooiveld commented 3 weeks ago

You may want to rephrase your question.

By converting the results into a data.frame filtering is applied; first on (adjusted) p-values, and then on q-value. This is highlighted in https://github.com/YuLab-SMU/clusterProfiler/issues/737. This filtering is automagically done, because the filtering function get_enriched() is being used when converting the results of an enrichment analysis into a data.frame. See the code definitions in accessor.R (https://github.com/YuLab-SMU/DOSE/blob/4bdf261db7e2d088ca588a564f1db5a85ba522b7/R/accessor.R#L1)

https://github.com/YuLab-SMU/DOSE/blob/4bdf261db7e2d088ca588a564f1db5a85ba522b7/R/enricher_internal.R#L213-L234

In the first part of your code (go_res %>% as.data.frame()) filtering is thus based on fdr only (i.e. fdr smaller or equal 0.1, and fdr only because qvalueCutoff = 1). Note that fdr results are present in the column p.adjust. https://github.com/YuLab-SMU/DOSE/blob/4bdf261db7e2d088ca588a564f1db5a85ba522b7/R/enricher_internal.R#L147

By using the @ accessor, you directly pull out the raw c.q. all results. Note that next you did filter for pvalue, and not p.adjust (the latter corresponds to the fdr)! If you should have filtered on p.adjust you will see that indeed the same / only 2 sets 'survive' as when extracting the results as data.frame (because all other sets have p.adjust > 0.1, i.e. 0.181, 0.26 etc).

Finally, by reading the help page for p.adjust (type: ?p.adjust) you will see: ".... Benjamini & Hochberg (1995) ("BH" or its alias "fdr") ...". Thus setting the method to fdr the Benjamini-Hochberg method for adjusting for multiple testing will be used. Thus fdr = BH.

Gambrian commented 3 weeks ago

Thanks for your reply, I think I made a mistake because I didn't see any description about pvalueCutoff

pvalueCutoff | adjusted pvalue cutoff on enrichment tests to report

so if I got some results and I want to filter by pvalue , I need to set pAdjustMethod to "none", right ? and I have another question about qvalue

> go_res = enricher(gene = target_id, +                   TERM2GENE = go_df[,c("go_id","gene")], +                   TERM2NAME = go_df[,c("go_id","go_description")], +                   pvalueCutoff = 0.05, +                   pAdjustMethod = "none", +                   qvalueCutoff = 1, +                   minGSSize = 10, +                   maxGSSize = 500) 
> qvalue(go_res@result$pvalue)$qvalue %>% head()
 [1] 0.0001686506 0.0096398217 0.0231162575 0.0231162575 0.0231162575 0.0343386871 
> go_res@result$qvalue %>% head() 
[1] 0.001281166 0.073229585 0.175604280 0.175604280 0.175604280 0.260856257
> packageVersion("qvalue")
[1] ‘2.36.0’

I want to know what causes this difference

guidohooiveld commented 3 weeks ago

Yes, if pAdjustMethod = "none", indeed no p-value adjustment for multiple testing will be applied. Values in columns pvalue and p.adjust will be identical.

Regarding gene set filtering: first values in in column pvalue will be used to select gene sets, 'surviving' sets will then be filtered for p.adjust (thus using the same cutoff value!), and finally surviving sets will be filtered for qvalue cutoff (default setting is qvalueCutoff = 0.2). On the help page of enrichGO: qvalueCutoff - qvalue cutoff on enrichment tests to report as significant. Tests must pass i) pvalueCutoff on unadjusted pvalues, ii) pvalueCutoff on adjusted pvalues and iii) qvalueCutoff on qvalues to be reported. The code doing this is thus the internal function get_enriched: https://github.com/YuLab-SMU/DOSE/blob/4bdf261db7e2d088ca588a564f1db5a85ba522b7/R/enricher_internal.R#L213-L234

Regarding the calculation of qvalues: That is done in this line in the function enricher_internal: https://github.com/YuLab-SMU/DOSE/blob/4bdf261db7e2d088ca588a564f1db5a85ba522b7/R/enricher_internal.R#L148

Values can be fully reproduced by manual calculation. See below.

> library(clusterProfiler)
>  
> ## load example data
> data(geneList, package = "DOSE")
> de <- names(geneList)[1:100]
> 
> ## default analysis
> y1 <- enrichGO(de, 'org.Hs.eg.db',
+               ont = "CC",
+               pvalueCutoff = 0.05,
+               pAdjustMethod = "BH",
+               qvalueCutoff = 0.2)
> 
> 
> ## Manual calculation of qvalues
> ## note that both pvalueCutoff and qvalueCutoff are set to 1,
> ## because ALL pvalues have to be used for qvalue calculations!
> library(qvalue)
> 
> y2 <- enrichGO(de, 'org.Hs.eg.db',
+               ont = "CC",
+               pvalueCutoff = 1,
+               pAdjustMethod = "none",
+               qvalueCutoff = 1)
> 
> qvalues <- qvalue(p=y2@result$pvalue, lambda=0.05, pi0.method="bootstrap")
> 
> ## check and compare
> ## qvalues are identical!
> 
> as.data.frame(y1)[1:6, "qvalue"]
[1] 8.448492e-17 8.448492e-17 8.448492e-17 9.585565e-17 2.477193e-16
[6] 8.026847e-16
> 
> head(qvalues$qvalues)
[1] 8.448492e-17 8.448492e-17 8.448492e-17 9.585565e-17 2.477193e-16
[6] 8.026847e-16
> 
> 
> 
Gambrian commented 3 weeks ago

Thank you for your patient answer