YuLab-SMU / clusterProfiler

:bar_chart: A universal enrichment tool for interpreting omics data
https://yulab-smu.top/biomedical-knowledge-mining-book/
967 stars 246 forks source link

Different behaviour of gseGO upon setting nPerm=1000 #695

Closed olgabaranov closed 3 weeks ago

olgabaranov commented 1 month ago

On my attempt to increase the number of permutations in gseGO I am faced with an unexpected behaviour of the function:

> set.seed(1)
> gse <- gseGO(fullist,
        ont = "BP",
        keyType = "ENSEMBL",
        OrgDb = "org.Hs.eg.db"
)
> dim(gse)

[1] 7 11

When repeating it with nPerm = 1000 actively set:

> set.seed(1)
> gse <- gseGO(fullist,
        ont = "BP",
        keyType = "ENSEMBL",
        OrgDb = "org.Hs.eg.db",
        nPerm = 1000
)
> dim(gse)

[1] 0 11

But as far as I know, nPerm=1000 is the default so I would expect same results for both. The result is the same (0) regardless of the exact value of nPerm.

Apologies for not providing a reproducible example as I can't find a public dataset to use for it. I can try to make it work later...

R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] GSEABase_1.66.0             graph_1.82.0               
 [3] annotate_1.82.0             XML_3.99-0.16.1            
 [5] BiocManager_1.30.23         biomaRt_2.60.0             
 [7] org.Hs.eg.db_3.19.1         AnnotationDbi_1.66.0       
 [9] pathfindR_2.4.1.9000        pathfindR.data_2.1.0       
[11] EnhancedVolcano_1.22.0      ggrepel_0.9.5              
[13] clusterProfiler_4.12.0      sva_3.52.0                 
[15] BiocParallel_1.38.0         genefilter_1.86.0          
[17] mgcv_1.8-41                 nlme_3.1-162               
[19] DESeq2_1.44.0               SummarizedExperiment_1.34.0
[21] Biobase_2.64.0              MatrixGenerics_1.16.0      
[23] matrixStats_1.3.0           GenomicRanges_1.56.0       
[25] GenomeInfoDb_1.40.0         IRanges_2.38.0             
[27] S4Vectors_0.42.0            BiocGenerics_0.50.0        
[29] lubridate_1.9.3             forcats_1.0.0              
[31] stringr_1.5.1               dplyr_1.1.4                
[33] purrr_1.0.2                 readr_2.1.5                
[35] tidyr_1.3.1                 tibble_3.2.1               
[37] ggplot2_3.5.1               tidyverse_2.0.0            
guidohooiveld commented 1 month ago

Did you notice the 2 warnings when running the function with the argument nPerm = 1000 explicitly provided?

Warning messages:
1: In .GSEA(geneList = geneList, exponent = exponent, minGSSize = minGSSize,  :
  We do not recommend using nPerm parameter incurrent and future releases
2: In fgsea(pathways = geneSets, stats = geneList, nperm = nPerm, minSize = minGSSize,  :
  You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument in the fgsea function call.
>

This implicitly means that you are comparing 2 different "modes" of GSEA; your first run (without providing nPerm = 1000) runs the recommended mode of GSEA, which is through the function fgseaMultilevel, that calculates exact p-values.

By contrast, your second run utilizes the function fgseaSimple, that calculates p-values with slightly limited accuracy.

Note that under the hood both functions are provided by the library fgsea. See the fgsea preprint for more info on the (differences between the) 2 methods/functions: https://doi.org/10.1101/060012 .

Thus, it is anticipated that the results of both runs will be slightly different. See below for some code that indeed shows this.

Yet, this does not mean that for both functions fully reproducible results can not be obtained! To do so you will indeed need to define a seed value, but you will have to include the use of that seed value when calling gseGO by changing the argument seed to true (seed = TRUE); default isseed = FALSE. See also this thread: https://github.com/YuLab-SMU/clusterProfiler/issues/466

> ## load libraries
> library(clusterProfiler)
> library(org.Hs.eg.db)
> 
> ##load sample data
> data(geneList, package="DOSE")
> 
> ## run GSEA by using recommended mode; but do not apply a significance cutoff!
> res.recom <- gseGO(geneList = geneList,
+               OrgDb         = org.Hs.eg.db,
+               ont           = "CC",
+               minGSSize     = 100,
+               maxGSSize     = 500,
+               pvalueCutoff  = 1,
+               eps           = 0)
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
> 
> ## run GSEA by using permutations (thus with slightly reduced accuracy)
> ## note the warnings!
> res.permu <- gseGO(geneList = geneList,
+               OrgDb        = org.Hs.eg.db,
+               ont          = "CC",
+               minGSSize    = 100,
+               maxGSSize    = 500,
+               pvalueCutoff = 1,
+               nPerm        = 1000)
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
Warning messages:
1: In .GSEA(geneList = geneList, exponent = exponent, minGSSize = minGSSize,  :
  We do not recommend using nPerm parameter incurrent and future releases
2: In fgsea(pathways = geneSets, stats = geneList, nperm = nPerm, minSize = minGSSize,  :
  You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument in the fgsea function call.
>
> 
> ## compare p-values; note that these are not identical!
> identical(res.recom, res.permu)
[1] FALSE
>
> merged.res1 <- merge( as.data.frame(res.recom), as.data.frame(res.permu), by.x="ID", by.y="ID")
> plot(  merged.res1$pvalue.x, merged.res1$pvalue.y ) 
> 
> 

image

> ## below code to show that both functions will return fully reproducible results when a seed is used
> set.seed(1234)
> 
> ## run GSEA by using permutations
> run1 <- gseGO(geneList     = geneList,
+               OrgDb        = org.Hs.eg.db,
+               ont          = "CC",
+               minGSSize    = 100,
+               maxGSSize    = 500,
+               pvalueCutoff = 1,
+               nPerm        = 1000,
+               verbose      = FALSE,
+               seed         = TRUE)
Warning messages:
1: In .GSEA(geneList = geneList, exponent = exponent, minGSSize = minGSSize,  :
  We do not recommend using nPerm parameter incurrent and future releases
2: In fgsea(pathways = geneSets, stats = geneList, nperm = nPerm, minSize = minGSSize,  :
  You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument in the fgsea function call.
> 
> 
> run2 <- gseGO(geneList     = geneList,
+               OrgDb        = org.Hs.eg.db,
+               ont          = "CC",
+               minGSSize    = 100,
+               maxGSSize    = 500,
+               pvalueCutoff = 1,
+               nPerm        = 1000,
+               verbose      = FALSE,
+               seed         = TRUE)
Warning messages:
1: In .GSEA(geneList = geneList, exponent = exponent, minGSSize = minGSSize,  :
  We do not recommend using nPerm parameter incurrent and future releases
2: In fgsea(pathways = geneSets, stats = geneList, nperm = nPerm, minSize = minGSSize,  :
  You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument in the fgsea function call.
> 
> ## results are identical, as illustrated by perfect diagonal!
> identical(run1, run2)
[1] TRUE
>
>
> merged.res2 <- merge( as.data.frame(run1), as.data.frame(run2), by.x="ID", by.y="ID")
> plot(  merged.res2$pvalue.x, merged.res2$pvalue.y ) 
> 
>

image


> ## repeat for running GSEA through the recommended mode.
> ## note that seed has already been set above.
> 
> run3 <- gseGO(geneList     = geneList,
+               OrgDb        = org.Hs.eg.db,
+               ont          = "CC",
+               minGSSize    = 100,
+               maxGSSize    = 500,
+               pvalueCutoff = 1,
+               eps          = 0,
+               verbose      = FALSE,
+               seed         = TRUE)
> 
> 
> run4 <- gseGO(geneList     = geneList,
+               OrgDb        = org.Hs.eg.db,
+               ont          = "CC",
+               minGSSize    = 100,
+               maxGSSize    = 500,
+               pvalueCutoff = 1,
+               eps          = 0,
+               verbose      = FALSE,
+               seed         = TRUE)
> 
> ## results are again identical, as illustrated by another perfect diagonal!
> 
>
> identical(run3, run4)
[1] TRUE
>
> merged.res3 <- merge( as.data.frame(run3), as.data.frame(run4), by.x="ID", by.y="ID")
> plot(  merged.res3$pvalue.x, merged.res3$pvalue.y ) 
> 

image

olgabaranov commented 3 weeks ago

Thanks a lot for the detailed explanation!