YuLab-SMU / clusterProfiler

:bar_chart: A universal enrichment tool for interpreting omics data
https://yulab-smu.top/biomedical-knowledge-mining-book/

groupGO gets stuck in R 3.5.0 #155

Closed romanhaa closed 3 years ago

romanhaa commented 6 years ago

I'm experiencing a problem running groupGO in R 3.5.0: the call never finishes. The same command runs fine in R 3.4.4, although the clusterProfiler version differs between the two. All packages were freshly installed and should therefore be up-to-date (see sessionInfo() below).

Commands

library('clusterProfiler')
data(gcSample)
yy <- groupGO(gcSample[[1]], 'org.Hs.eg.db', ont="BP", level=2)

What happens

In R 3.5.0 I had to abort this command because it didn't finish even after 30 minutes. In R 3.4.4 it is correctly executed within 1 minute.
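When reproducing a hang like this, it can help to cap the run time instead of waiting and aborting by hand. The following is a minimal sketch using base R's setTimeLimit(); run_with_limit is a hypothetical helper (not part of clusterProfiler), and Sys.sleep() stands in for the call that gets stuck.

```r
# Evaluate `expr`, giving up after `seconds` of elapsed time.
# Time limits are only checked where R can be interrupted, so a call stuck
# inside non-interruptible compiled code may not stop on schedule.
run_with_limit <- function(expr, seconds) {
  setTimeLimit(elapsed = seconds, transient = TRUE)
  # Always clear the limit again, whether or not it was reached
  on.exit(setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE))
  tryCatch(
    expr,
    error = function(e) {
      # Assumes English messages; R reports "reached elapsed time limit"
      if (grepl("time limit", conditionMessage(e), fixed = TRUE)) {
        return(structure(list(seconds = seconds), class = "timed_out"))
      }
      stop(e)  # re-raise anything that is not a timeout
    }
  )
}

slow <- run_with_limit(Sys.sleep(5), seconds = 0.5)  # stand-in for a stuck call
fast <- run_with_limit(1 + 1, seconds = 5)
print(class(slow))
print(fast)
```

In this report's setting, the wrapped expression would be the groupGO(...) call from the Commands section, with a limit of a few minutes.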

System

In both containers, I ran:

source('https://bioconductor.org/biocLite.R')
biocLite('clusterProfiler')

sessionInfo R 3.4.4

R version 3.4.4 RC (2018-03-09 r74380)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux buster/sid

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] org.Hs.eg.db_3.5.0    AnnotationDbi_1.40.0  IRanges_2.12.0
[4] S4Vectors_0.16.0      Biobase_2.38.0        BiocGenerics_0.24.0
[7] clusterProfiler_3.6.0 DOSE_3.4.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17        compiler_3.4.4      pillar_1.2.3
 [4] plyr_1.8.4          bindr_0.1.1         tools_3.4.4
 [7] digest_0.6.15       bit_1.1-14          RSQLite_2.1.1
[10] memoise_1.1.0       tibble_1.4.2        gtable_0.2.0
[13] pkgconfig_2.0.1     rlang_0.2.1         igraph_1.2.1
[16] fastmatch_1.1-0     DBI_1.0.0           rvcheck_0.1.0
[19] bindrcpp_0.2.2      gridExtra_2.3       fgsea_1.4.1
[22] stringr_1.3.1       dplyr_0.7.6         tidyselect_0.2.4
[25] bit64_0.9-7         grid_3.4.4          qvalue_2.10.0
[28] glue_1.2.0          data.table_1.11.4   R6_2.2.2
[31] BiocParallel_1.12.0 GOSemSim_2.4.1      tidyr_0.8.1
[34] reshape2_1.4.3      purrr_0.2.5         magrittr_1.5
[37] GO.db_3.5.0         ggplot2_3.0.0       DO.db_2.9
[40] blob_1.1.1          splines_3.4.4       scales_0.5.0
[43] assertthat_0.2.0    colorspace_1.3-2    stringi_1.2.3
[46] lazyeval_0.2.1      munsell_0.5.0

sessionInfo R 3.5.0

R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux buster/sid

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] org.Hs.eg.db_3.6.0    AnnotationDbi_1.42.1  IRanges_2.14.10
[4] S4Vectors_0.18.3      Biobase_2.40.0        BiocGenerics_0.26.0
[7] clusterProfiler_3.8.1 BiocInstaller_1.30.0

loaded via a namespace (and not attached):
 [1] ggrepel_0.8.0       Rcpp_0.12.17        lattice_0.20-35
 [4] tidyr_0.8.1         GO.db_3.6.0         assertthat_0.2.0
 [7] digest_0.6.15       ggforce_0.1.3       R6_2.2.2
[10] plyr_1.8.4          ggridges_0.5.0      RSQLite_2.1.1
[13] ggplot2_3.0.0       pillar_1.2.3        rlang_0.2.1
[16] lazyeval_0.2.1      data.table_1.11.4   blob_1.1.1
[19] Matrix_1.2-14       qvalue_2.12.0       splines_3.5.0
[22] BiocParallel_1.14.2 stringr_1.3.1       igraph_1.2.1
[25] bit_1.1-14          munsell_0.5.0       fgsea_1.6.0
[28] compiler_3.5.0      pkgconfig_2.0.1     tidyselect_0.2.4
[31] tibble_1.4.2        gridExtra_2.3       enrichplot_1.0.2
[34] viridisLite_0.3.0   dplyr_0.7.6         MASS_7.3-50
[37] grid_3.5.0          gtable_0.2.0        DBI_1.0.0

Extract of code (R 3.4.4)

> library('clusterProfiler')

Loading required package: DOSE

DOSE v3.4.0  For help: https://guangchuangyu.github.io/DOSE

If you use DOSE in published research, please cite:
Guangchuang Yu, Li-Gen Wang, Guang-Rong Yan, Qing-Yu He. DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 2015, 31(4):608-609

clusterProfiler v3.6.0  For help: https://guangchuangyu.github.io/clusterProfiler

If you use clusterProfiler in published research, please cite:
Guangchuang Yu., Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.

> data(gcSample)
> yy <- groupGO(gcSample[[1]], 'org.Hs.eg.db', ont="BP", level=2)

Loading required package: org.Hs.eg.db
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colMeans, colnames,
    colSums, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match,
    mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which, which.max, which.min

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: IRanges
Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

# finished within 1 minute

> head(yy)

                   ID            Description Count GeneRatio ...
GO:0000003 GO:0000003           reproduction    16    16/216 ...
GO:0008152 GO:0008152      metabolic process   120   120/216 ...
GO:0001906 GO:0001906           cell killing     4     4/216 ...
GO:0002376 GO:0002376  immune system process    60    60/216 ...
GO:0006791 GO:0006791     sulfur utilization     0     0/216 ...
GO:0006794 GO:0006794 phosphorus utilization     0     0/216 ...

Extract of code (R 3.5.0)

> library('clusterProfiler')

clusterProfiler v3.8.1  For help: https://guangchuangyu.github.io/software/clusterProfiler

If you use clusterProfiler in published research, please cite:
Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.

> data(gcSample)
> yy <- groupGO(gcSample[[1]], 'org.Hs.eg.db', ont="BP", level=2)

Loading required package: org.Hs.eg.db
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
    colnames, colSums, dirname, do.call, duplicated, eval, evalq,
    Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
    rowSums, sapply, setdiff, sort, table, tapply, union, unique,
    unsplit, which, which.max, which.min

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: IRanges
Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

# 30 minutes passed
^C

romanhaa commented 6 years ago

Just wanted to add that I just tried the same with R 3.5.1 and it also gets stuck.

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux buster/sid

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] org.Hs.eg.db_3.6.0    AnnotationDbi_1.42.1  IRanges_2.14.10
[4] S4Vectors_0.18.3      Biobase_2.40.0        BiocGenerics_0.26.0
[7] clusterProfiler_3.8.1 BiocInstaller_1.30.0

loaded via a namespace (and not attached):
 [1] ggrepel_0.8.0       Rcpp_0.12.17        lattice_0.20-35
 [4] tidyr_0.8.1         GO.db_3.6.0         assertthat_0.2.0
 [7] digest_0.6.15       ggforce_0.1.3       R6_2.2.2
[10] plyr_1.8.4          ggridges_0.5.0      RSQLite_2.1.1
[13] ggplot2_3.0.0       pillar_1.2.3        rlang_0.2.1
[16] lazyeval_0.2.1      data.table_1.11.4   blob_1.1.1
[19] Matrix_1.2-14       qvalue_2.12.0       splines_3.5.1
[22] BiocParallel_1.14.2 stringr_1.3.1       igraph_1.2.1
[25] bit_1.1-14          munsell_0.5.0       fgsea_1.6.0
[28] compiler_3.5.1      pkgconfig_2.0.1     tidyselect_0.2.4
[31] tibble_1.4.2        gridExtra_2.3       enrichplot_1.0.2
[34] viridisLite_0.3.0   dplyr_0.7.6         MASS_7.3-50
[37] grid_3.5.1          gtable_0.2.0        DBI_1.0.0
[40] magrittr_1.5        units_0.6-0         scales_0.5.0
[43] stringi_1.2.3       GOSemSim_2.6.0      reshape2_1.4.3
[46] viridis_0.5.1       bindrcpp_0.2.2      DO.db_2.9
[49] rvcheck_0.1.0       cowplot_0.9.2       fastmatch_1.1-0
[52] tools_3.5.1         bit64_0.9-7         glue_1.2.0
[55] tweenr_0.1.5        purrr_0.2.5         ggraph_1.0.2
[58] colorspace_1.3-2    UpSetR_1.3.3        DOSE_3.6.1
[61] memoise_1.1.0       bindr_0.1.1

GuangchuangYu commented 6 years ago

See the "support many ID types" section on http://guangchuangyu.github.io/2016/01/go-analysis-using-clusterprofiler/.

To support different ID types, groupGO, enrichGO and gseGO now use the select interface to get the GO annotation from the OrgDb, and that implementation (in the AnnotationDbi package) is quite slow.

However, this should not take more than 30 minutes on an ordinary PC.
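To check how much of the run time comes from the annotation lookup alone, the select interface can be timed on its own. This is a rough sketch, not the package's actual internal call: the three Entrez IDs and the GOALL/ONTOLOGYALL columns are assumptions chosen for illustration, and the query only runs if AnnotationDbi and org.Hs.eg.db are installed.

```r
# Sketch only: times one AnnotationDbi::select() query in isolation.
timing <- NULL
if (requireNamespace("AnnotationDbi", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  keys <- c("4312", "8318", "10874")  # example Entrez gene IDs (assumed input)
  timing <- system.time(
    go_anno <- suppressMessages(AnnotationDbi::select(
      org.Hs.eg.db::org.Hs.eg.db,
      keys    = keys,
      keytype = "ENTREZID",                 # swap in "ENSEMBL" to compare key types
      columns = c("GOALL", "ONTOLOGYALL")   # assumed columns of interest
    ))
  )
  print(timing)
  print(head(go_anno))
}
```

Running the same query under both R versions, or with both key types, would show whether the lookup itself accounts for the difference.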

romanhaa commented 6 years ago

Yes, I understand that, and I have seen for myself that select() takes some time. However, I'm not sure this is the reason here, since it works fine with R 3.4.4. I also noticed that the R sessions that get stuck consume a lot of CPU resources.

kkolmus commented 6 years ago

@romanhaa, I experienced the same issue. It now takes ages to do any analysis with R 3.5.1.

romanhaa commented 6 years ago

@Krzysztof-Piotr I ended up using the enrichR API for my purposes: https://cran.r-project.org/web/packages/enrichR/index.html

MaxKman commented 5 years ago

Hi,

Thanks so much for developing clusterProfiler. It is a very useful package. However, something has changed since I last used it that makes the gseGO function nearly unusable. I let gseGO run for an entire night and it didn't finish, similar to the issue described above.

The command I used:

GSEA_out <- gseGO(geneList     = GSEA_in,
                  OrgDb        = org.Hs.eg.db,
                  ont          = "BP",
                  nPerm        = 1000,
                  pvalueCutoff = 1,        # no cutoff here; significant results are selected later
                  verbose      = TRUE,
                  keyType      = "ENSEMBL")

The output:

preparing geneSet collections...
GSEA analysis...
There are ties in the preranked stats (4.57% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.

Potentially related: The example from the vignette now fails with the error message

Error in data.frame(ID = as.character(tmp_res$pathway), Description = Description, : row names contain missing values

Traceback():

5: stop("row names contain missing values")
4: data.frame(ID = as.character(tmp_res$pathway), Description = Description, setSize = tmp_res$size, enrichmentScore = tmp_res$ES, NES = tmp_res$NES, pvalue = tmp_res$pval, p.adjust = p.adj, qvalues = qvalues, stringsAsFactors = FALSE)
3: .GSEA(geneList = geneList, exponent = exponent, nPerm = nPerm, minGSSize = minGSSize, maxGSSize = maxGSSize, pvalueCutoff = pvalueCutoff, pAdjustMethod = pAdjustMethod, verbose = verbose, seed = seed, USER_DATA = USER_DATA)
2: GSEA_internal(geneList = geneList, exponent = exponent, nPerm = nPerm, minGSSize = minGSSize, maxGSSize = maxGSSize, pvalueCutoff = pvalueCutoff, pAdjustMethod = pAdjustMethod, verbose = verbose, USER_DATA = GO_DATA, seed = seed, by = by)
1: gseGO(geneList = geneList, OrgDb = org.Hs.eg.db, ont = "CC", nPerm = 1000, minGSSize = 100, maxGSSize = 500, pvalueCutoff = 0.05, verbose = FALSE)

However, when I set ont to "BP" it runs through ok.

For my own dataset, the gene list does not seem to be the problem, since gseGO completes the computation with nPerm = 10 in 61 seconds.

Session info:

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] org.Hs.eg.db_3.7.0     AnnotationDbi_1.44.0   IRanges_2.16.0
 [4] S4Vectors_0.20.1       Biobase_2.40.0         BiocGenerics_0.28.0
 [7] cividis_0.2.0          gridExtra_2.3          forcats_0.3.0
[10] stringr_1.3.1          dplyr_0.7.8            purrr_0.3.0
[13] readr_1.3.1            tidyr_0.8.2            tibble_2.0.1
[16] ggplot2_3.1.0.9000     tidyverse_1.2.1        clusterProfiler_3.10.1
[19] reticulate_1.10.0.9001

loaded via a namespace (and not attached):
 [1] nlme_3.1-137        enrichplot_1.0.2    lubridate_1.7.4
 [4] bit64_0.9-7         httr_1.4.0          UpSetR_1.3.3
 [7] tools_3.5.1         backports_1.1.3     R6_2.3.0
[10] DBI_1.0.0           lazyeval_0.2.1      colorspace_1.4-0
[13] withr_2.1.2         tidyselect_0.2.5    bit_1.1-14
[16] compiler_3.5.1      cli_1.0.1           rvest_0.3.2
[19] xml2_1.2.0          scales_1.0.0        ggridges_0.5.1
[22] digest_0.6.18       DOSE_3.8.2          pkgconfig_2.0.2
[25] rlang_0.3.1         readxl_1.2.0        rstudioapi_0.9.0
[28] RSQLite_2.1.1       bindr_0.1.1         generics_0.0.2
[31] jsonlite_1.6        BiocParallel_1.14.2 GOSemSim_2.6.2
[34] magrittr_1.5        GO.db_3.6.0         Matrix_1.2-14
[37] Rcpp_1.0.0          munsell_0.5.0       viridis_0.5.1
[40] stringi_1.2.4       yaml_2.2.0          ggraph_1.0.2
[43] MASS_7.3-50         plyr_1.8.4          qvalue_2.12.0
[46] grid_3.5.1          blob_1.1.1          ggrepel_0.8.0.9000
[49] DO.db_2.9           crayon_1.3.4        lattice_0.20-35
[52] haven_2.0.0         cowplot_0.9.3       splines_3.5.1
[55] hms_0.4.2           knitr_1.21          pillar_1.3.1
[58] fgsea_1.6.0         igraph_1.2.2        reshape2_1.4.3
[61] fastmatch_1.1-0     glue_1.3.0          data.table_1.12.0
[64] modelr_0.1.2        tweenr_0.1.5        cellranger_1.1.0
[67] gtable_0.2.0        assertthat_0.2.0    xfun_0.4
[70] ggforce_0.1.3       broom_0.5.1         viridisLite_0.3.0
[73] rvcheck_0.1.0       memoise_1.1.0       units_0.6-0
[76] bindrcpp_0.2.2

EDIT: updating to the current clusterProfiler version from github (3.11.1) didn't solve the problem.

Best regards,
Max

MaxKman commented 5 years ago

I did some further digging. There seem to be two different problems here:

a) when working with ENSEMBL identifiers fgsea runs forever. It runs very quickly with ENTREZID.

b) Within DOSE:::GSEA_fgsea(), the call

Description <- TERM2NAME(tmp_res$pathway, USER_DATA)

returns a named vector whose names apparently include NAs. This leads to the error message

Error in data.frame(ID = as.character(tmp_res$pathway), Description = Description, : row names contain missing values

when res <- data.frame is assembled.

Adding Description <- unname(Description) before res is assembled solves this.
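The failure mode in b) can be reproduced in base R without DOSE. The sketch below uses made-up GO IDs and descriptions as stand-ins for tmp_res$pathway and the TERM2NAME() result; it only illustrates the mechanism and is not the DOSE code itself.

```r
# Made-up stand-ins for tmp_res$pathway and the TERM2NAME() result.
pathway <- c("GO:0000001", "GO:0000002")
Description <- c("term one", "term two")
names(Description) <- c("GO:0000001", NA)  # one name is missing

# data.frame() takes its row names from the named vector and rejects the NA.
err <- tryCatch(
  data.frame(ID = pathway, Description = Description, stringsAsFactors = FALSE),
  error = function(e) e
)
print(conditionMessage(err))

# Dropping the names first lets data.frame() fall back to automatic row names.
Description <- unname(Description)
res <- data.frame(ID = pathway, Description = Description, stringsAsFactors = FALSE)
print(res)
```

Applying the fix to an installed package means inserting the unname() line into a local copy of the function right after the TERM2NAME() call; how best to patch DOSE is not covered in this thread.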

Best,
Max

sk-sahu commented 5 years ago

I am facing the same issue. With enrichGO() it's taking forever, even for 10 Ensembl IDs.
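Since ENTREZID keys were reported above to run quickly, one possible workaround is to convert Ensembl IDs to Entrez IDs before calling the enrichment functions. The thread does not confirm that this avoids the hang, so treat the following as a sketch. remap_gene_list is a hypothetical helper, the IDs and values are made up, and in practice the mapping table might come from clusterProfiler::bitr().

```r
# Hypothetical ranked statistics keyed by Ensembl-style IDs (made-up values).
gene_list <- c(ENSG000001 = 2.5, ENSG000002 = 1.2,
               ENSG000003 = -0.7, ENSG000004 = -1.9)

# Hypothetical mapping table with the ambiguities real tables tend to have:
# one source ID with two targets, one target shared by two source IDs,
# and one source ID with no mapping at all.
id_map <- data.frame(
  ENSEMBL  = c("ENSG000001", "ENSG000002", "ENSG000002", "ENSG000004"),
  ENTREZID = c("101", "102", "103", "101"),
  stringsAsFactors = FALSE
)

remap_gene_list <- function(gene_list, id_map,
                            from = "ENSEMBL", to = "ENTREZID") {
  # Keep only mappings for IDs that occur in the list
  id_map <- id_map[id_map[[from]] %in% names(gene_list), , drop = FALSE]
  # One target per source ID: keep the first mapping listed
  id_map <- id_map[!duplicated(id_map[[from]]), , drop = FALSE]
  out <- gene_list[id_map[[from]]]
  names(out) <- id_map[[to]]
  # One value per target ID: keep the entry with the largest absolute value
  out <- out[order(abs(out), decreasing = TRUE)]
  out <- out[!duplicated(names(out))]
  # GSEA-style functions expect the list sorted in decreasing order
  sort(out, decreasing = TRUE)
}

entrez_list <- remap_gene_list(gene_list, id_map)
print(entrez_list)
```

How duplicates are resolved here (first mapping, largest absolute statistic) is an arbitrary choice for the sketch; other rules may suit a given dataset better.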

maryellenlynall commented 5 years ago

I'm getting the same problem with GSEA.

lucolotto commented 4 years ago

Hi all, I am facing the same issue and saw this post. I was wondering if you could explain to me how to use this solution:

Description <- unname(Description)

to fix the problem.

Many thanks.
Best,
Luca

GuangchuangYu commented 3 years ago

We optimized the code; it now takes less memory and runs faster in clusterProfiler 4.0.