YuLab-SMU / clusterProfiler

A universal enrichment tool for interpreting omics data
https://yulab-smu.top/biomedical-knowledge-mining-book/

enrichGO uses same number of background genes for different terms #553

Open Leonrunning opened 1 year ago

Leonrunning commented 1 year ago

Hi,

I have been using this wonderful package for a few years, and it works great for my analyses of mouse and human data. Recently I started working on Gallus (chicken) data and ran into some issues with enrichGO.

As shown below, the number of background genes is the same for every term, although it should differ between terms. I am using the latest versions of R and clusterProfiler. Could you please let me know how I can fix this issue?

thanks

[image: 1678089281119]

Leonrunning commented 1 year ago

I found a similar issue, #527. The denominator of GeneRatio should be different for each term, but in clusterProfiler it is 100 for all of them. Is there an issue with the new version of clusterProfiler?

guidohooiveld commented 1 year ago

Could you please provide some reproducible code? What does your input look like? Do you analyze all GO categories, or only a subset? Etc...

guidohooiveld commented 1 year ago

@Leonrunning :

[[edited; removed my answer since I was wrong...]]

> library(clusterProfiler)
> 
> data(geneList, package = "DOSE")
> de <- names(geneList)[1:750]
> 
> yy <- enrichGO(de, 'org.Hs.eg.db', ont="BP", pvalueCutoff=1)

> 
> dim(yy)
[1] 778   9
> 
> as.data.frame(yy)[1:25,1:4]
                   ID
GO:0000070 GO:0000070
GO:0000819 GO:0000819
GO:0000280 GO:0000280
GO:0140014 GO:0140014
GO:0006261 GO:0006261
GO:0007059 GO:0007059
GO:0098813 GO:0098813
GO:1905818 GO:1905818
GO:0006260 GO:0006260
GO:0044786 GO:0044786
GO:0051983 GO:0051983
GO:0090329 GO:0090329
GO:0033046 GO:0033046
GO:0033048 GO:0033048
GO:2000816 GO:2000816
GO:0051985 GO:0051985
GO:1905819 GO:1905819
GO:0010965 GO:0010965
GO:0006268 GO:0006268
GO:0051304 GO:0051304
GO:0051306 GO:0051306
GO:0033047 GO:0033047
GO:0033260 GO:0033260
GO:0044772 GO:0044772
GO:0033045 GO:0033045
                                                           Description
GO:0000070                        mitotic sister chromatid segregation
GO:0000819                                sister chromatid segregation
GO:0000280                                            nuclear division
GO:0140014                                    mitotic nuclear division
GO:0006261                               DNA-templated DNA replication
GO:0007059                                      chromosome segregation
GO:0098813                              nuclear chromosome segregation
GO:1905818                         regulation of chromosome separation
GO:0006260                                             DNA replication
GO:0044786                                  cell cycle DNA replication
GO:0051983                        regulation of chromosome segregation
GO:0090329                 regulation of DNA-templated DNA replication
GO:0033046         negative regulation of sister chromatid segregation
GO:0033048 negative regulation of mitotic sister chromatid segregation
GO:2000816  negative regulation of mitotic sister chromatid separation
GO:0051985               negative regulation of chromosome segregation
GO:1905819                negative regulation of chromosome separation
GO:0010965           regulation of mitotic sister chromatid separation
GO:0006268                   DNA unwinding involved in DNA replication
GO:0051304                                       chromosome separation
GO:0051306                         mitotic sister chromatid separation
GO:0033047          regulation of mitotic sister chromatid segregation
GO:0033260                                     nuclear DNA replication
GO:0044772                         mitotic cell cycle phase transition
GO:0033045                  regulation of sister chromatid segregation
           GeneRatio   BgRatio
GO:0000070    45/720 204/18903
GO:0000819    47/720 239/18903
GO:0000280    67/720 481/18903
GO:0140014    54/720 325/18903
GO:0006261    36/720 166/18903
GO:0007059    55/720 382/18903
GO:0098813    50/720 321/18903
GO:1905818    29/720 111/18903
GO:0006260    46/720 286/18903
GO:0044786    19/720  42/18903
GO:0051983    31/720 132/18903
GO:0090329    21/720  57/18903
GO:0033046    20/720  51/18903
GO:0033048    20/720  51/18903
GO:2000816    20/720  51/18903
GO:0051985    20/720  53/18903
GO:1905819    20/720  53/18903
GO:0010965    26/720  98/18903
GO:0006268    14/720  22/18903
GO:0051304    30/720 135/18903
GO:0051306    26/720 101/18903
GO:0033047    20/720  56/18903
GO:0033260    17/720  38/18903
GO:0044772    57/720 473/18903
GO:0033045    26/720 107/18903
> 

> 
> sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] clusterProfiler_4.6.1

loaded via a namespace (and not attached):
  [1] nlme_3.1-162           bitops_1.0-7           ggtree_3.6.2          
  [4] enrichplot_1.18.3      bit64_4.0.5            HDO.db_0.99.1         
  [7] RColorBrewer_1.1-3     httr_1.4.5             GenomeInfoDb_1.34.9   
 [10] tools_4.2.2            utf8_1.2.3             R6_2.5.1              
 [13] lazyeval_0.2.2         DBI_1.1.3              BiocGenerics_0.44.0   
 [16] colorspace_2.1-0       withr_2.5.0            tidyselect_1.2.0      
 [19] gridExtra_2.3          bit_4.0.5              compiler_4.2.2        
 [22] cli_3.6.0              Biobase_2.58.0         scatterpie_0.1.8      
 [25] shadowtext_0.1.2       scales_1.2.1           stringr_1.5.0         
 [28] digest_0.6.31          yulab.utils_0.0.6      gson_0.0.9            
 [31] DOSE_3.24.2            XVector_0.38.0         pkgconfig_2.0.3       
 [34] fastmap_1.1.1          rlang_1.0.6            RSQLite_2.3.0         
 [37] gridGraphics_0.5-1     farver_2.1.1           generics_0.1.3        
 [40] jsonlite_1.8.4         BiocParallel_1.32.5    GOSemSim_2.24.0       
 [43] dplyr_1.1.0            RCurl_1.98-1.10        magrittr_2.0.3        
 [46] ggplotify_0.1.0        GO.db_3.16.0           GenomeInfoDbData_1.2.9
 [49] patchwork_1.1.2        Matrix_1.5-3           Rcpp_1.0.10           
 [52] munsell_0.5.0          S4Vectors_0.36.2       fansi_1.0.4           
 [55] ape_5.7                viridis_0.6.2          lifecycle_1.0.3       
 [58] stringi_1.7.12         ggraph_2.1.0           MASS_7.3-58.2         
 [61] zlibbioc_1.44.0        org.Hs.eg.db_3.16.0    plyr_1.8.8            
 [64] qvalue_2.30.0          grid_4.2.2             blob_1.2.3            
 [67] parallel_4.2.2         ggrepel_0.9.3          crayon_1.5.2          
 [70] lattice_0.20-45        graphlayouts_0.8.4     Biostrings_2.66.0     
 [73] cowplot_1.1.1          splines_4.2.2          KEGGREST_1.38.0       
 [76] pillar_1.8.1           fgsea_1.24.0           igraph_1.4.1          
 [79] reshape2_1.4.4         codetools_0.2-19       stats4_4.2.2          
 [82] fastmatch_1.1-3        glue_1.6.2             ggfun_0.0.9           
 [85] downloader_0.4         data.table_1.14.8      treeio_1.22.0         
 [88] png_0.1-8              vctrs_0.5.2            tweenr_2.0.2          
 [91] gtable_0.3.1           purrr_1.0.1            polyclip_1.10-4       
 [94] tidyr_1.3.0            cachem_1.0.7           ggplot2_3.4.1         
 [97] ggforce_0.4.1          tidygraph_1.2.3        tidytree_0.4.2        
[100] viridisLite_0.4.1      tibble_3.1.8           aplot_0.1.9           
[103] AnnotationDbi_1.60.0   memoise_2.0.1          IRanges_2.32.0        
> 
>
Leonrunning commented 1 year ago

Thanks for your response. I probably misunderstood the calculation of GeneRatio (M/N). N is the total number of genes detected in the gene list, not the size of a specific gene set, right? If so, N is the number of genes provided in our gene list, and it should be the same for every term.

Then my issue is that this number is much lower than what I provided: I provided more than 600 genes, but N is only 100. The number of background genes in BgRatio is also quite low, only about 2000.

I tried Panther and it reports more genes than clusterProfiler. Please see the image below.

guidohooiveld commented 1 year ago

Aha, I stand corrected!

You are right regarding the calculation of the gene ratio! I got confused, and edited my answer above.

Thus, in the example code from my post above, 720 is the number of genes, out of the 750 selected genes provided, that could be annotated to any GO-BP category. This is the denominator, and it should indeed be the same for all categories. The numerator is the number of genes in the provided list that are annotated to a specific GO-BP category.
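To make the counting concrete, here is a minimal base-R sketch of how the two ratios can be derived from a term-to-gene table. The table, the gene IDs, and the variable names are all made up for illustration, and the sketch assumes that the background is simply every gene with at least one annotation; it is not the package's own code.

```r
# Toy annotation: GO term -> annotated genes (all IDs are invented for illustration).
term2gene <- data.frame(
  term = c("GO:A", "GO:A", "GO:A", "GO:B", "GO:B", "GO:C"),
  gene = c("g1",   "g2",   "g3",   "g2",   "g4",   "g5"),
  stringsAsFactors = FALSE
)

# Hypothetical input list; g8 and g9 have no annotation at all.
gene_list <- c("g1", "g2", "g4", "g8", "g9")

annotated <- unique(term2gene$gene)           # genes with any annotation (assumed background)
usable    <- intersect(gene_list, annotated)  # input genes that can be annotated
n_usable  <- length(usable)                   # GeneRatio denominator, same for every term
n_bg      <- length(annotated)                # BgRatio denominator, same for every term

# One row per term: numerators differ, denominators do not.
ratios <- do.call(rbind, lapply(split(term2gene$gene, term2gene$term), function(genes) {
  data.frame(
    GeneRatio = paste0(sum(usable %in% genes), "/", n_usable),
    BgRatio   = paste0(length(unique(genes)), "/", n_bg),
    stringsAsFactors = FALSE
  )
}))
print(ratios)
```

In this toy case five genes are supplied but only three carry an annotation, so every GeneRatio has 3 as its denominator. A denominator much smaller than the input list therefore points to input IDs that are missing from the annotation.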

This ratio (GeneRatio) is then compared with the corresponding ratio for the whole GO-BP background (BgRatio) to check for statistically significant overrepresentation of a GO-BP category in the list of selected genes.
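As a rough illustration of that comparison, the sketch below plugs the counts from the GO:0000070 row of the output above (45/720 and 204/18903) into a one-sided hypergeometric test using base R's `phyper`. The choice of test is an assumption, since this thread does not say which test is used, and the variable names are only illustrative.

```r
# Counts taken from the GO:0000070 row of the output above.
k <- 45     # selected genes annotated to the term (GeneRatio numerator)
n <- 720    # selected genes with any GO-BP annotation (GeneRatio denominator)
M <- 204    # background genes annotated to the term (BgRatio numerator)
N <- 18903  # background genes with any GO-BP annotation (BgRatio denominator)

gene_ratio <- k / n                # share of the selected genes in this term
bg_ratio   <- M / N                # share of the background in this term
fold       <- gene_ratio / bg_ratio

# Assumed test: probability of drawing k or more term genes when sampling
# n genes without replacement from a background of N that contains M term genes.
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

print(c(gene_ratio = gene_ratio, bg_ratio = bg_ratio, fold = fold, p_value = p_value))
```

Because both denominators enter the test, a background of about 2000 genes instead of 18903 changes the p-values as well as the displayed ratios.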

Leonrunning commented 1 year ago

Thanks, but do you know why:

1) I provided more than 600 genes in my gene list, but only 100 of them were used in the GeneRatio. The same seems to have happened in https://github.com/YuLab-SMU/clusterProfiler/issues/527 (98 or 100 genes).
2) The number of background genes in BgRatio is also quite low, only about 2000. The human data above has 18903.

Is this an issue with enrichGO, or could it be an issue with the chicken OrgDb?