YuLab-SMU / clusterProfiler

:bar_chart: A universal enrichment tool for interpreting omics data
https://yulab-smu.top/biomedical-knowledge-mining-book/
1.02k stars 256 forks source link

Question about GO terms in clusterProfiler #720

Open shuaizh117 opened 2 months ago

shuaizh117 commented 2 months ago

I was confused by GO terms popped out in my GO analysis using enrichGO. For example, "GO:0141091: transforming growth factor beta receptor superfamily signaling pathway" was reported as one of the top overrepresented GO in my sig DEGs. And below is the gene list "HSPA1A/TNFAIP6/FST/ZBTB7A/HES1/ITGB8/TGFB2/MYOCD/LATS2/THBS1/GREM1" enriched to this term. However, when I refer to PANTHER regarding GO:0141091, I only found TGFB2 in this GO term. If the other genes do not belong to this GO:0141091, then how did this GO get enriched?? I think I'm still not very clear about the behind scene story of how enrichGO perform the calculation.

sessionInfo() R version 4.4.0 (2024-04-24 ucrt) Platform: x86_64-w64-mingw32/x64 Running under: Windows 11 x64 (build 22631)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: America/Chicago tzcode source: internal

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] viridis_0.6.5 viridisLite_0.4.2
[3] topGO_2.56.0 SparseM_1.81
[5] GO.db_3.19.1 graph_1.82.0
[7] pathview_1.44.0 enrichplot_1.24.0
[9] clusterProfiler_4.12.3 ggpubr_0.6.0
[11] corrplot_0.92 annotables_0.2.0
[13] RColorBrewer_1.1-3 ggrepel_0.9.5
[15] pheatmap_1.0.12 lubridate_1.9.3
[17] forcats_1.0.0 stringr_1.5.1
[19] dplyr_1.1.4 purrr_1.0.2
[21] readr_2.1.5 tidyr_1.3.1
[23] tibble_3.2.1 tidyverse_2.0.0
[25] ggplot2_3.5.1 org.Hs.eg.db_3.19.1
[27] AnnotationDbi_1.66.0 DESeq2_1.44.0
[29] SummarizedExperiment_1.34.0 Biobase_2.64.0
[31] MatrixGenerics_1.16.0 matrixStats_1.3.0
[33] GenomicRanges_1.56.0 GenomeInfoDb_1.40.0
[35] IRanges_2.38.0 S4Vectors_0.42.0
[37] BiocGenerics_0.50.0

loaded via a namespace (and not attached): [1] rstudioapi_0.16.0 jsonlite_1.8.8 magrittr_2.0.3
[4] farver_2.1.1 rmarkdown_2.26 fs_1.6.4
[7] zlibbioc_1.50.0 vctrs_0.6.5 memoise_2.0.1
[10] RCurl_1.98-1.14 ggtree_3.12.0 rstatix_0.7.2
[13] htmltools_0.5.8.1 S4Arrays_1.4.0 broom_1.0.5
[16] SparseArray_1.4.3 gridGraphics_0.5-1 plyr_1.8.9
[19] httr2_1.0.1 cachem_1.0.8 igraph_2.0.3
[22] lifecycle_1.0.4 pkgconfig_2.0.3 Matrix_1.7-0
[25] R6_2.5.1 fastmap_1.1.1 gson_0.1.0
[28] GenomeInfoDbData_1.2.12 digest_0.6.31 aplot_0.2.2
[31] colorspace_2.1-0 patchwork_1.2.0 RSQLite_2.3.6
[34] labeling_0.4.3 timechange_0.3.0 fansi_1.0.4
[37] httr_1.4.7 polyclip_1.10-6 abind_1.4-5
[40] compiler_4.4.0 bit64_4.0.5 withr_3.0.0
[43] backports_1.4.1 BiocParallel_1.38.0 carData_3.0-5
[46] DBI_1.2.2 ggforce_0.4.2 ggsignif_0.6.4
[49] MASS_7.3-60.2 rappdirs_0.3.3 DelayedArray_0.30.1
[52] HDO.db_0.99.1 tools_4.4.0 ape_5.8
[55] scatterpie_0.2.2 glue_1.6.2 nlme_3.1-164
[58] GOSemSim_2.30.0 grid_4.4.0 shadowtext_0.1.3
[61] reshape2_1.4.4 fgsea_1.30.0 generics_0.1.3
[64] gtable_0.3.5 tzdb_0.4.0 hms_1.1.3
[67] data.table_1.15.4 car_3.1-2 tidygraph_1.3.1
[70] utf8_1.2.3 XVector_0.44.0 pillar_1.9.0
[73] yulab.utils_0.1.6 limma_3.60.0 splines_4.4.0
[76] tweenr_2.0.3 treeio_1.28.0 lattice_0.22-6
[79] bit_4.0.5 tidyselect_1.2.1 locfit_1.5-9.9
[82] Biostrings_2.72.0 knitr_1.46 gridExtra_2.3
[85] xfun_0.43 graphlayouts_1.1.1 statmod_1.5.0
[88] KEGGgraph_1.64.0 stringi_1.8.4 UCSC.utils_1.0.0
[91] lazyeval_0.2.2 ggfun_0.1.4 yaml_2.3.7
[94] evaluate_0.23 codetools_0.2-20 ggraph_2.2.1
[97] qvalue_2.36.0 Rgraphviz_2.48.0 ggplotify_0.1.2
[100] cli_3.6.2 munsell_0.5.1 Rcpp_1.0.12
[103] png_0.1-8 XML_3.99-0.16.1 parallel_4.4.0
[106] blob_1.2.4 DOSE_3.30.0 bitops_1.0-7
[109] tidytree_0.4.6 scales_1.3.0 crayon_1.5.2
[112] rlang_1.1.3 cowplot_1.1.3 fastmatch_1.1-4
[115] KEGGREST_1.44.0

ifitsok commented 2 months ago

I came across the same problem, does anyone know what's going on?

The enrichment result was got; but when I used the genes enriched to look up to their original GO ID, I just couldn't find the GO IDs that the software provided from the original GO IDs of the genes.

ifitsok commented 2 months ago

Hey bro,

https://www.ebi.ac.uk/QuickGO/. Maybe you can go to this website and search your GO IDs, the hitten GOID might be the parent term of the annotated GOID for your genes. (Although there is some exclusion).

guidohooiveld commented 2 months ago

enrichGO() (and other GO-based analyses) uses the Gene Ontology information present in the Bioconductor package GO.db. (here).

To see which genes have been annotated to a specific GO category in the GO.db package you can for example do: (note the use of keytype = "GOALL", and not (just) "GO"!)

> library(org.Hs.eg.db)
> library(GO.db)
> 
> ## query
> results <- AnnotationDbi::select(org.Hs.eg.db, keys=c("GO:0141091"), columns = c('ENTREZID','SYMBOL'), keytype = "GOALL")
'select()' returned 1:many mapping between keys and columns
> 
> dim(results)
[1] 592   5
> head(results)
       GOALL EVIDENCEALL ONTOLOGYALL ENTREZID SYMBOL
1 GO:0141091         IEA          BP       25   ABL1
2 GO:0141091         IBA          BP       90  ACVR1
3 GO:0141091         IDA          BP       90  ACVR1
4 GO:0141091         IMP          BP       90  ACVR1
5 GO:0141091         ISS          BP       90  ACVR1
6 GO:0141091         IBA          BP       91 ACVR1B
> 
> ## remove duplicates 
> results <- results[!duplicated(results[,"ENTREZID"]), ]
> 
> dim(results)
[1] 388   5
> head(results)
        GOALL EVIDENCEALL ONTOLOGYALL ENTREZID SYMBOL
1  GO:0141091         IEA          BP       25   ABL1
2  GO:0141091         IBA          BP       90  ACVR1
6  GO:0141091         IBA          BP       91 ACVR1B
10 GO:0141091         IBA          BP       92 ACVR2A
13 GO:0141091         IBA          BP       93 ACVR2B
17 GO:0141091         IBA          BP       94 ACVRL1
> 

So 388 genes have been annotated to this particular GO category.

Manually confirm that all enriched genes are in these as well? Note that last gene is not! (I just added GAPDH as example). Answer: YES!

> enriched.genes <- c("HSPA1A","TNFAIP6","FST","ZBTB7A","HES1","ITGB8",
+                     "TGFB2","MYOCD","LATS2","THBS1","GREM1","GAPDH")
> enriched.genes %in% results[,"SYMBOL"]
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
>

As said, enrichGO uses GOALL, so also taking into account the parent-child relationships, but maybe PANTHER does not? Therefore only directly annotated genes are retrieved? In this context, see also this recent discussion: https://github.com/YuLab-SMU/clusterProfiler/issues/697#issuecomment-2189017067

Also note that although the last release of PANTHER (=19) is from June 2024, the underlying GO annotations used for this release seem to be from May 2023 (thus a year old). https://www.pantherdb.org/news/news20240620.jsp

Yet, the source of the GO.db is2024-Mar12.

> GO.db
GODb object:
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
| GOSOURCEDATE: 2024-01-17
| Db type: GODb
| package: AnnotationDbi
| DBSCHEMA: GO_DB
| GOEGSOURCEDATE: 2024-Mar12
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| DBSCHEMAVERSION: 2.1

Please see: help('select') for usage information
> 

Both could explain your observation.