YuLab-SMU / clusterProfiler

:bar_chart: A universal enrichment tool for interpreting omics data
https://yulab-smu.top/biomedical-knowledge-mining-book/
967 stars 246 forks source link

How to use compareCluster for GSEA on the Msigdb database? #653

Open Ldec12 opened 6 months ago

Ldec12 commented 6 months ago

Hello, I am using compareCluster for multi-group GSEA. When using the org.Rn.eg.db database, I set fun=gseGO. If using the Msigdb database, how should I configure it? Appreciate your help.

guidohooiveld commented 6 months ago

Have you seen this information: http://yulab-smu.top/biomedical-knowledge-mining-book/universal-api.html#msigdb-analysis ?

Ldec12 commented 6 months ago

"I have read it. What I mean is using the compareCluster function for analysis, Can the 'fun' parameter accept the value 'GSEA'? I would greatly appreciate your assistance." for example,

compareCluster(geneList, fun="GSEA", TERM2GENE=m_df, pvalueCutoff=0.05)

thank you

guidohooiveld commented 6 months ago

Yes, that is indeed the way to do it. Note that you can include all arguments that you could set when analyzing a single dataset.

Using the included, human example dataset:

> library(msigdbr)
> library(clusterProfiler)
> library(enrichplot)
> 
> 
> ## using the code from the biomedical knowledge mining book.
> ## retrieve all human MSigDb gene sets for category C8 (cell type signature gene sets)
> m_dfC8 <- msigdbr(species = "Homo sapiens", category = "C8") %>%
+         dplyr::select(gs_name, entrez_gene)
> head(m_dfC8, 2)
# A tibble: 2 × 2
  gs_name                    entrez_gene
  <chr>                            <int>
1 AIZARANI_LIVER_C10_MVECS_1           2
2 AIZARANI_LIVER_C10_MVECS_1        2532
> 
> ## load sample data, and create 3 random sets of sign. regulated genes.
> ## Note that these are human ids.
> data(geneList, package = "DOSE")
> de1 <- names(geneList)[1:500]
> de2 <- names(geneList)[301:800]
> de3 <- names(geneList)[401:600]
> 
> ## combine all lists of significant genes into a list, and check
> cp.input.ora <- list("Exp. 1"=de1, "Exp. 2"=de2, "Exp. 3"=de3)
> str(cp.input.ora)
List of 3
 $ Exp. 1: chr [1:500] "4312" "8318" "10874" "55143" ...
 $ Exp. 2: chr [1:500] "22948" "768" "79315" "1230" ...
 $ Exp. 3: chr [1:200] "3429" "80830" "23590" "10397" ...
> 
> ## Perform overrepresentation analysis using generic function enricher.
> ## Note that no significance cutoff is implemented, because random gene sets
> ## are used, and this is just to illustrate the code as such.
> ##
> ## Next, just to be sure calculate similarity matrices of terms, and convert ids
> ## into symbols
> MSigDb.ora <- compareCluster(geneClusters=cp.input.ora,  fun = "enricher", TERM2GENE=m_dfC8,
+                              pvalueCutoff = 1, pAdjustMethod = "none", minGSSize = 10, maxGSSize = 500)
> MSigDb.ora <- enrichplot::pairwise_termsim(MSigDb.ora) 
> MSigDb.ora <- setReadable(MSigDb.ora, 'org.Hs.eg.db', 'ENTREZID')
> MSigDb.ora
#
# Result of Comparing 3 gene clusters 
#
#.. @fun         enricher 
#.. @geneClusters       List of 3
 $ Exp. 1: chr [1:500] "4312" "8318" "10874" "55143" ...
 $ Exp. 2: chr [1:500] "22948" "768" "79315" "1230" ...
 $ Exp. 3: chr [1:200] "3429" "80830" "23590" "10397" ...
#...Result      'data.frame':   513 obs. of  10 variables:
 $ Cluster    : Factor w/ 3 levels "Exp. 1","Exp. 2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ID         : chr  "ZHONG_PFC_C1_OPC" "FAN_EMBRYONIC_CTX_NSC_2" "FAN_EMBRYONIC_CTX_MICROGLIA_1" "MANNO_MIDBRAIN_NEUROTYPES_HPROGBP" ...
 $ Description: chr  "ZHONG_PFC_C1_OPC" "FAN_EMBRYONIC_CTX_NSC_2" "FAN_EMBRYONIC_CTX_MICROGLIA_1" "MANNO_MIDBRAIN_NEUROTYPES_HPROGBP" ...
 $ GeneRatio  : chr  "96/473" "94/473" "75/473" "88/473" ...
 $ BgRatio    : chr  "238/19170" "233/19170" "155/19170" "300/19170" ...
 $ pvalue     : num  2.87e-92 2.67e-90 3.13e-79 3.37e-70 8.48e-67 ...
 $ p.adjust   : num  2.87e-92 2.67e-90 3.13e-79 3.37e-70 8.48e-67 ...
 $ qvalue     : num  1.12e-89 5.18e-88 4.05e-77 3.27e-68 6.59e-65 ...
 $ geneID     : chr  "NMU/CDCA8/CDC20/FOXM1/KIF23/CENPE/MELK/CCNB2/NDC80/TOP2A/NCAPH/ASPM/RRM2/CEP55/DLGAP5/UBE2C/HJURP/SKA1/PBK/NUSA"| __truncated__ "NMU/CDCA8/CDC20/FOXM1/KIF23/CENPE/MELK/CCNB2/NDC80/TOP2A/NCAPH/ASPM/RRM2/CEP55/DLGAP5/UBE2C/HJURP/SKA1/PBK/NUSA"| __truncated__ "CDCA8/CDC20/FOXM1/KIF23/CENPE/MELK/CCNB2/NDC80/TOP2A/NCAPH/ASPM/RRM2/CEP55/DLGAP5/UBE2C/HJURP/PBK/NUSAP1/CDCA3/"| __truncated__ "CDCA8/FOXM1/KIF23/CENPE/MYBL2/MELK/CCNB2/TOP2A/NCAPH/ASPM/RRM2/DLGAP5/UBE2C/HJURP/PBK/NUSAP1/CDCA3/TPX2/TACC3/N"| __truncated__ ...
 $ Count      : int  96 94 75 88 63 63 85 88 53 65 ...
#.. number of enriched terms found for each gene cluster:
#..   Exp. 1: 184 
#..   Exp. 2: 206 
#..   Exp. 3: 123 
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, 
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. 
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. 
The Innovation. 2021, 2(3):100141 

> 
> dotplot(MSigDb.ora, showCategory = 5, font.size = 6)
> 

test_ora

> ## along the same line, perform GSEA
> 
> ## load human example data; note that the input is 3x the same!
> cp.input.gsea <- list("Exp. 1"=geneList, "Exp. 2"=geneList, "Exp. 3"=geneList)
> str(cp.input.gsea)
List of 3
 $ Exp. 1: Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
  ..- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
 $ Exp. 2: Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
  ..- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
 $ Exp. 3: Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
  ..- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
> 
> MSigDb.gsea <- compareCluster(geneClusters=cp.input.gsea,  fun = "GSEA",  eps = 0, TERM2GENE=m_dfC8,
+                              pvalueCutoff = 1, pAdjustMethod = "none", minGSSize = 10, maxGSSize = 500)
> MSigDb.gsea <- enrichplot::pairwise_termsim(MSigDb.gsea) 
> MSigDb.gsea <- setReadable(MSigDb.gsea, 'org.Hs.eg.db', 'ENTREZID')
> MSigDb.gsea
#
# Result of Comparing 3 gene clusters 
#
#.. @fun         GSEA 
#.. @geneClusters       List of 3
 $ Exp. 1: Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
  ..- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
 $ Exp. 2: Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
  ..- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
 $ Exp. 3: Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
  ..- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
#...Result      'data.frame':   1923 obs. of  12 variables:
 $ Cluster        : Factor w/ 3 levels "Exp. 1","Exp. 2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ID             : chr  "FAN_EMBRYONIC_CTX_NSC_2" "ZHONG_PFC_C1_OPC" "FAN_EMBRYONIC_CTX_MICROGLIA_1" "GAO_LARGE_INTESTINE_ADULT_CJ_IMMUNE_CELLS" ...
 $ Description    : chr  "FAN_EMBRYONIC_CTX_NSC_2" "ZHONG_PFC_C1_OPC" "FAN_EMBRYONIC_CTX_MICROGLIA_1" "GAO_LARGE_INTESTINE_ADULT_CJ_IMMUNE_CELLS" ...
 $ setSize        : int  185 186 130 392 114 71 200 236 220 257 ...
 $ enrichmentScore: num  0.822 0.81 0.848 -0.643 0.83 ...
 $ NES            : num  3.64 3.58 3.58 -2.83 3.45 ...
 $ pvalue         : num  8.72e-64 1.24e-59 2.67e-51 1.71e-46 2.82e-40 ...
 $ p.adjust       : num  8.72e-64 1.24e-59 2.67e-51 1.71e-46 2.82e-40 ...
 $ qvalue         : num  2.83e-61 2.02e-57 2.88e-49 1.39e-44 1.83e-38 ...
 $ rank           : num  937 937 937 1890 1044 ...
 $ leading_edge   : chr  "tags=59%, list=7%, signal=56%" "tags=61%, list=7%, signal=57%" "tags=68%, list=7%, signal=63%" "tags=50%, list=15%, signal=44%" ...
 $ core_enrichment: chr  "NMU/CDCA8/CDC20/FOXM1/KIF23/CENPE/MELK/CCNB2/NDC80/TOP2A/NCAPH/ASPM/RRM2/CEP55/DLGAP5/UBE2C/HJURP/SKA1/PBK/NUSA"| __truncated__ "NMU/CDCA8/CDC20/FOXM1/KIF23/CENPE/MELK/CCNB2/NDC80/TOP2A/NCAPH/ASPM/RRM2/CEP55/DLGAP5/UBE2C/HJURP/SKA1/PBK/NUSA"| __truncated__ "CDCA8/CDC20/FOXM1/KIF23/CENPE/MELK/CCNB2/NDC80/TOP2A/NCAPH/ASPM/RRM2/CEP55/DLGAP5/UBE2C/HJURP/PBK/NUSAP1/CDCA3/"| __truncated__ "COL4A6/TEK/RAMP3/NID1/DZIP1/LAMA4/EGFL7/GPX3/COL6A1/PTH1R/CRIP2/LIMCH1/SCD5/CD93/WWC3/IL1R1/SALL1/WNT5A/FILIP1L"| __truncated__ ...
#.. number of enriched terms found for each gene cluster:
#..   Exp. 1: 641 
#..   Exp. 2: 641 
#..   Exp. 3: 641 
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, 
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. 
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. 
The Innovation. 2021, 2(3):100141 

> 
> 
> dotplot(MSigDb.gsea, font.size=8, showCategory=5, title =("GSEA C8 sets"), split=".sign") + facet_grid(.~.sign)
> 
> 

test_gsea

Ldec12 commented 6 months ago

Thank you very much for your help