YuLab-SMU / enrichplot

Visualization of Functional Enrichment Result
https://yulab-smu.top/biomedical-knowledge-mining-book/
226 stars 64 forks source link

gene name in cnetplot--with own GO annotation data #275

Open kersng opened 4 months ago

kersng commented 4 months ago

Hi, I have my own GO annotation data with multiple treatment groups for comparison, and so I did the comparecluster with 'enricher' function. I would like to visualize the result in cnetplot with gene name. Do you know how to change the gene ID into the gene name in this case? I tried the 'setReadable' function but it was not working since my microorganism is not from the Org database.

f=compareCluster(GeneID~group+treatment, data = big_df, fun="enricher", TERM2GENE = go_gene, TERM2NAME = go_terms, pvalueCutoff = 0.05, pAdjustMethod = "BH")

Thank you! :)

guidohooiveld commented 3 months ago

In case you were not able to solve this yet:

The function setReadable is indeed normally used to do this, but as you said it depends on an OrgDb. Yet, the nice thing of open-source software is that... the source (code) is open to everyone!

So you could lookup the code of setReadable and tweak it according to your use case. See below for an example. Note that setReadable is part of DOSE. https://github.com/YuLab-SMU/DOSE/blob/devel/R/setReadable.R

Key is line 53 of setReadable: https://github.com/YuLab-SMU/DOSE/blob/98301d9860df116cd89852c660cd0fd132efac2e/R/setReadable.R#L53 gn <- EXTID2NAME(OrgDb, genes, keyType))

In this line 53 the gene ids (= genes) are converted in symbols, and stored in object gn. Note that gn is just a simple character vector (with names that are the gene ids!).

Thus: if you create such character vector, independent from an OrgDb, then you are set!

To illustrate the above let's get started by generating a compareClusterResult using the example human dataset; note that I generate a compareClusterResult for KEGG-based gene sets, but in order to show the concept that doesn't matter. The essence is that the input are entrezid, that are then converted into gene symbols.

> ## load libraries
> library(clusterProfiler)
> library(org.Hs.eg.db)
> 
> ##load example data
> data(gcSample)
> 
> ## run enrichKEGG on all clusters
> ck <- compareCluster(geneCluster = gcSample, fun = enrichKEGG)
> 
> ## check output
> ck
#
# Result of Comparing 8 gene clusters 
#
#.. @fun         enrichKEGG 
#.. @geneClusters       List of 8
 $ X1: chr [1:216] "4597" "7111" "5266" "2175" ...
 $ X2: chr [1:805] "23450" "5160" "7126" "26118" ...
 $ X3: chr [1:392] "894" "7057" "22906" "3339" ...
 $ X4: chr [1:838] "5573" "7453" "5245" "23450" ...
 $ X5: chr [1:929] "5982" "7318" "6352" "2101" ...
 $ X6: chr [1:585] "5337" "9295" "4035" "811" ...
 $ X7: chr [1:582] "2621" "2665" "5690" "3608" ...
 $ X8: chr [1:237] "2665" "4735" "1327" "3192" ...
#...Result      'data.frame':   76 obs. of  15 variables:
 $ Cluster       : Factor w/ 8 levels "X1","X2","X3",..: 2 2 2 3 3 3 4 4 4 4 ...
 $ category      : chr  "Human Diseases" "Human Diseases" "Cellular Processes" "Environmental Information Processing" ...
 $ subcategory   : chr  "Infectious disease: viral" "Immune disease" "Cell growth and death" "Signaling molecules and interaction" ...
 $ ID            : chr  "hsa05169" "hsa05340" "hsa04110" "hsa04512" ...
 $ Description   : chr  "Epstein-Barr virus infection" "Primary immunodeficiency" "Cell cycle" "ECM-receptor interaction" ...
 $ GeneRatio     : chr  "23/410" "8/410" "18/410" "9/195" ...
 $ BgRatio       : chr  "203/8842" "38/8842" "158/8842" "89/8842" ...
 $ RichFactor    : num  0.113 0.211 0.114 0.101 0.057 ...
 $ FoldEnrichment: num  2.44 4.54 2.46 4.59 2.59 ...
 $ zScore        : num  4.59 4.82 4.07 5.1 4.18 ...
 $ pvalue        : num  6.28e-05 2.86e-04 3.65e-04 1.42e-04 3.05e-04 ...
 $ p.adjust      : num  0.0198 0.0383 0.0383 0.0358 0.0358 ...
 $ qvalue        : num  0.0184 0.0357 0.0357 0.0328 0.0328 ...
 $ geneID        : chr  "4067/3383/7128/1869/890/1871/578/864/637/9641/6891/355/9134/5971/916/956/6850/7187/3551/919/4734/958/6772" "100/6891/3932/973/916/925/958/64421" "991/1869/890/1871/701/990/10926/9088/8317/9700/9134/1029/2810/699/11200/23594/8555/4173" "7057/3339/1299/3695/1101/3679/3910/3696/3693" ...
 $ Count         : int  23 8 18 9 17 19 17 10 22 19 ...
#.. number of enriched terms found for each gene cluster:
#..   X1: 0 
#..   X2: 3 
#..   X3: 3 
#..   X4: 21 
#..   X5: 9 
#..   X6: 1 
#..   X7: 18 
#..   X8: 21 
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, 
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. 
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. 
The Innovation. 2021, 2(3):100141 

>

The challenge is now to be able to manually convert the gene ids into symbols.

To do this a character vector is needed that 'links' the gene ids to symbols (i.e. my object id2symbol). Note that the names of the vector are the gene ids.

Note also that since the example dataset are human entrezids, the human OrgDb can be used to generate such character vector. This is what I do now. In your case you should generate such id-to-symbol vector manually.

> ## create id-to-symbol mapping vector for all gene ids
> ## first extract all gene ids from annotation database
> all.genes <- keys(org.Hs.eg.db)
> 
> ## then map all.genes (entrezid) to symbols
> ## note that the rownames of id2symbol are the (input) entrezids
> ## within the function setReadable 'id2symbol' corresponds to the output of the function 'EXTID2NAME', but then for all genes
> ## 'gn' is generated on line 53 (gn <- EXTID2NAME(OrgDb, genes, keyType) )
> id2symbol <-  mapIds(org.Hs.eg.db, keys=all.genes, column="SYMBOL", keytype="ENTREZID")
'select()' returned 1:1 mapping between keys and columns
> 
> class(id2symbol)
[1] "character"
> head(id2symbol)
      1       2       3       9      10      11 
 "A1BG"   "A2M" "A2MP1"  "NAT1"  "NAT2"  "NATP" 
> tail(id2symbol)
  133395148   133395149   133395150   133834869   134006662   134198642 
   "SKP1P4"    "AMANZI"   "LNCARGI"     "MLDHR" "SSU72-AS1"      "PSS3" 
>

Continue by running step-by-step the relevant parts of the code of setReadable:

> ## extract all gene ids from result object
> ## note that function geneInCategory is part of DOSE
> ## line 38 code
> gc <- geneInCategory(ck)
> 
> ## check whether compareCluster is based on ORA or GSEA analysis
> ## in case of latter, column named "core_enrichment" is present in result object
> ## lines 42-48
> if ("core_enrichment" %in% colnames(as.data.frame(ck))) {
+             geneslist <- ck@geneClusters
+             names(geneslist) <- NULL
+             genes <- unique(names(unlist(geneslist)))
+         } else {
+             genes <- unique(unlist(ck@geneClusters))
+         }     
>

Now the key step: modify line 53 to use id2symbol for all genes in the result object.

> # gn <- EXTID2NAME(org.Hs.eg.db, genes, keyType) ## original code line 53
> gn <- id2symbol[genes]

Continue with code.

> ## lines 57-69
> gc2 <- list()
> k <- 1
> for(i in seq_len(length(gc))) {
+     for(j in seq_len(length(gc[[i]]))) {
+         gc2[[k]] <- gc[[i]][[j]]
+         names(gc2)[k] <- paste(names(gc)[[i]], names(gc[[i]])[j], sep="-")
+         k <- k + 1
+     }
+ }
> gc <- gc2
> gc <- lapply(gc, function(i) gn[i])
> res <- ck@compareClusterResult
> gc <- gc[paste(res$Cluster, res$ID, sep= "-")]
> ## prepare converted ids to be put on result object
> ## line 81
> geneID <- sapply(gc, paste0, collapse="/")
>
> ## line 86
> res$geneID <- unlist(geneID)
>
> ## complete the output that added to the returned object
> ## lines 88-92
> ck@gene2Symbol <- gn
> ck@keytype <- keyType
> ck@readable <- TRUE
> ck@compareClusterResult <- res
> 

Check:

> ck
#
# Result of Comparing 8 gene clusters 
#
#.. @fun         enrichKEGG 
#.. @geneClusters       List of 8
 $ X1: chr [1:216] "4597" "7111" "5266" "2175" ...
 $ X2: chr [1:805] "23450" "5160" "7126" "26118" ...
 $ X3: chr [1:392] "894" "7057" "22906" "3339" ...
 $ X4: chr [1:838] "5573" "7453" "5245" "23450" ...
 $ X5: chr [1:929] "5982" "7318" "6352" "2101" ...
 $ X6: chr [1:585] "5337" "9295" "4035" "811" ...
 $ X7: chr [1:582] "2621" "2665" "5690" "3608" ...
 $ X8: chr [1:237] "2665" "4735" "1327" "3192" ...
#...Result      'data.frame':   76 obs. of  15 variables:
 $ Cluster       : Factor w/ 8 levels "X1","X2","X3",..: 2 2 2 3 3 3 4 4 4 4 ...
 $ category      : chr  "Human Diseases" "Human Diseases" "Cellular Processes" "Environmental Information Processing" ...
 $ subcategory   : chr  "Infectious disease: viral" "Immune disease" "Cell growth and death" "Signaling molecules and interaction" ...
 $ ID            : chr  "hsa05169" "hsa05340" "hsa04110" "hsa04512" ...
 $ Description   : chr  "Epstein-Barr virus infection" "Primary immunodeficiency" "Cell cycle" "ECM-receptor interaction" ...
 $ GeneRatio     : chr  "23/410" "8/410" "18/410" "9/195" ...
 $ BgRatio       : chr  "203/8842" "38/8842" "158/8842" "89/8842" ...
 $ RichFactor    : num  0.113 0.211 0.114 0.101 0.057 ...
 $ FoldEnrichment: num  2.44 4.54 2.46 4.59 2.59 ...
 $ zScore        : num  4.59 4.82 4.07 5.1 4.18 ...
 $ pvalue        : num  6.28e-05 2.86e-04 3.65e-04 1.42e-04 3.05e-04 ...
 $ p.adjust      : num  0.0198 0.0383 0.0383 0.0358 0.0358 ...
 $ qvalue        : num  0.0184 0.0357 0.0357 0.0328 0.0328 ...
 $ geneID        : chr  "LYN/ICAM1/TNFAIP3/E2F1/CCNA2/E2F3/BAK1/RUNX3/BID/IKBKE/TAP2/FAS/CCNE2/RELB/CD3E/ENTPD3/SYK/TRAF3/IKBKB/CD247/NEDD4/CD40/STAT1" "ADA/TAP2/LCK/CD79A/CD3E/CD8A/CD40/DCLRE1C" "CDC20/E2F1/CCNA2/E2F3/BUB1B/CDC6/DBF4/PKMYT1/CDC7/ESPL1/CCNE2/CDKN2A/SFN/BUB1/CHEK2/ORC6/CDC14B/MCM4" "THBS1/HSPG2/COL9A3/ITGB7/CHAD/ITGA7/LAMA4/ITGB8/ITGB5" ...
 $ Count         : int  23 8 18 9 17 19 17 10 22 19 ...
#.. number of enriched terms found for each gene cluster:
#..   X1: 0 
#..   X2: 3 
#..   X3: 3 
#..   X4: 21 
#..   X5: 9 
#..   X6: 1 
#..   X7: 18 
#..   X8: 21 
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, 
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. 
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. 
The Innovation. 2021, 2(3):100141 

> as.data.frame(ck)[1:3,]
  Cluster           category               subcategory       ID
1      X2     Human Diseases Infectious disease: viral hsa05169
2      X2     Human Diseases            Immune disease hsa05340
3      X2 Cellular Processes     Cell growth and death hsa04110
                   Description GeneRatio  BgRatio RichFactor FoldEnrichment
1 Epstein-Barr virus infection    23/410 203/8842  0.1133005       2.443422
2     Primary immunodeficiency     8/410  38/8842  0.2105263       4.540180
3                   Cell cycle    18/410 158/8842  0.1139241       2.456869
    zScore       pvalue   p.adjust     qvalue
1 4.587614 6.281651e-05 0.01978720 0.01844822
2 4.822302 2.860603e-04 0.03828932 0.03569832
3 4.074427 3.646602e-04 0.03828932 0.03569832
                                                                                                                         geneID
1 LYN/ICAM1/TNFAIP3/E2F1/CCNA2/E2F3/BAK1/RUNX3/BID/IKBKE/TAP2/FAS/CCNE2/RELB/CD3E/ENTPD3/SYK/TRAF3/IKBKB/CD247/NEDD4/CD40/STAT1
2                                                                                     ADA/TAP2/LCK/CD79A/CD3E/CD8A/CD40/DCLRE1C
3                          CDC20/E2F1/CCNA2/E2F3/BUB1B/CDC6/DBF4/PKMYT1/CDC7/ESPL1/CCNE2/CDKN2A/SFN/BUB1/CHEK2/ORC6/CDC14B/MCM4
  Count
1    23
2     8
3    18
> 

Compare with output from setReadable()

> ck.again <- compareCluster(geneCluster = gcSample, fun = enrichKEGG)
> ck.readable <- setReadable(ck.again, org.Hs.eg.db, "ENTREZID")
> 
> ck.readable
#
# Result of Comparing 8 gene clusters 
#
#.. @fun         enrichKEGG 
#.. @geneClusters       List of 8
 $ X1: chr [1:216] "4597" "7111" "5266" "2175" ...
 $ X2: chr [1:805] "23450" "5160" "7126" "26118" ...
 $ X3: chr [1:392] "894" "7057" "22906" "3339" ...
 $ X4: chr [1:838] "5573" "7453" "5245" "23450" ...
 $ X5: chr [1:929] "5982" "7318" "6352" "2101" ...
 $ X6: chr [1:585] "5337" "9295" "4035" "811" ...
 $ X7: chr [1:582] "2621" "2665" "5690" "3608" ...
 $ X8: chr [1:237] "2665" "4735" "1327" "3192" ...
#...Result      'data.frame':   76 obs. of  15 variables:
 $ Cluster       : Factor w/ 8 levels "X1","X2","X3",..: 2 2 2 3 3 3 4 4 4 4 ...
 $ category      : chr  "Human Diseases" "Human Diseases" "Cellular Processes" "Environmental Information Processing" ...
 $ subcategory   : chr  "Infectious disease: viral" "Immune disease" "Cell growth and death" "Signaling molecules and interaction" ...
 $ ID            : chr  "hsa05169" "hsa05340" "hsa04110" "hsa04512" ...
 $ Description   : chr  "Epstein-Barr virus infection" "Primary immunodeficiency" "Cell cycle" "ECM-receptor interaction" ...
 $ GeneRatio     : chr  "23/410" "8/410" "18/410" "9/195" ...
 $ BgRatio       : chr  "203/8842" "38/8842" "158/8842" "89/8842" ...
 $ RichFactor    : num  0.113 0.211 0.114 0.101 0.057 ...
 $ FoldEnrichment: num  2.44 4.54 2.46 4.59 2.59 ...
 $ zScore        : num  4.59 4.82 4.07 5.1 4.18 ...
 $ pvalue        : num  6.28e-05 2.86e-04 3.65e-04 1.42e-04 3.05e-04 ...
 $ p.adjust      : num  0.0198 0.0383 0.0383 0.0358 0.0358 ...
 $ qvalue        : num  0.0184 0.0357 0.0357 0.0328 0.0328 ...
 $ geneID        : chr  "LYN/ICAM1/TNFAIP3/E2F1/CCNA2/E2F3/BAK1/RUNX3/BID/IKBKE/TAP2/FAS/CCNE2/RELB/CD3E/ENTPD3/SYK/TRAF3/IKBKB/CD247/NEDD4/CD40/STAT1" "ADA/TAP2/LCK/CD79A/CD3E/CD8A/CD40/DCLRE1C" "CDC20/E2F1/CCNA2/E2F3/BUB1B/CDC6/DBF4/PKMYT1/CDC7/ESPL1/CCNE2/CDKN2A/SFN/BUB1/CHEK2/ORC6/CDC14B/MCM4" "THBS1/HSPG2/COL9A3/ITGB7/CHAD/ITGA7/LAMA4/ITGB8/ITGB5" ...
 $ Count         : int  23 8 18 9 17 19 17 10 22 19 ...
#.. number of enriched terms found for each gene cluster:
#..   X1: 0 
#..   X2: 3 
#..   X3: 3 
#..   X4: 21 
#..   X5: 9 
#..   X6: 1 
#..   X7: 18 
#..   X8: 21 
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, 
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. 
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. 
The Innovation. 2021, 2(3):100141 

> as.data.frame(ck.readable)[1:3,]
  Cluster           category               subcategory       ID
1      X2     Human Diseases Infectious disease: viral hsa05169
2      X2     Human Diseases            Immune disease hsa05340
3      X2 Cellular Processes     Cell growth and death hsa04110
                   Description GeneRatio  BgRatio RichFactor FoldEnrichment
1 Epstein-Barr virus infection    23/410 203/8842  0.1133005       2.443422
2     Primary immunodeficiency     8/410  38/8842  0.2105263       4.540180
3                   Cell cycle    18/410 158/8842  0.1139241       2.456869
    zScore       pvalue   p.adjust     qvalue
1 4.587614 6.281651e-05 0.01978720 0.01844822
2 4.822302 2.860603e-04 0.03828932 0.03569832
3 4.074427 3.646602e-04 0.03828932 0.03569832
                                                                                                                         geneID
1 LYN/ICAM1/TNFAIP3/E2F1/CCNA2/E2F3/BAK1/RUNX3/BID/IKBKE/TAP2/FAS/CCNE2/RELB/CD3E/ENTPD3/SYK/TRAF3/IKBKB/CD247/NEDD4/CD40/STAT1
2                                                                                     ADA/TAP2/LCK/CD79A/CD3E/CD8A/CD40/DCLRE1C
3                          CDC20/E2F1/CCNA2/E2F3/BUB1B/CDC6/DBF4/PKMYT1/CDC7/ESPL1/CCNE2/CDKN2A/SFN/BUB1/CHEK2/ORC6/CDC14B/MCM4
  Count
1    23
2     8
3    18
> 
kersng commented 3 months ago

Hi, thank you so much for the detail demonstration :)