Open kersng opened 4 months ago
In case you were not able to solve this yet:
The function setReadable
is indeed normally used to do this, but as you said it depends on an OrgDb
. Yet, the nice thing of open-source software is that... the source (code) is open to everyone!
So you could lookup the code of setReadable
and tweak it according to your use case. See below for an example. Note that setReadable
is part of DOSE
.
https://github.com/YuLab-SMU/DOSE/blob/devel/R/setReadable.R
Key is line 53 of setReadable
:
https://github.com/YuLab-SMU/DOSE/blob/98301d9860df116cd89852c660cd0fd132efac2e/R/setReadable.R#L53
gn <- EXTID2NAME(OrgDb, genes, keyType))
In this line 53 the gene ids (= genes
) are converted in symbols, and stored in object gn
. Note that gn
is just a simple character
vector (with names
that are the gene ids!).
Thus: if you create such character vector, independent from an OrgDb
, then you are set!
To illustrate the above let's get started by generating a compareClusterResult
using the example human dataset; note that I generate a compareClusterResult
for KEGG-based gene sets, but in order to show the concept that doesn't matter. The essence is that the input are entrezid
, that are then converted into gene symbols.
> ## load libraries
> library(clusterProfiler)
> library(org.Hs.eg.db)
>
> ##load example data
> data(gcSample)
>
> ## run enrichKEGG on all clusters
> ck <- compareCluster(geneCluster = gcSample, fun = enrichKEGG)
>
> ## check output
> ck
#
# Result of Comparing 8 gene clusters
#
#.. @fun enrichKEGG
#.. @geneClusters List of 8
$ X1: chr [1:216] "4597" "7111" "5266" "2175" ...
$ X2: chr [1:805] "23450" "5160" "7126" "26118" ...
$ X3: chr [1:392] "894" "7057" "22906" "3339" ...
$ X4: chr [1:838] "5573" "7453" "5245" "23450" ...
$ X5: chr [1:929] "5982" "7318" "6352" "2101" ...
$ X6: chr [1:585] "5337" "9295" "4035" "811" ...
$ X7: chr [1:582] "2621" "2665" "5690" "3608" ...
$ X8: chr [1:237] "2665" "4735" "1327" "3192" ...
#...Result 'data.frame': 76 obs. of 15 variables:
$ Cluster : Factor w/ 8 levels "X1","X2","X3",..: 2 2 2 3 3 3 4 4 4 4 ...
$ category : chr "Human Diseases" "Human Diseases" "Cellular Processes" "Environmental Information Processing" ...
$ subcategory : chr "Infectious disease: viral" "Immune disease" "Cell growth and death" "Signaling molecules and interaction" ...
$ ID : chr "hsa05169" "hsa05340" "hsa04110" "hsa04512" ...
$ Description : chr "Epstein-Barr virus infection" "Primary immunodeficiency" "Cell cycle" "ECM-receptor interaction" ...
$ GeneRatio : chr "23/410" "8/410" "18/410" "9/195" ...
$ BgRatio : chr "203/8842" "38/8842" "158/8842" "89/8842" ...
$ RichFactor : num 0.113 0.211 0.114 0.101 0.057 ...
$ FoldEnrichment: num 2.44 4.54 2.46 4.59 2.59 ...
$ zScore : num 4.59 4.82 4.07 5.1 4.18 ...
$ pvalue : num 6.28e-05 2.86e-04 3.65e-04 1.42e-04 3.05e-04 ...
$ p.adjust : num 0.0198 0.0383 0.0383 0.0358 0.0358 ...
$ qvalue : num 0.0184 0.0357 0.0357 0.0328 0.0328 ...
$ geneID : chr "4067/3383/7128/1869/890/1871/578/864/637/9641/6891/355/9134/5971/916/956/6850/7187/3551/919/4734/958/6772" "100/6891/3932/973/916/925/958/64421" "991/1869/890/1871/701/990/10926/9088/8317/9700/9134/1029/2810/699/11200/23594/8555/4173" "7057/3339/1299/3695/1101/3679/3910/3696/3693" ...
$ Count : int 23 8 18 9 17 19 17 10 22 19 ...
#.. number of enriched terms found for each gene cluster:
#.. X1: 0
#.. X2: 3
#.. X3: 3
#.. X4: 21
#.. X5: 9
#.. X6: 1
#.. X7: 18
#.. X8: 21
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou,
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
The Innovation. 2021, 2(3):100141
>
The challenge is now to be able to manually convert the gene ids into symbols.
To do this a character vector is needed that 'links' the gene ids to symbols (i.e. my object id2symbol
). Note that the names of the vector are the gene ids.
Note also that since the example dataset are human entrezids
, the human OrgDb
can be used to generate such character vector. This is what I do now. In your case you should generate such id-to-symbol vector manually.
> ## create id-to-symbol mapping vector for all gene ids
> ## first extract all gene ids from annotation database
> all.genes <- keys(org.Hs.eg.db)
>
> ## then map all.genes (entrezid) to symbols
> ## note that the rownames of id2symbol are the (input) entrezids
> ## within the function setReadable 'id2symbol' corresponds to the output of the function 'EXTID2NAME', but then for all genes
> ## 'gn' is generated on line 53 (gn <- EXTID2NAME(OrgDb, genes, keyType) )
> id2symbol <- mapIds(org.Hs.eg.db, keys=all.genes, column="SYMBOL", keytype="ENTREZID")
'select()' returned 1:1 mapping between keys and columns
>
> class(id2symbol)
[1] "character"
> head(id2symbol)
1 2 3 9 10 11
"A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "NATP"
> tail(id2symbol)
133395148 133395149 133395150 133834869 134006662 134198642
"SKP1P4" "AMANZI" "LNCARGI" "MLDHR" "SSU72-AS1" "PSS3"
>
Continue by running step-by-step the relevant parts of the code of setReadable
:
> ## extract all gene ids from result object
> ## note that function geneInCategory is part of DOSE
> ## line 38 code
> gc <- geneInCategory(ck)
>
> ## check whether compareCluster is based on ORA or GSEA analysis
> ## in case of latter, column named "core_enrichment" is present in result object
> ## lines 42-48
> if ("core_enrichment" %in% colnames(as.data.frame(ck))) {
+ geneslist <- ck@geneClusters
+ names(geneslist) <- NULL
+ genes <- unique(names(unlist(geneslist)))
+ } else {
+ genes <- unique(unlist(ck@geneClusters))
+ }
>
Now the key step: modify line 53 to use id2symbol
for all genes
in the result object.
> # gn <- EXTID2NAME(org.Hs.eg.db, genes, keyType) ## original code line 53
> gn <- id2symbol[genes]
Continue with code.
> ## lines 57-69
> gc2 <- list()
> k <- 1
> for(i in seq_len(length(gc))) {
+ for(j in seq_len(length(gc[[i]]))) {
+ gc2[[k]] <- gc[[i]][[j]]
+ names(gc2)[k] <- paste(names(gc)[[i]], names(gc[[i]])[j], sep="-")
+ k <- k + 1
+ }
+ }
> gc <- gc2
> gc <- lapply(gc, function(i) gn[i])
> res <- ck@compareClusterResult
> gc <- gc[paste(res$Cluster, res$ID, sep= "-")]
> ## prepare converted ids to be put on result object
> ## line 81
> geneID <- sapply(gc, paste0, collapse="/")
>
> ## line 86
> res$geneID <- unlist(geneID)
>
> ## complete the output that added to the returned object
> ## lines 88-92
> ck@gene2Symbol <- gn
> ck@keytype <- keyType
> ck@readable <- TRUE
> ck@compareClusterResult <- res
>
Check:
> ck
#
# Result of Comparing 8 gene clusters
#
#.. @fun enrichKEGG
#.. @geneClusters List of 8
$ X1: chr [1:216] "4597" "7111" "5266" "2175" ...
$ X2: chr [1:805] "23450" "5160" "7126" "26118" ...
$ X3: chr [1:392] "894" "7057" "22906" "3339" ...
$ X4: chr [1:838] "5573" "7453" "5245" "23450" ...
$ X5: chr [1:929] "5982" "7318" "6352" "2101" ...
$ X6: chr [1:585] "5337" "9295" "4035" "811" ...
$ X7: chr [1:582] "2621" "2665" "5690" "3608" ...
$ X8: chr [1:237] "2665" "4735" "1327" "3192" ...
#...Result 'data.frame': 76 obs. of 15 variables:
$ Cluster : Factor w/ 8 levels "X1","X2","X3",..: 2 2 2 3 3 3 4 4 4 4 ...
$ category : chr "Human Diseases" "Human Diseases" "Cellular Processes" "Environmental Information Processing" ...
$ subcategory : chr "Infectious disease: viral" "Immune disease" "Cell growth and death" "Signaling molecules and interaction" ...
$ ID : chr "hsa05169" "hsa05340" "hsa04110" "hsa04512" ...
$ Description : chr "Epstein-Barr virus infection" "Primary immunodeficiency" "Cell cycle" "ECM-receptor interaction" ...
$ GeneRatio : chr "23/410" "8/410" "18/410" "9/195" ...
$ BgRatio : chr "203/8842" "38/8842" "158/8842" "89/8842" ...
$ RichFactor : num 0.113 0.211 0.114 0.101 0.057 ...
$ FoldEnrichment: num 2.44 4.54 2.46 4.59 2.59 ...
$ zScore : num 4.59 4.82 4.07 5.1 4.18 ...
$ pvalue : num 6.28e-05 2.86e-04 3.65e-04 1.42e-04 3.05e-04 ...
$ p.adjust : num 0.0198 0.0383 0.0383 0.0358 0.0358 ...
$ qvalue : num 0.0184 0.0357 0.0357 0.0328 0.0328 ...
$ geneID : chr "LYN/ICAM1/TNFAIP3/E2F1/CCNA2/E2F3/BAK1/RUNX3/BID/IKBKE/TAP2/FAS/CCNE2/RELB/CD3E/ENTPD3/SYK/TRAF3/IKBKB/CD247/NEDD4/CD40/STAT1" "ADA/TAP2/LCK/CD79A/CD3E/CD8A/CD40/DCLRE1C" "CDC20/E2F1/CCNA2/E2F3/BUB1B/CDC6/DBF4/PKMYT1/CDC7/ESPL1/CCNE2/CDKN2A/SFN/BUB1/CHEK2/ORC6/CDC14B/MCM4" "THBS1/HSPG2/COL9A3/ITGB7/CHAD/ITGA7/LAMA4/ITGB8/ITGB5" ...
$ Count : int 23 8 18 9 17 19 17 10 22 19 ...
#.. number of enriched terms found for each gene cluster:
#.. X1: 0
#.. X2: 3
#.. X3: 3
#.. X4: 21
#.. X5: 9
#.. X6: 1
#.. X7: 18
#.. X8: 21
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou,
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
The Innovation. 2021, 2(3):100141
> as.data.frame(ck)[1:3,]
Cluster category subcategory ID
1 X2 Human Diseases Infectious disease: viral hsa05169
2 X2 Human Diseases Immune disease hsa05340
3 X2 Cellular Processes Cell growth and death hsa04110
Description GeneRatio BgRatio RichFactor FoldEnrichment
1 Epstein-Barr virus infection 23/410 203/8842 0.1133005 2.443422
2 Primary immunodeficiency 8/410 38/8842 0.2105263 4.540180
3 Cell cycle 18/410 158/8842 0.1139241 2.456869
zScore pvalue p.adjust qvalue
1 4.587614 6.281651e-05 0.01978720 0.01844822
2 4.822302 2.860603e-04 0.03828932 0.03569832
3 4.074427 3.646602e-04 0.03828932 0.03569832
geneID
1 LYN/ICAM1/TNFAIP3/E2F1/CCNA2/E2F3/BAK1/RUNX3/BID/IKBKE/TAP2/FAS/CCNE2/RELB/CD3E/ENTPD3/SYK/TRAF3/IKBKB/CD247/NEDD4/CD40/STAT1
2 ADA/TAP2/LCK/CD79A/CD3E/CD8A/CD40/DCLRE1C
3 CDC20/E2F1/CCNA2/E2F3/BUB1B/CDC6/DBF4/PKMYT1/CDC7/ESPL1/CCNE2/CDKN2A/SFN/BUB1/CHEK2/ORC6/CDC14B/MCM4
Count
1 23
2 8
3 18
>
Compare with output from setReadable()
> ck.again <- compareCluster(geneCluster = gcSample, fun = enrichKEGG)
> ck.readable <- setReadable(ck.again, org.Hs.eg.db, "ENTREZID")
>
> ck.readable
#
# Result of Comparing 8 gene clusters
#
#.. @fun enrichKEGG
#.. @geneClusters List of 8
$ X1: chr [1:216] "4597" "7111" "5266" "2175" ...
$ X2: chr [1:805] "23450" "5160" "7126" "26118" ...
$ X3: chr [1:392] "894" "7057" "22906" "3339" ...
$ X4: chr [1:838] "5573" "7453" "5245" "23450" ...
$ X5: chr [1:929] "5982" "7318" "6352" "2101" ...
$ X6: chr [1:585] "5337" "9295" "4035" "811" ...
$ X7: chr [1:582] "2621" "2665" "5690" "3608" ...
$ X8: chr [1:237] "2665" "4735" "1327" "3192" ...
#...Result 'data.frame': 76 obs. of 15 variables:
$ Cluster : Factor w/ 8 levels "X1","X2","X3",..: 2 2 2 3 3 3 4 4 4 4 ...
$ category : chr "Human Diseases" "Human Diseases" "Cellular Processes" "Environmental Information Processing" ...
$ subcategory : chr "Infectious disease: viral" "Immune disease" "Cell growth and death" "Signaling molecules and interaction" ...
$ ID : chr "hsa05169" "hsa05340" "hsa04110" "hsa04512" ...
$ Description : chr "Epstein-Barr virus infection" "Primary immunodeficiency" "Cell cycle" "ECM-receptor interaction" ...
$ GeneRatio : chr "23/410" "8/410" "18/410" "9/195" ...
$ BgRatio : chr "203/8842" "38/8842" "158/8842" "89/8842" ...
$ RichFactor : num 0.113 0.211 0.114 0.101 0.057 ...
$ FoldEnrichment: num 2.44 4.54 2.46 4.59 2.59 ...
$ zScore : num 4.59 4.82 4.07 5.1 4.18 ...
$ pvalue : num 6.28e-05 2.86e-04 3.65e-04 1.42e-04 3.05e-04 ...
$ p.adjust : num 0.0198 0.0383 0.0383 0.0358 0.0358 ...
$ qvalue : num 0.0184 0.0357 0.0357 0.0328 0.0328 ...
$ geneID : chr "LYN/ICAM1/TNFAIP3/E2F1/CCNA2/E2F3/BAK1/RUNX3/BID/IKBKE/TAP2/FAS/CCNE2/RELB/CD3E/ENTPD3/SYK/TRAF3/IKBKB/CD247/NEDD4/CD40/STAT1" "ADA/TAP2/LCK/CD79A/CD3E/CD8A/CD40/DCLRE1C" "CDC20/E2F1/CCNA2/E2F3/BUB1B/CDC6/DBF4/PKMYT1/CDC7/ESPL1/CCNE2/CDKN2A/SFN/BUB1/CHEK2/ORC6/CDC14B/MCM4" "THBS1/HSPG2/COL9A3/ITGB7/CHAD/ITGA7/LAMA4/ITGB8/ITGB5" ...
$ Count : int 23 8 18 9 17 19 17 10 22 19 ...
#.. number of enriched terms found for each gene cluster:
#.. X1: 0
#.. X2: 3
#.. X3: 3
#.. X4: 21
#.. X5: 9
#.. X6: 1
#.. X7: 18
#.. X8: 21
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou,
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
The Innovation. 2021, 2(3):100141
> as.data.frame(ck.readable)[1:3,]
Cluster category subcategory ID
1 X2 Human Diseases Infectious disease: viral hsa05169
2 X2 Human Diseases Immune disease hsa05340
3 X2 Cellular Processes Cell growth and death hsa04110
Description GeneRatio BgRatio RichFactor FoldEnrichment
1 Epstein-Barr virus infection 23/410 203/8842 0.1133005 2.443422
2 Primary immunodeficiency 8/410 38/8842 0.2105263 4.540180
3 Cell cycle 18/410 158/8842 0.1139241 2.456869
zScore pvalue p.adjust qvalue
1 4.587614 6.281651e-05 0.01978720 0.01844822
2 4.822302 2.860603e-04 0.03828932 0.03569832
3 4.074427 3.646602e-04 0.03828932 0.03569832
geneID
1 LYN/ICAM1/TNFAIP3/E2F1/CCNA2/E2F3/BAK1/RUNX3/BID/IKBKE/TAP2/FAS/CCNE2/RELB/CD3E/ENTPD3/SYK/TRAF3/IKBKB/CD247/NEDD4/CD40/STAT1
2 ADA/TAP2/LCK/CD79A/CD3E/CD8A/CD40/DCLRE1C
3 CDC20/E2F1/CCNA2/E2F3/BUB1B/CDC6/DBF4/PKMYT1/CDC7/ESPL1/CCNE2/CDKN2A/SFN/BUB1/CHEK2/ORC6/CDC14B/MCM4
Count
1 23
2 8
3 18
>
Hi, thank you so much for the detail demonstration :)
Hi, I have my own GO annotation data with multiple treatment groups for comparison, and so I did the comparecluster with 'enricher' function. I would like to visualize the result in cnetplot with gene name. Do you know how to change the gene ID into the gene name in this case? I tried the 'setReadable' function but it was not working since my microorganism is not from the Org database.
f=compareCluster(GeneID~group+treatment, data = big_df, fun="enricher", TERM2GENE = go_gene, TERM2NAME = go_terms, pvalueCutoff = 0.05, pAdjustMethod = "BH")
Thank you! :)