YuLab-SMU / ProjectYulab

Small coding tasks that enable you to participate in our development

Cluster name of emapplot_cluster() in enrichplot package #7

Open huerqiang opened 1 year ago

huerqiang commented 1 year ago

We now use wordcloud as the cluster name of emapplot_cluster().

    library(DOSE)
    data(geneList)
    de <- names(geneList)[1:100]
    x <- enrichDO(de)
    x2 <- pairwise_termsim(x)
    emapplot_cluster(x2)

But it is not good enough; see https://github.com/YuLab-SMU/enrichplot/issues/241#issuecomment-1543504229

Please give a better way to display cluster information. You can get the code of wordcloud here: https://github.com/YuLab-SMU/enrichplot/blob/master/R/wordcloud.R

Potato-tudou commented 1 year ago

Question: is there an easy way to extract the clustering information from emapplot()? I've been struggling with this for days... 😢😢😢

Potato-tudou commented 1 year ago

In theory, how understandable a cluster is depends on the number of keywords displayed: the more keywords are shown, the easier the cluster is to interpret. So I think we can leave the choice to users and let them decide how many keywords to show. Here's my example:

    library(DOSE)
    library(enrichplot)
    library(reshape2)
    library(igraph)
    library(magrittr)
    data(geneList)
    de <- names(geneList)[1:100]
    x <- enrichDO(de)
    x2 <- pairwise_termsim(x)

    ## pairwise term similarity as a long table, dropping self-pairs, NAs and zeros
    x3 <- as.data.frame(x2)
    x4 <- x2@termsim[as.character(x3$Description), as.character(x3$Description)]
    w <- melt(x4)
    wd <- w[as.character(w[, 1]) != as.character(w[, 2]), ] %>% na.omit()
    wd <- wd[wd$value != 0, ]
    ## build the similarity graph
    g <- graph_from_data_frame(wd[, -3], directed = FALSE)
    E(g)$value <- wd[, 3]
    ## calculate the number of clusters
    centers_g <- ceiling(sqrt(nrow(x4)))
    set.seed(123)  # kmeans() starts from random centers, so fix the seed
    k_means <- kmeans(as_adjacency_matrix(g, sparse = FALSE), centers = centers_g)
    ## get the term names of a certain cluster
    info_n <- k_means$cluster[k_means$cluster == 3] %>% names()  # the 3rd cluster, for instance

    ## borrowing the word frequency function from @huerqiang
    get_word_freq <- function(wordd) {
      dada <- strsplit(wordd, " ")
      didi <- table(unlist(dada))
      didi <- didi[order(didi, decreasing = TRUE)]
      ## for each word, count the terms that contain it
      word_name <- names(didi)
      fun_num_w <- function(ww) {
        sum(vapply(dada, function(w) ww %in% w, FUN.VALUE = logical(1)))
      }
      word_num <- vapply(word_name, fun_num_w, FUN.VALUE = numeric(1))
      word_num[order(word_num, decreasing = TRUE)]
    }

    ## how many keywords do you want to show? take the top 80% as an example
    word_freq <- get_word_freq(info_n)
    info_cluster <- names(word_freq)[seq_len(max(1, floor(0.8 * length(word_freq))))]

It's still not perfect, but it gives a clearer clue for understanding the cluster information.
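The same idea could be applied to every cluster at once rather than only the 3rd one. Below is a rough base-R sketch of that, not enrichplot's implementation. The helpers `word_freq()` and `cluster_keywords()` and the `prop` argument are hypothetical names, and `toy_cluster` is made-up data standing in for `k_means$cluster` (term descriptions as names, cluster ids as values).

```r
# Toy stand-in for k_means$cluster; these terms are invented for illustration only.
toy_cluster <- c(
  "breast cancer" = 1,
  "ovarian cancer" = 1,
  "lung disease" = 2,
  "interstitial lung disease" = 2,
  "lung cancer" = 2
)

# Count in how many terms each word occurs, most frequent first.
word_freq <- function(terms) {
  words <- lapply(strsplit(terms, " "), unique)
  sort(table(unlist(words)), decreasing = TRUE)
}

# For every cluster, keep the top `prop` share of its words (at least one word).
cluster_keywords <- function(cluster, prop = 0.8) {
  lapply(split(names(cluster), cluster), function(terms) {
    freq <- word_freq(terms)
    n_keep <- max(1, floor(prop * length(freq)))
    names(freq)[seq_len(n_keep)]
  })
}

keywords <- cluster_keywords(toy_cluster, prop = 0.5)
```

With the real data one would presumably call `cluster_keywords(k_means$cluster, prop = 0.8)`. A smaller `prop` gives shorter labels, a larger one gives more context per cluster.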