How to implement treeplot to genes

YuLab-SMU / enrichplot

Visualization of Functional Enrichment Result

https://yulab-smu.top/biomedical-knowledge-mining-book/

How to implement treeplot to genes #241

Open Jwenyi opened 1 year ago

Jwenyi commented 1 year ago

Hi, I'm a PhD student from CityU, HK. We find your 'treeplot' very useful for gaining in-depth insight into large numbers of enriched GO terms. However, I'm curious about the method 'treeplot' uses to find the common ancestor of a group of GO terms, because we are trying to cluster genes based on their semantic similarity and then derive cluster-level terms that represent the shared biological role of each cluster. So I'm wondering whether treeplot could be modified for this. Best, Wenyi

altairwei commented 1 year ago

In fact, treeplot clusters GO terms using the similarity matrix from pairwise_termsim, which by default computes the Jaccard index between GO terms, i.e., the overlap between the sets of genes enriched in the two terms. However, pairwise_termsim can also calculate similarity based on GO semantics.
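To make the Jaccard-based clustering concrete, here is a minimal base-R sketch. It is not enrichplot's code: the `term_genes` list, the `jaccard` helper, the average-linkage choice, and `k = 2` are all assumptions made up for illustration.

```r
# Hypothetical toy data: enriched terms mapped to the genes enriched in them.
term_genes <- list(
  "DNA repair"      = c("BRCA1", "BRCA2", "RAD51", "ATM"),
  "DNA replication" = c("RAD51", "ATM", "MCM2", "PCNA"),
  "immune response" = c("IL6", "TNF", "CD4")
)

# Jaccard index: size of the intersection divided by size of the union.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Pairwise similarity matrix over all terms.
terms <- names(term_genes)
sim <- outer(
  seq_along(terms), seq_along(terms),
  Vectorize(function(i, j) jaccard(term_genes[[i]], term_genes[[j]]))
)
dimnames(sim) <- list(terms, terms)

# Hierarchical clustering on the distance 1 - similarity
# (linkage method and number of clusters are arbitrary here).
hc <- hclust(as.dist(1 - sim), method = "average")
clusters <- cutree(hc, k = 2)
print(round(sim, 2))
print(clusters)
```

The same clustering step would work on a semantic-similarity matrix, as long as its values lie between 0 and 1 so that `1 - sim` is a valid distance.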

Jwenyi commented 1 year ago

Thank you for your reply. I apologize if I didn't explain clearly, which led to some misunderstanding. I have read the source code of the treeplot module, and I did find that it first clusters GO terms based on semantic similarity and then provides a unique 'biological description' for each cluster. I would like to know how treeplot determines the 'biological description' for each cluster. Does it search for common parent nodes among these GO terms or use some other method?

altairwei commented 1 year ago

It's just a word cloud, see here:

```r
add_cladelab <- function(p, nWords, label_format_cladelab, 
                         offset, roots, 
                         fontsize, group_color, cluster_color, 
                         pdata, extend, hilight, align) {
    # align <- getOption("enriplot.treeplot.align", default = "both")
    cluster_label <- sapply(cluster_color, get_wordcloud, ggData = pdata,
                        nWords = nWords)
```
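As a rough illustration of the word-cloud idea, the sketch below labels a cluster with the most frequent words in its term descriptions. It is not enrichplot's `get_wordcloud` (which may weight words differently); `label_cluster`, its stop-word list, and the example descriptions are hypothetical.

```r
# Hypothetical helper, for illustration only; not enrichplot's get_wordcloud().
label_cluster <- function(descriptions, n_words = 4,
                          stop_words = c("of", "to", "in", "the", "and", "by", "via")) {
  # Split descriptions into lower-case words on any non-alphanumeric run.
  words <- unlist(strsplit(tolower(descriptions), "[^a-z0-9]+"))
  words <- words[nzchar(words) & !(words %in% stop_words)]
  freq <- table(words)
  # Sort by descending count, breaking ties alphabetically for reproducibility.
  ord <- order(-as.integer(freq), names(freq))
  paste(names(freq)[head(ord, n_words)], collapse = " ")
}

# Made-up term descriptions belonging to one cluster.
cluster_terms <- c(
  "regulation of DNA repair",
  "DNA repair",
  "double-strand break repair",
  "positive regulation of DNA repair"
)
label <- label_cluster(cluster_terms, n_words = 3)
print(label)
```

A label built this way reflects shared vocabulary only; it does not look up common parent nodes in the GO graph.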

I noticed that a better approach has been implemented in the {aPEAR} package; see https://doi.org/10.1101/2023.03.28.534514

Each cluster is assigned a biologically meaningful name. The most important pathway in each cluster is determined using either PageRank (Page et al. 1999) (default) or HITS (Kleinberg 1999) algorithm that examines the connectivity within the cluster and detects the most important pathway. The description of this pathway is used as the name of the cluster.
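The PageRank-based naming quoted above can be sketched in base R as a power iteration over a weighted similarity graph. This is not aPEAR's implementation: the `pagerank` function, the damping factor of 0.85, and the similarity values are assumptions chosen for illustration.

```r
# Generic PageRank by power iteration on a symmetric similarity matrix.
pagerank <- function(sim, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  diag(sim) <- 0                       # ignore self-similarity
  n <- nrow(sim)
  col_sums <- colSums(sim)
  # Column-normalise; nodes with no links spread their weight uniformly.
  trans <- sweep(sim, 2, ifelse(col_sums > 0, col_sums, 1), "/")
  trans[, col_sums == 0] <- 1 / n
  rank <- rep(1 / n, n)
  for (i in seq_len(max_iter)) {
    new_rank <- (1 - damping) / n + damping * as.vector(trans %*% rank)
    converged <- sum(abs(new_rank - rank)) < tol
    rank <- new_rank
    if (converged) break
  }
  names(rank) <- rownames(sim)
  rank
}

# Made-up similarities between four pathways in one cluster.
pathways <- c("DNA repair", "double-strand break repair",
              "mismatch repair", "telomere maintenance")
sim <- matrix(c(1.0, 0.6, 0.5, 0.4,
                0.6, 1.0, 0.1, 0.0,
                0.5, 0.1, 1.0, 0.0,
                0.4, 0.0, 0.0, 1.0),
              nrow = 4, byrow = TRUE, dimnames = list(pathways, pathways))

scores <- pagerank(sim)
cluster_name <- names(which.max(scores))  # best-connected pathway names the cluster
print(round(scores, 3))
print(cluster_name)
```

In this toy matrix "DNA repair" is similar to every other pathway, so it receives the highest score and would be used as the cluster name.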

huerqiang commented 1 year ago

We are trying to find a more appropriate way to summarize the information in each cluster. However, directly using the most significant pathway name as the cluster name may lose a lot of information. If you have any better suggestions, you are welcome to discuss them with us, and we will use them to improve treeplot, emapplot_cluster, and so on. Thanks.