YuLab-SMU / clusterProfiler

:bar_chart: A universal enrichment tool for interpreting omics data
https://yulab-smu.top/biomedical-knowledge-mining-book/
1.02k stars 256 forks source link

New feature: simplify to parent GO terms #372

Open TylerSagendorf opened 3 years ago

TylerSagendorf commented 3 years ago

I propose to develop a new function that uses a combination of simplify and getGOLevel to reduce the redundancy of GO terms. The main idea is that if two terms have a high similarity, the more general term will be retained. This is likely to select the parent GO terms. The code that I use for this is provided below, though it will need to be modified slightly. I also recommend saving the list of GO terms at each level and including it as an object in clusterProfiler to speed up computation on the user's end.

# For each ontology, get a list of terms at each level from 1:14
# Note: This fails to get terms from level 15 and above for some reason.
# Levels are defined here: https://david.ncifcrf.gov/content.jsp?file=FAQs.html#9

onts <- c("BP", "CC", "MF")

# Create a list of 3 lists
ont_list <- lapply(onts, function(ont) {
  # For levels 1-14, get the GO terms
  lapply(1:14, function(level) {
    clusterProfiler:::getGOLevel(ont, level)
  })
})
# Set the names to easily access elements later
names(ont_list) <- onts 

# Recommended to save ont_list

# x is an object of class gseaResult, in this case
ont <- x@setType # Use x@setType for GSEA results and x@ontology for SEA results

# For each ID, get the levels where it appears.
# Also get the first (most general) level where it appears.
x %<>%
  mutate(
    # All levels
    GO_levels = sapply(ID, function(ID) {
      # Alternatively, use min(which(...)) to just get the most general level
      which(
        unlist(lapply(ont_list[[ont]], function(s) ID %in% s))
      )
    }),
    # Most general level (lowest value)
    first_GO_level = sapply(GO_levels, min)
  )

# If GO terms have a high similarity, select the more general one
x <- simplify(x, cutoff = 0.7, by = "first_GO_level", select_fun = min)
WJH58 commented 2 years ago

@TylerSagendorf Hi maybe this is an unrelated issue to your question, but which version did you use? I am using the latest version and I found getGOLevel() function was removed from this package. I have problem converting go id to go level.

TylerSagendorf commented 2 years ago

@TylerSagendorf Hi maybe this is an unrelated issue to your question, but which version did you use? I am using the latest version and I found getGOLevel() function was removed from this package. I have problem converting go id to go level.

I am using clusterProfiler 4.0.0, but getGOLevel is not an exported function. That is why I use the triple colon to access it. Also keep in mind that each ID may have multiple levels, depending on their relationship to other terms in the DAG.

Starahoush commented 2 years ago

Ty so much for this, as it is a very useful function! There are just 2 adjustmensts I had to do:

After that, it worked perfectly. Looking forward for a implementation in the package

ryuhjinahn commented 3 months ago

thank you very much for this. I modified it to make it run with an output from enrichGO (ego) and figured I could put it here in case anyone needs it. it worked better than the default simplify for me.:

onts <- c("BP", "CC", "MF")

ont_list <- lapply(onts, function(onts) {
  lapply(1:14, function(level) {
    clusterProfiler:::getGOLevel(onts, level)
  })
})

names(ont_list) <- onts

ont <- ego@ontology # e.g., BP

ego@result %<>%
  mutate(
    GO_levels = sapply(ID, function(ID) {
      which(unlist(lapply(ont_list[[ont]], function(s)
        ID %in% s)))
    }),
    first_GO_level = unlist(lapply(GO_levels, min))
  )

ego <- clusterProfiler::simplify(ego, cutoff = 0.7, by = "first_GO_level", select_fun = min)
TylerSagendorf commented 3 months ago

@ryuhjinahn I've also begun work on an R package called TMSig (https://github.com/EMSL-Computing/TMSig/) for dealing with molecular signatures (e.g., gene sets). It has a cluster_sets function that can identify groups of highly similar sets based on one of 3 similarity metrics (Jaccard, Overlap/Simpson, and Ōtsuka). From there, users can define some combination of criteria to select a single set per cluster for analysis. I think clusterProfiler has a function that allows one to input their own named list of gene sets, so that could be used; all of the clusterProfiler functions are just wrappers around fgsea::fgsea, so you could just use that, limma::cameraPR, or TMSig::cameraPR.matrix.

Rough example:

library(dplyr)
library(TMSig)

# Named list of gene sets (may be created via other methods). 
# GMT files can be obtained from the Molecular Signatures Database (MSigDB).
gene_sets <- gmt_to_list("path_to_gmt_file.gmt")

# Character vector of genes from the dataset of interest
background <- unique(mydata$gene_symbol)
background <- background[!is.na(background)]

gene_sets_filt <- filter_sets(gene_sets, 
                              background = background, 
                              min_size = 5L)

clust_df <- cluster_sets(gene_sets_filt, 
                         type = "jaccard", 
                         cutoff = 0.85, 
                         h = 0.9) %>%
  # Add columns like location in the GO hierarchy, 
  # proportion of genes that were present in the background, etc.
  mutate(...) %>%
  # Sort by desired criteria so that the first row of each cluster
  # is the gene set that we want to keep.
  arrange(...) %>%
  filter(!duplicated(cluster))

keep_sets <- clust_df$set

gene_sets_final <- gene_sets_filt[keep_sets]

# Now, gene_sets_final can be used with your analysis method of choice.

I'm trying to prepare this R package for Bioconductor, so the names of the functions and their arguments will likely change in the next major version.