Open TylerSagendorf opened 3 years ago
@TylerSagendorf Hi maybe this is an unrelated issue to your question, but which version did you use? I am using the latest version and I found getGOLevel() function was removed from this package. I have problem converting go id to go level.
@TylerSagendorf Hi maybe this is an unrelated issue to your question, but which version did you use? I am using the latest version and I found getGOLevel() function was removed from this package. I have problem converting go id to go level.
I am using clusterProfiler 4.0.0, but getGOLevel
is not an exported function. That is why I use the triple colon to access it. Also keep in mind that each ID may have multiple levels, depending on their relationship to other terms in the DAG.
Ty so much for this, as it is a very useful function! There are just 2 adjustmensts I had to do:
After that, it worked perfectly. Looking forward for a implementation in the package
thank you very much for this. I modified it to make it run with an output from enrichGO (ego) and figured I could put it here in case anyone needs it. it worked better than the default simplify
for me.:
onts <- c("BP", "CC", "MF")
ont_list <- lapply(onts, function(onts) {
lapply(1:14, function(level) {
clusterProfiler:::getGOLevel(onts, level)
})
})
names(ont_list) <- onts
ont <- ego@ontology # e.g., BP
ego@result %<>%
mutate(
GO_levels = sapply(ID, function(ID) {
which(unlist(lapply(ont_list[[ont]], function(s)
ID %in% s)))
}),
first_GO_level = unlist(lapply(GO_levels, min))
)
ego <- clusterProfiler::simplify(ego, cutoff = 0.7, by = "first_GO_level", select_fun = min)
@ryuhjinahn I've also begun work on an R package called TMSig
(https://github.com/EMSL-Computing/TMSig/) for dealing with molecular signatures (e.g., gene sets). It has a cluster_sets
function that can identify groups of highly similar sets based on one of 3 similarity metrics (Jaccard, Overlap/Simpson, and Ōtsuka). From there, users can define some combination of criteria to select a single set per cluster for analysis. I think clusterProfiler
has a function that allows one to input their own named list of gene sets, so that could be used; all of the clusterProfiler
functions are just wrappers around fgsea::fgsea
, so you could just use that, limma::cameraPR
, or TMSig::cameraPR.matrix
.
Rough example:
library(dplyr)
library(TMSig)
# Named list of gene sets (may be created via other methods).
# GMT files can be obtained from the Molecular Signatures Database (MSigDB).
gene_sets <- gmt_to_list("path_to_gmt_file.gmt")
# Character vector of genes from the dataset of interest
background <- unique(mydata$gene_symbol)
background <- background[!is.na(background)]
gene_sets_filt <- filter_sets(gene_sets,
background = background,
min_size = 5L)
clust_df <- cluster_sets(gene_sets_filt,
type = "jaccard",
cutoff = 0.85,
h = 0.9) %>%
# Add columns like location in the GO hierarchy,
# proportion of genes that were present in the background, etc.
mutate(...) %>%
# Sort by desired criteria so that the first row of each cluster
# is the gene set that we want to keep.
arrange(...) %>%
filter(!duplicated(cluster))
keep_sets <- clust_df$set
gene_sets_final <- gene_sets_filt[keep_sets]
# Now, gene_sets_final can be used with your analysis method of choice.
I'm trying to prepare this R package for Bioconductor, so the names of the functions and their arguments will likely change in the next major version.
I propose to develop a new function that uses a combination of
simplify
andgetGOLevel
to reduce the redundancy of GO terms. The main idea is that if two terms have a high similarity, the more general term will be retained. This is likely to select the parent GO terms. The code that I use for this is provided below, though it will need to be modified slightly. I also recommend saving the list of GO terms at each level and including it as an object in clusterProfiler to speed up computation on the user's end.