Open GretchenSeim opened 4 years ago
I have similar problem - I think perhaps "universe" definition is working in a way I don't understand?
I've defined universe parameter to all protein-coding genes in my annotation (about 20k symbols), and then ran enrichment vs. MsigDB Hallmark dataset. However, the background was only 4313 genes - exactly the size of the intersection between the "universe" vector and all symbols in H database.
This is changing p-values a lot, so it would be nice to sort it out.
Yes I agree with both of you. The universe is just an intersection between the specified vector and all genes in the "TERM2GENE" list. My guess is that the null distribution of p values are generated from randomly sampling the universe genes and then computing the overlap to genes belonging to a term.
I also have the same problem. I have ~27K genes as universe genes, ~25k of them have GO annotation terms and as TERM2GENE, then I choose ~10k interesting genes to conduct run enricher. However, the denominator of bgratio is ~25k, and the denominator of generatio is ~ 9k which is the intersect of ~10k interesting genes and ~25k have GO terms. May I ask if this is normal and correct?
Weihan
Maybe it's more accurate to limit the background to genes that are mapped to the database? like some say here: https://www.biostars.org/p/439428/
Hi, Prof.Yu, I do have the same issue when using enricher to do the over presentation analysis. The current enricher/DOSE::enricher_internal() used modified hypergeometric test with genes in the (TERM2GENE) as background. This is debatable as discussed:
Would be great if we could define our own bg/universe genes in the new versions, rather than the intersect of ourbg with unique(TERM2GENE$gene). If there are really something we need to trim, perhaps we could trim TERM2GENE, removing genes not appeared in the universe, rather than the opposite. As a clusterProfile User, we occasional analyze pathway level expressions using our customized genesets, which might only contain several pathways, and hundreds of genes. universe defined as all genes in DEG analysis, rather than that in TERM2GENE, make more sense.
Best regards,
@RaymondSHANG Please read the doc of enricher():
universe | background genes. If missing, the all genes listed in the database (eg TERM2GENE table) will be used as background. -- | --Hi,@huerqiang, Thanks for the reply. I checked the R functions for enricher(). enricher() --> enricher_internal --> DOSE:::enricher_internal(). And, in DOSE:enricher_internal.R, lines 61-68:
if(!is.null(universe)) {
if (is.character(universe)) {
extID <- intersect(extID, universe)
} else {
## https://github.com/YuLab-SMU/clusterProfiler/issues/217
message("`universe` is not in character and will be ignored...")
}
}
And line 100:
N <- rep(length(extID), length(M))
When you got intersect(extID, universe), even though universe is provided, only the intersect is selected as the background. Based on my current understanding, N should be:
N <- rep(length(universe), length(M))
I think all above posts in the issue are talking about this.
Thank you,
I think recalculating N in DOSE will solve the problem.
N <- rep(length(universe), length(M))
TERM2GENE is a kind of global variable, change TERM2GENE may result changes elsewhere in the whole package.
On Thu, Oct 20, 2022 at 9:46 AM Graeme Grimes @.***> wrote:
Would adding all the genes in the universe as a term in the term2gene database fix this? e.g.
define a TERM2GENE data.frame then add another term including all genes in universe
t2uni<-data.frame(term="universe",gene=universe) TERM2GENE=rbind(TERM2GENE,t2uni)
— Reply to this email directly, view it on GitHub https://github.com/YuLab-SMU/clusterProfiler/issues/283#issuecomment-1285861708, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE6GPCGYR5QDEUMPMXA7QVLWEFZPLANCNFSM4O3IJJ5Q . You are receiving this because you were mentioned.Message ID: @.***>
-- Raymond "Yuan" SHANG, Ph.D.
Hi,
We have also faced with this problem. From our point of view, it would be correct to filter the genesets (gmt) to obtain those genes which are present also in your universe. On the other hand, filtering the universe with the genes in the genesets (intersection between universe and gmt) will greatly affect the p value obtained due to the fact that you are reducing the universe (N) to the geneset collection (eg. GO has many annotated genes ~25K, but theoretically a gmt could have only one geneset). A quick solution for the time being, not the best one though, could be adding your universe to your gmt as a pathway, so all the genes in your universe will be part of the term2gene and the intersection will not affect your universe size nor your significance level.
It would be nice to find a solution to be able to avoid obtaining underestimated significance.
Thank you, Ariadna
Hi all,
I am experiencing the same problem at the moment. Is this going to be fixed or is it necessary to adjust manually, as suggested by @RaymondSHANG ?
How did you proceed in this case?
Cheers
@RaymondSHANG I led to the same conclusion. Do you plan a PR to DOSE?
If you want to keep the universe
untouched, aka, not performing intersection with genes that have annotations, you can force it to do so via options(enrichment_force_universe = TRUE)
.
Please try it and give me feedback if there is something not work as expected.
If you want to keep the
universe
untouched, aka, not performing intersection with genes that have annotations, you can force it to do so viaoptions(enrichment_force_universe = TRUE)
.Please try it and give me feedback if there is something not work as expected.
Thanks for this option, however, I think what is done now is taking all genes that are stored in OrgDb irrespective of the values specified by universe. For example, after using org.Mm.eg.db, I get a background of 28891 genes, thus, all mouse genes and not only the genes that are expressed in my target cells.
> require(clusterProfiler)
> data(geneList, package="DOSE")
> de = names(geneList)[1:100]
> length(geneList)
[1] 12495
> sample(names(geneList), 8000) -> bg
>
> x = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP", universe=bg)
> head(x, 2)
ID Description GeneRatio BgRatio
GO:1903047 GO:1903047 mitotic cell cycle process 29/99 390/7388
GO:0000278 GO:0000278 mitotic cell cycle 31/99 465/7388
pvalue p.adjust qvalue
GO:1903047 9.513903e-15 1.211324e-11 1.126778e-11
GO:0000278 1.856229e-14 1.211324e-11 1.126778e-11
geneID
GO:1903047 8318/55143/991/2305/4605/9833/10403/6241/55165/9787/51203/22974/4751/27338/890/983/4085/9837/5080/81930/3832/2146/64151/9212/1111/9319/3833/51514/6790
GO:0000278 8318/55143/991/2305/4605/9833/10403/79733/6241/55165/9787/220134/51203/22974/4751/27338/890/983/4085/9837/5080/81930/3832/2146/64151/9212/1111/9319/3833/51514/6790
Count
GO:1903047 29
GO:0000278 31
@christianmaueroeder The universe
parameter should work for enrichGO
as demonstrated above.
see also my explanation in https://github.com/YuLab-SMU/clusterProfiler/issues/636#issuecomment-2073994557.
Thanks for your feedback. If you don't mind, let's use a reproducible example for discussion.
library(DOSE)
#>
#> DOSE v3.29.3 For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
#>
#> If you use DOSE in published research, please cite:
#> Guangchuang Yu, Li-Gen Wang, Guang-Rong Yan, Qing-Yu He. DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 2015, 31(4):608-609
library(clusterProfiler)
#> clusterProfiler v4.10.0 For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
#>
#> If you use clusterProfiler in published research, please cite:
#> T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
#>
#> Attaching package: 'clusterProfiler'
#> The following object is masked from 'package:stats':
#>
#> filter
options(width = 100) # characters width for reprex
# generate data
data(geneList, package="DOSE")
de = names(geneList)[1:100]
length(geneList)
#> [1] 12495
set.seed(1)
bg = c(de, sample(names(geneList)[-(1:100)], 7900, replace = FALSE))
length(unique(bg))
#> [1] 8000
# universe with intersect
options(enrichment_force_universe=FALSE) # do intersect
getOption("enrichment_force_universe")
#> [1] FALSE
x = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP", universe=bg)
#>
x[1:6, 1:5]
#> ID Description GeneRatio BgRatio pvalue
#> GO:0007059 GO:0007059 chromosome segregation 34/99 205/7404 5.613697e-29
#> GO:0000278 GO:0000278 mitotic cell cycle 44/99 470/7404 3.274607e-27
#> GO:0098813 GO:0098813 nuclear chromosome segregation 30/99 161/7404 4.844323e-27
#> GO:1903047 GO:1903047 mitotic cell cycle process 41/99 400/7404 1.011285e-26
#> GO:0000280 GO:0000280 nuclear division 32/99 207/7404 3.366601e-26
#> GO:0000819 GO:0000819 sister chromatid segregation 27/99 127/7404 6.121863e-26
# no universe
y = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP")
y[1:6, 1:5]
#> ID Description GeneRatio BgRatio pvalue
#> GO:0007059 GO:0007059 chromosome segregation 34/99 424/18870 2.432364e-31
#> GO:0098813 GO:0098813 nuclear chromosome segregation 30/99 312/18870 6.474483e-30
#> GO:0000819 GO:0000819 sister chromatid segregation 27/99 225/18870 1.557846e-29
#> GO:0000070 GO:0000070 mitotic sister chromatid segregation 25/99 184/18870 9.747875e-29
#> GO:0140014 GO:0140014 mitotic nuclear division 28/99 274/18870 1.225729e-28
#> GO:0000280 GO:0000280 nuclear division 32/99 441/18870 4.745944e-28
# universe without intersect
options(enrichment_force_universe=TRUE) # no intersect
getOption("enrichment_force_universe")
#> [1] TRUE
z = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP", universe=bg)
z[1:6, 1:5]
#> ID Description GeneRatio BgRatio pvalue
#> GO:0007059 GO:0007059 chromosome segregation 34/99 424/18870 2.432364e-31
#> GO:0098813 GO:0098813 nuclear chromosome segregation 30/99 312/18870 6.474483e-30
#> GO:0000819 GO:0000819 sister chromatid segregation 27/99 225/18870 1.557846e-29
#> GO:0000070 GO:0000070 mitotic sister chromatid segregation 25/99 184/18870 9.747875e-29
#> GO:0140014 GO:0140014 mitotic nuclear division 28/99 274/18870 1.225729e-28
#> GO:0000280 GO:0000280 nuclear division 32/99 441/18870 4.745944e-28
# size of the universe is not intersected
options(enrichment_force_universe=FALSE) # do intersect
options(enrichment_N_from_universe = TRUE) # size untouched
assignInNamespace("enricher_internal", DOSE:::enricher_internal, getNamespace("clusterProfiler"))
w = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP", universe=bg)
#> N_from_universe is 8000 vs 7404
w[1:6, 1:5]
#> ID Description GeneRatio BgRatio pvalue
#> GO:0007059 GO:0007059 chromosome segregation 34/99 205/8000 4.493552e-30
#> GO:0000278 GO:0000278 mitotic cell cycle 44/99 470/8000 1.378001e-28
#> GO:0098813 GO:0098813 nuclear chromosome segregation 30/99 161/8000 5.177092e-28
#> GO:1903047 GO:1903047 mitotic cell cycle process 41/99 400/8000 5.216023e-28
#> GO:0000280 GO:0000280 nuclear division 32/99 207/8000 3.167084e-27
#> GO:0000819 GO:0000819 sister chromatid segregation 27/99 127/8000 8.103928e-27
Created on 2024-04-25 with reprex v2.1.0
When a bg is provided, the bg is filtered before computing the enrichment, here 7404 => x. When no bg is provided, or, a bg is provided and no intersect is set, the full set is used as background, here 18870 => y and z. What I expect is that the bg is taken as is, here 8000 => w.
In the end, on this example, there is no big difference between x and w, except the p-value.
Maybe I am wrong, but let's discuss. @RaymondSHANG @christianmaueroeder @guidohooiveld
To be noticed, the reported length of the gene list is 99 instead of 100.
thanks @SamGG . I find a bug and it has been fixed, see https://github.com/YuLab-SMU/DOSE/commit/84d6860dd71e6f49fe36a5357a85622dfcc4d497.
When options(enrichment_force_universe=TRUE) # no intersect
, the full set should be 8000.
> options(enrichment_force_universe=TRUE) # no intersect
> getOption("enrichment_force_universe")
[1] TRUE
> #> [1] TRUE
> z = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP", universe=bg)
> z[1:6, 1:5]
ID Description GeneRatio BgRatio
GO:0007059 GO:0007059 chromosome segregation 34/99 209/8000
GO:1903047 GO:1903047 mitotic cell cycle process 42/99 393/8000
GO:0098813 GO:0098813 nuclear chromosome segregation 31/99 163/8000
GO:0000278 GO:0000278 mitotic cell cycle 44/99 462/8000
GO:0000280 GO:0000280 nuclear division 33/99 210/8000
GO:0000819 GO:0000819 sister chromatid segregation 28/99 129/8000
pvalue
GO:0007059 8.899575e-30
GO:1903047 1.644771e-29
GO:0098813 2.910847e-29
GO:0000278 6.604743e-29
GO:0000280 2.373501e-28
GO:0000819 4.282627e-28
>
It sounds great. I will do some checks on my own and come back if I don't understand a point. Thanks for your attention.
Currently trying to perform an enrichment using enricher and a self-curated TERM2GENE list and self defined universe of genes. However even though I am specifying a universe it seems to be pulling the default universe - aka the list of genes in the TERM2GENE list. I have ~8000 genes in my universe and it is just pulling the 30 or so in m TERM2GENE list. Wondering if there is something I'm doing wrong or if there is some sort of bug. There isn't any sort of error message - I can just see in the enrichment result that the universe is much smaller than the 8000 or so genes I'd like it to be.
defining my universe
universe2<-rownames(DEs)
this results in a character vector of 8563
then I run the enricher as such
kk2<-enricher(gene=rownames(cluster2), pvalueCutoff = 0.05, universe=universe2, TERM2GENE=TERM2GENE)
and my enrichment result shows a universe of only 52 - the number of genes in my TERM2GENE list