YuLab-SMU / clusterProfiler

:bar_chart: A universal enrichment tool for interpreting omics data
https://yulab-smu.top/biomedical-knowledge-mining-book/
1.01k stars 254 forks source link

Issue defining "universe" in enricher function #283

Open GretchenSeim opened 4 years ago

GretchenSeim commented 4 years ago

Currently trying to perform an enrichment using enricher and a self-curated TERM2GENE list and self defined universe of genes. However even though I am specifying a universe it seems to be pulling the default universe - aka the list of genes in the TERM2GENE list. I have ~8000 genes in my universe and it is just pulling the 30 or so in m TERM2GENE list. Wondering if there is something I'm doing wrong or if there is some sort of bug. There isn't any sort of error message - I can just see in the enrichment result that the universe is much smaller than the 8000 or so genes I'd like it to be.

defining my universe

universe2<-rownames(DEs)

this results in a character vector of 8563

then I run the enricher as such

kk2<-enricher(gene=rownames(cluster2), pvalueCutoff = 0.05, universe=universe2, TERM2GENE=TERM2GENE)

and my enrichment result shows a universe of only 52 - the number of genes in my TERM2GENE list

apredeus commented 3 years ago

I have similar problem - I think perhaps "universe" definition is working in a way I don't understand?

I've defined universe parameter to all protein-coding genes in my annotation (about 20k symbols), and then ran enrichment vs. MsigDB Hallmark dataset. However, the background was only 4313 genes - exactly the size of the intersection between the "universe" vector and all symbols in H database.

This is changing p-values a lot, so it would be nice to sort it out.

Jeff-Gui commented 2 years ago

Yes I agree with both of you. The universe is just an intersection between the specified vector and all genes in the "TERM2GENE" list. My guess is that the null distribution of p values are generated from randomly sampling the universe genes and then computing the overlap to genes belonging to a term.

Weihankk commented 2 years ago

I also have the same problem. I have ~27K genes as universe genes, ~25k of them have GO annotation terms and as TERM2GENE, then I choose ~10k interesting genes to conduct run enricher. However, the denominator of bgratio is ~25k, and the denominator of generatio is ~ 9k which is the intersect of ~10k interesting genes and ~25k have GO terms. May I ask if this is normal and correct?

Weihan

avishai987 commented 2 years ago

Maybe it's more accurate to limit the background to genes that are mapped to the database? like some say here: https://www.biostars.org/p/439428/

RaymondSHANG commented 2 years ago

Hi, Prof.Yu, I do have the same issue when using enricher to do the over presentation analysis. The current enricher/DOSE::enricher_internal() used modified hypergeometric test with genes in the (TERM2GENE) as background. This is debatable as discussed:

  1. https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009935
  2. https://academic.oup.com/bib/article/22/1/545/5722384

Would be great if we could define our own bg/universe genes in the new versions, rather than the intersect of ourbg with unique(TERM2GENE$gene). If there are really something we need to trim, perhaps we could trim TERM2GENE, removing genes not appeared in the universe, rather than the opposite. As a clusterProfile User, we occasional analyze pathway level expressions using our customized genesets, which might only contain several pathways, and hundreds of genes. universe defined as all genes in DEG analysis, rather than that in TERM2GENE, make more sense.

Best regards,

huerqiang commented 2 years ago

@RaymondSHANG Please read the doc of enricher():

universe | background genes. If missing, the all genes listed in the database (eg TERM2GENE table) will be used as background. -- | --

RaymondSHANG commented 2 years ago

Hi,@huerqiang, Thanks for the reply. I checked the R functions for enricher(). enricher() --> enricher_internal --> DOSE:::enricher_internal(). And, in DOSE:enricher_internal.R, lines 61-68:

if(!is.null(universe)) {
        if (is.character(universe)) {
            extID <- intersect(extID, universe)
        } else {
            ## https://github.com/YuLab-SMU/clusterProfiler/issues/217
            message("`universe` is not in character and will be ignored...")
        }
}

And line 100:

N <- rep(length(extID), length(M))

When you got intersect(extID, universe), even though universe is provided, only the intersect is selected as the background. Based on my current understanding, N should be:

N <- rep(length(universe), length(M))

I think all above posts in the issue are talking about this.

Thank you,

RaymondSHANG commented 2 years ago

I think recalculating N in DOSE will solve the problem.

N <- rep(length(universe), length(M))

TERM2GENE is a kind of global variable, change TERM2GENE may result changes elsewhere in the whole package.

On Thu, Oct 20, 2022 at 9:46 AM Graeme Grimes @.***> wrote:

Would adding all the genes in the universe as a term in the term2gene database fix this? e.g.

define a TERM2GENE data.frame then add another term including all genes in universe

t2uni<-data.frame(term="universe",gene=universe) TERM2GENE=rbind(TERM2GENE,t2uni)

— Reply to this email directly, view it on GitHub https://github.com/YuLab-SMU/clusterProfiler/issues/283#issuecomment-1285861708, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE6GPCGYR5QDEUMPMXA7QVLWEFZPLANCNFSM4O3IJJ5Q . You are receiving this because you were mentioned.Message ID: @.***>

-- Raymond "Yuan" SHANG, Ph.D.

ariadnaaterrades commented 1 year ago

Hi,

We have also faced with this problem. From our point of view, it would be correct to filter the genesets (gmt) to obtain those genes which are present also in your universe. On the other hand, filtering the universe with the genes in the genesets (intersection between universe and gmt) will greatly affect the p value obtained due to the fact that you are reducing the universe (N) to the geneset collection (eg. GO has many annotated genes ~25K, but theoretically a gmt could have only one geneset). A quick solution for the time being, not the best one though, could be adding your universe to your gmt as a pathway, so all the genes in your universe will be part of the term2gene and the intersection will not affect your universe size nor your significance level.

It would be nice to find a solution to be able to avoid obtaining underestimated significance.

Thank you, Ariadna

christianmaueroeder commented 1 year ago

Hi all,

I am experiencing the same problem at the moment. Is this going to be fixed or is it necessary to adjust manually, as suggested by @RaymondSHANG ?

How did you proceed in this case?

Cheers

SamGG commented 7 months ago

@RaymondSHANG I led to the same conclusion. Do you plan a PR to DOSE?

GuangchuangYu commented 6 months ago

If you want to keep the universe untouched, aka, not performing intersection with genes that have annotations, you can force it to do so via options(enrichment_force_universe = TRUE).

Please try it and give me feedback if there is something not work as expected.

christianmaueroeder commented 6 months ago

If you want to keep the universe untouched, aka, not performing intersection with genes that have annotations, you can force it to do so via options(enrichment_force_universe = TRUE).

Please try it and give me feedback if there is something not work as expected.

Thanks for this option, however, I think what is done now is taking all genes that are stored in OrgDb irrespective of the values specified by universe. For example, after using org.Mm.eg.db, I get a background of 28891 genes, thus, all mouse genes and not only the genes that are expressed in my target cells.

GuangchuangYu commented 6 months ago
> require(clusterProfiler)
> data(geneList, package="DOSE")
> de = names(geneList)[1:100]
> length(geneList)
[1] 12495
> sample(names(geneList), 8000) -> bg
>
> x = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP", universe=bg)
> head(x, 2)
                   ID                Description GeneRatio  BgRatio
GO:1903047 GO:1903047 mitotic cell cycle process     29/99 390/7388
GO:0000278 GO:0000278         mitotic cell cycle     31/99 465/7388
                 pvalue     p.adjust       qvalue
GO:1903047 9.513903e-15 1.211324e-11 1.126778e-11
GO:0000278 1.856229e-14 1.211324e-11 1.126778e-11
                                                                                                                                                                        geneID    
GO:1903047              8318/55143/991/2305/4605/9833/10403/6241/55165/9787/51203/22974/4751/27338/890/983/4085/9837/5080/81930/3832/2146/64151/9212/1111/9319/3833/51514/6790    
GO:0000278 8318/55143/991/2305/4605/9833/10403/79733/6241/55165/9787/220134/51203/22974/4751/27338/890/983/4085/9837/5080/81930/3832/2146/64151/9212/1111/9319/3833/51514/6790    
           Count
GO:1903047    29
GO:0000278    31

@christianmaueroeder The universe parameter should work for enrichGO as demonstrated above.

GuangchuangYu commented 6 months ago

see also my explanation in https://github.com/YuLab-SMU/clusterProfiler/issues/636#issuecomment-2073994557.

SamGG commented 6 months ago

Thanks for your feedback. If you don't mind, let's use a reproducible example for discussion.

library(DOSE)
#> 
#> DOSE v3.29.3  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
#> 
#> If you use DOSE in published research, please cite:
#> Guangchuang Yu, Li-Gen Wang, Guang-Rong Yan, Qing-Yu He. DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 2015, 31(4):608-609
library(clusterProfiler)
#> clusterProfiler v4.10.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
#> 
#> If you use clusterProfiler in published research, please cite:
#> T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
#> 
#> Attaching package: 'clusterProfiler'
#> The following object is masked from 'package:stats':
#> 
#>     filter

options(width = 100)  # characters width for reprex

# generate data
data(geneList, package="DOSE")
de = names(geneList)[1:100]
length(geneList)
#> [1] 12495
set.seed(1)
bg = c(de, sample(names(geneList)[-(1:100)], 7900, replace = FALSE))
length(unique(bg))
#> [1] 8000

# universe with intersect
options(enrichment_force_universe=FALSE)  # do intersect
getOption("enrichment_force_universe")
#> [1] FALSE
x = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP", universe=bg)
#> 
x[1:6, 1:5]
#>                    ID                    Description GeneRatio  BgRatio       pvalue
#> GO:0007059 GO:0007059         chromosome segregation     34/99 205/7404 5.613697e-29
#> GO:0000278 GO:0000278             mitotic cell cycle     44/99 470/7404 3.274607e-27
#> GO:0098813 GO:0098813 nuclear chromosome segregation     30/99 161/7404 4.844323e-27
#> GO:1903047 GO:1903047     mitotic cell cycle process     41/99 400/7404 1.011285e-26
#> GO:0000280 GO:0000280               nuclear division     32/99 207/7404 3.366601e-26
#> GO:0000819 GO:0000819   sister chromatid segregation     27/99 127/7404 6.121863e-26

# no universe
y = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP")
y[1:6, 1:5]
#>                    ID                          Description GeneRatio   BgRatio       pvalue
#> GO:0007059 GO:0007059               chromosome segregation     34/99 424/18870 2.432364e-31
#> GO:0098813 GO:0098813       nuclear chromosome segregation     30/99 312/18870 6.474483e-30
#> GO:0000819 GO:0000819         sister chromatid segregation     27/99 225/18870 1.557846e-29
#> GO:0000070 GO:0000070 mitotic sister chromatid segregation     25/99 184/18870 9.747875e-29
#> GO:0140014 GO:0140014             mitotic nuclear division     28/99 274/18870 1.225729e-28
#> GO:0000280 GO:0000280                     nuclear division     32/99 441/18870 4.745944e-28

# universe without intersect
options(enrichment_force_universe=TRUE)  # no intersect
getOption("enrichment_force_universe")
#> [1] TRUE
z = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP", universe=bg)
z[1:6, 1:5]
#>                    ID                          Description GeneRatio   BgRatio       pvalue
#> GO:0007059 GO:0007059               chromosome segregation     34/99 424/18870 2.432364e-31
#> GO:0098813 GO:0098813       nuclear chromosome segregation     30/99 312/18870 6.474483e-30
#> GO:0000819 GO:0000819         sister chromatid segregation     27/99 225/18870 1.557846e-29
#> GO:0000070 GO:0000070 mitotic sister chromatid segregation     25/99 184/18870 9.747875e-29
#> GO:0140014 GO:0140014             mitotic nuclear division     28/99 274/18870 1.225729e-28
#> GO:0000280 GO:0000280                     nuclear division     32/99 441/18870 4.745944e-28

# size of the universe is not intersected
options(enrichment_force_universe=FALSE)  # do intersect
options(enrichment_N_from_universe = TRUE)  # size untouched
assignInNamespace("enricher_internal", DOSE:::enricher_internal, getNamespace("clusterProfiler"))

w = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP", universe=bg)
#> N_from_universe is 8000 vs 7404
w[1:6, 1:5]
#>                    ID                    Description GeneRatio  BgRatio       pvalue
#> GO:0007059 GO:0007059         chromosome segregation     34/99 205/8000 4.493552e-30
#> GO:0000278 GO:0000278             mitotic cell cycle     44/99 470/8000 1.378001e-28
#> GO:0098813 GO:0098813 nuclear chromosome segregation     30/99 161/8000 5.177092e-28
#> GO:1903047 GO:1903047     mitotic cell cycle process     41/99 400/8000 5.216023e-28
#> GO:0000280 GO:0000280               nuclear division     32/99 207/8000 3.167084e-27
#> GO:0000819 GO:0000819   sister chromatid segregation     27/99 127/8000 8.103928e-27

Created on 2024-04-25 with reprex v2.1.0

When a bg is provided, the bg is filtered before computing the enrichment, here 7404 => x. When no bg is provided, or, a bg is provided and no intersect is set, the full set is used as background, here 18870 => y and z. What I expect is that the bg is taken as is, here 8000 => w.

In the end, on this example, there is no big difference between x and w, except the p-value.

Maybe I am wrong, but let's discuss. @RaymondSHANG @christianmaueroeder @guidohooiveld

To be noticed, the reported length of the gene list is 99 instead of 100.

GuangchuangYu commented 5 months ago

thanks @SamGG . I find a bug and it has been fixed, see https://github.com/YuLab-SMU/DOSE/commit/84d6860dd71e6f49fe36a5357a85622dfcc4d497.

When options(enrichment_force_universe=TRUE) # no intersect, the full set should be 8000.

> options(enrichment_force_universe=TRUE)  # no intersect
> getOption("enrichment_force_universe")
[1] TRUE
> #> [1] TRUE
> z = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP", universe=bg)

> z[1:6, 1:5]
                   ID                    Description GeneRatio  BgRatio
GO:0007059 GO:0007059         chromosome segregation     34/99 209/8000
GO:1903047 GO:1903047     mitotic cell cycle process     42/99 393/8000
GO:0098813 GO:0098813 nuclear chromosome segregation     31/99 163/8000
GO:0000278 GO:0000278             mitotic cell cycle     44/99 462/8000
GO:0000280 GO:0000280               nuclear division     33/99 210/8000
GO:0000819 GO:0000819   sister chromatid segregation     28/99 129/8000
                 pvalue
GO:0007059 8.899575e-30
GO:1903047 1.644771e-29
GO:0098813 2.910847e-29
GO:0000278 6.604743e-29
GO:0000280 2.373501e-28
GO:0000819 4.282627e-28
>
SamGG commented 5 months ago

It sounds great. I will do some checks on my own and come back if I don't understand a point. Thanks for your attention.