YuLab-SMU / clusterProfiler

:bar_chart: A universal enrichment tool for interpreting omics data
https://yulab-smu.top/biomedical-knowledge-mining-book/
1.02k stars 256 forks source link

Enricher - universe calculation unclear? #636

Open christianmaueroeder opened 1 year ago

christianmaueroeder commented 1 year ago

Hi,

thanks for providing clusterProfiler, this package works like a charm for enrichment analysis.

However, I experience the same issue, that is already mentioned a while ago here: https://github.com/YuLab-SMU/clusterProfiler/issues/283, but has not been adressed in detail.

When using Enricher on a user-defined geneset with, for example, 300 genes and defining a a user-specified background containing all genes that are expressed in the particular experimental setting, for example 10000 genes, the intersection of the geneset and the background is used as universe. For example, if only 250 of the genes from the geneset are found in the background, then these 250 genes are used as the universe.

This does not seem to be correct to me, what is the reason to use the intersection, thus, the 250 genes, not the 10000 genes for statistical testing? Other enrichment tools definitely refer to the backgrounnd, thus , the 10000 genes.

Thank you so much for your help!

Cheers

Christian

guidohooiveld commented 1 year ago

This is an interesting question, as is the thread https://github.com/YuLab-SMU/clusterProfiler/issues/283 you referred to.

My thoughts: For an over-representation analysis (ORA) a contingency (2 way frequency) table is created for analysis by Fisher''s exact test. This table basically list the distribution of the combination of significant (yes/no) genes vs membership (yes/no) of a gene set.

            in gene set    NOT in gene set   TOTAL
DE               n11             n12          n1x
not DE           n21             n22          n2x

TOTAL            nx1             nx2          n

I think one could argue that the genes to be used for creating such contingency table should be limited to the genes that are in the universe, but also in (at least) one of the gene sets. Therefore the intersection is taken of the universe and gene set genes.

If for example a gene is in the universe (say, gene Z), but the gene sets only contain info on genes A - N, where to count it in the contingency table? Similarly, in case of the reverse: the gene sets now also contain additional gene ZZ, but the universe doesn't contain gene ZZ, where to count gene ZZ? In both cases taking the intersection will allow the 'placement' of all (remaining) genes in the contingency table.

A much more detailed example on creating a contingency table for ORA can e.g. be found here: https://carpentries-incubator.github.io/bioc-rnaseq/instructor/06-gene-set-analysis.html

[[EDIT]] On second thought I think I could answer my own question, and thereby invalidate my reasoning why the intersection should be used....! Gene Z should be counted in cell n12 or n22 (depending whether it is DE or not). Gene ZZ cannot be placed in any cell, and should therefore be filtered before ORA.

So the intersection should not be taken, but rather the gene sets should be filtered so that these will only contain the genes that actually are in the provided universe (i.e. have been measured in an experiment)... Right?

Note that for completeness I decided to leave my original reasoning in this post rather than change or delete it..

christianmaueroeder commented 1 year ago

I agree with your reasoning. I am neither good at coding nor do I excel at statistics, but the approach used by enricher is, in my opinion, flawed and not in line with what is currently used by most people doing hypergeometric testing for overrepresentation analysis. The intrinsic logic of intersecting is questionable as well: I have a manually curated gene set of 300 genes, from which 50 overlap with my DEG list (which is roughly 2000 genes). My universe consists of roughly 10000 genes, which are expressed by the cell type of interest, however the intersection will reduce it to 250 genes, which is the intersection with the geneset. If I would add additional genesets (such as GO BP) with other biological processes, the intersection with would yield in more genes and give me more power. That's also why some people at https://github.com/YuLab-SMU/clusterProfiler/issues/283 suggested to define the universe as a geneset, which is some sort of workaround, but not a very good one.

guidohooiveld commented 1 year ago

I had some time available to think and read more about this...

In my post above I linked to the online lesson on "RNA-seq analysis with Bioconductor". Initially I did not appreciate it, but after re-reading the page I noticed a section is dedicated on how to choose a proper universe (https://carpentries-incubator.github.io/bioc-rnaseq/instructor/06-gene-set-analysis.html#choose-a-proper-universe).

The instructors of that lesson also noticed that when a universe is provided, within clusterProfiler it is intersected with all genes in the gene set collection, a choice they find very conservative since small universes tend to generate large (=insignificant) p-values.

In enrichGO()/enrichKEGG()/enricher(), universe genes can be set via the universe argument. By default the universe is the total genes in the gene sets collection. When a self-defined universe is provided, this might be different from what you may think, the universe is the intersection of user-provided universe and total genes in the gene set collection. Thus the universe setting in clusterProfiler is very conservative.

Another thing that I found out is that the package fgsea (link), which is used under the hood by clusterProfiler for the actual execution of GSEA, also has implemented an ORA function (named fgseaORA (link)). In their implementation for the calculations the user-provided universe is used, which is NOT intersected with all genes in the gene set collection: https://github.com/ctlab/fgsea/blob/ecfb89c55d938ca84605f1a1380a8cedd70f1ccf/R/fgseaORA.R#L64-L68 Moreover, genes in the gene set collection that are not in the provided universe are filtered: https://github.com/ctlab/fgsea/blob/ecfb89c55d938ca84605f1a1380a8cedd70f1ccf/R/fgseaORA.R#L37C54-L37C56 ... as are genes that are in the input, but not in the universe: https://github.com/ctlab/fgsea/blob/ecfb89c55d938ca84605f1a1380a8cedd70f1ccf/R/fgseaORA.R#L55-L56

Note that the latter (filtering of input genes by the universe) may not always be wanted, and this has been addressed by clusterProfiler by the implementation of the option enrichment_force_universe https://github.com/YuLab-SMU/DOSE/issues/74.

Along the same line, in the lesson "RNA-seq analysis with Bioconductor" the calculation of significance for the Fisher's exact test a.k.a. hypergeometric distribution is explained in detail, and subsequently implemented in a rudimentary ora function in which the universe is NOT intersected with the genes in the gene set collection (https://carpentries-incubator.github.io/bioc-rnaseq/instructor/06-gene-set-analysis.html#further-reading-4).

Idem, for example, here.

Thus, @huerqiang and @GuangchuangYu : what is your opinion on this? Considering all of the above and thread https://github.com/YuLab-SMU/clusterProfiler/issues/283; would it be good not to intersect the universe with the genes in the gene set anymore, but just rather use the provided universe as is (and use it to filter the genes in the gene sets)? Like as implemented in fgsea? Or include/provide an argument when calling any of clusterProfiler's ORA functions to disable the intersection (e.g. intersectWithUniverse, with default setting =TRUE in order to keep backwards compatibility)?

Although not directly related to the way clusterProfiler handles the universe, including information in papers on which universe is used is important and getting more and more attention. See e.g. these papers: https://doi.org/10.1371/journal.pcbi.1009935 and https://doi.org/10.1186/s13059-015-0761-7. In this context, this Twitter thread (mirror) is also enlightening (the before mentioned papers are also mentioned), as is this review paper (https://doi.org/10.1002/ggn2.202200024), especially section 7.3 and Figure 7.

guidohooiveld commented 11 months ago

@huerqiang , @GuangchuangYu : posting again as a gentle reminder; since I (we) would appreciate hearing your opinion on this. TIA.

christianmaueroeder commented 11 months ago

@huerqiang , @GuangchuangYu : posting again as a gentle reminder; since I (we) would appreciate hearing your opinion on this. TIA.

I have used the fsea_ora for the moment and it works great. I have also tried to adjust the code in the clusterprofiler package, or to be more precise, the underlying enricher_internal code, however with my limited knowledge, I was not able to obtain similar results as with fsea_ora (p-values were way too low). For the moment, I am sticking with fsea_ora, however an implementation in clusterprofiler would be nice so that the plotting options are available.

SamGG commented 8 months ago

@guidohooiveld @christianmaueroeder Thanks for this discussion. I am also sensitive to the universe definition. I will read the mentioned literature. I agree to intersect the gene list and the gene sets with the universe, but I don't see any reason to intersect the universe with the gene sets, which will reduce the universe that was observed in the experiment. In fact, I am currently worried by the behavior when no universe is given; no problem with Gene Ontology, as the gene sets cover all genes, but what about smaller universes/databases...

GuangchuangYu commented 6 months ago

Thanks for the discussions, especially the detailed information provided by @guidohooiveld .

The reason to intersect is based on the assumption that the gene sets provided is an annotation at genome-level. Those in the universe that are not appear in the gene sets are genes without annotation and will not be tested. For example, there are only ~9k genes in KEGG database for human and I believe that the universe should not be larger than this number of genes.

It is very easy to report significant gene sets, and I want it to be more conservative in clusterProfiler.

I think the intersect is reasonable. However, as I mentioned previously, this is based on the assumption that the provided gene sets is at genome-level. This assumption maynot be hold in some cases and as many of you have such a concern, we introduce an option via options(enrichment_force_universe = TRUE) to force the universe untouched.

Comments and suggestions are welcome.