YuLab-SMU / clusterProfiler

A universal enrichment tool for interpreting omics data
https://yulab-smu.top/biomedical-knowledge-mining-book/

gseGO: p-values and gseaplot #25

Closed mevers closed 8 years ago

mevers commented 8 years ago

Dear Guangchuang.

I have come across two issues that perhaps you can clarify.

I perform a GSEA analysis within clusterProfiler using

res.GSEA.GO <- gseGO(geneList = geneList, organism = "human", exponent = 1, ont = "BP", nPerm = 1000, minGSSize = 15, pvalueCutoff = 0.01, verbose = TRUE)

1.) All GO terms in the resulting table seem to have the same p-value, adjusted p-value, and q-value. For example, note the entries in the last column (qvalues = 0.00470813780684201):

| ID | Description | setSize | enrichmentScore | pvalue | p.adjust | qvalues |
|---|---|---|---|---|---|---|
| GO:0000070 | mitotic sister chromatid segregation | 104 | 0.183705167498055 | 0.000999000999000999 | 0.00691405858579111 | 0.00470813780684201 |
| GO:0000075 | cell cycle checkpoint | 244 | 0.1685954733772 | 0.000999000999000999 | 0.00691405858579111 | 0.00470813780684201 |
| GO:0000077 | DNA damage checkpoint | 152 | 0.189584910513727 | 0.000999000999000999 | 0.00691405858579111 | 0.00470813780684201 |
| GO:0000082 | G1/S transition of mitotic cell cycle | 242 | 0.148857665664357 | 0.000999000999000999 | 0.00691405858579111 | 0.00470813780684201 |

The results are similar for other ontologies.
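
A side note on the repeated p-value: 0.000999000999000999 equals 1/1001. That is the smallest value an empirical p-value can take with nPerm = 1000 if it is computed with the common (b + 1) / (nPerm + 1) convention. Whether gseGO uses exactly this convention is an assumption here, not something this thread confirms. A minimal check of the arithmetic:

```r
# Assumption: empirical p-value = (b + 1) / (nPerm + 1), where b is the number
# of permutations that score at least as extreme as the observed value.
nPerm <- 1000
b <- 0                          # no permutation beats the observed score
min_p <- (b + 1) / (nPerm + 1)  # smallest attainable p-value under this convention
print(format(min_p, digits = 15))
```

If every reported term sits at this floor, the p-values cannot distinguish between terms, which is consistent with the identical adjusted p-values and q-values in the table.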

2.) All GSEA plots seem to have a discontinuity in the "phenotype" curves. See e.g. http://imgur.com/AcH49Bh.

Any help in resolving these issues would be greatly appreciated.

Best, Maurits

GuangchuangYu commented 8 years ago

Can you send sample data that reproduces this issue to gcyu@hku.hk?

mevers commented 8 years ago

I sent a sample file and data to your email address. Thanks, Maurits

GuangchuangYu commented 8 years ago

This happens because your input geneList is not sorted.

geneList <- sort(geneList, decreasing = TRUE)

will fix the issue.

I have updated the source code so that if the user supplies an unsorted geneList, it will stop with an error; see https://github.com/GuangchuangYu/DOSE/commit/5a7bc077d72e03d451a02b197f01bf3905431a02.
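
The sorting requirement can be checked before calling gseGO. The sketch below uses only base R. The gene IDs and values are made up for illustration, and the helper `is_sorted_decreasing` is a hypothetical name; it mirrors the idea of the DOSE change (refuse unsorted input) but is not the code from that commit.

```r
# Hypothetical named numeric vector: names are gene IDs, values are the ranking metric.
geneList <- c("1001" = 0.4, "1002" = 2.1, "1003" = -1.3, "1004" = 0.9)

# TRUE if the vector is in non-increasing order (ties are allowed).
is_sorted_decreasing <- function(x) !is.unsorted(rev(x))

if (!is_sorted_decreasing(geneList)) {
  # sort() keeps each name attached to its value.
  geneList <- sort(geneList, decreasing = TRUE)
}

stopifnot(is_sorted_decreasing(geneList))
print(geneList)
```

Sorting the named vector directly, rather than sorting the values and names separately, keeps each gene ID paired with its own score.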

mevers commented 8 years ago

Dear Guangchuang.

Thanks for the quick update. Yes, this seems to fix the issue with the constant p-values. There still remains the issue with the discontinuities in the phenotype correlation plots. See e.g. term GO:0000070 (attached). Any advice?

Thanks, Maurits

On Tue, Sep 22, 2015 at 1:45 PM, Guangchuang Yu notifications@github.com wrote:

Closed #25 https://github.com/GuangchuangYu/clusterProfiler/issues/25.


GuangchuangYu commented 8 years ago

Your geneList is weird, with many identical values. This may be the reason.

> table(geneList) %>% as.data.frame %>% subset(., Freq > 500)
   geneList Freq
54     0.86  526
56     0.88  534
58      0.9  584
60     0.92  570
62     0.94  528
66     0.98  522
68        1  506
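
The tie check above relies on the `%>%` pipe, which needs a package such as magrittr to be loaded. A base R sketch of the same idea follows; the toy vector and the threshold of 2 (rather than 500) are illustrative assumptions, not values from the reporter's data.

```r
# Hypothetical ranking metric with repeated values.
geneList <- c(0.86, 0.86, 0.86, 0.88, 0.90, 0.90, 1.00)

# Frequency of each distinct value, as a data frame with columns geneList and Freq.
freq <- as.data.frame(table(geneList), stringsAsFactors = FALSE)
ties <- subset(freq, Freq >= 2)   # values that occur more than once
print(ties)

# Fraction of entries that share their value with at least one other entry.
tied_fraction <- mean(duplicated(geneList) | duplicated(geneList, fromLast = TRUE))
print(tied_fraction)
```

A large tied fraction means the ranking among many genes is arbitrary, so it is worth knowing before interpreting a running-score plot.
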

mevers commented 8 years ago

Dear Guangchuang.

I'm sorry but that is a very poor excuse. The discontinuity of the phenotype correlation plots suggests to me that this is a numerical issue in your code. It looks like a branch cut in the function you use to plot the phenotype correlation curves that occurs if the values of the ranking metric are not distributed around zero.

Broad's GSEA-P does not seem to have this issue, so it is definitely not an issue with the data.

Best, Maurits

GuangchuangYu commented 8 years ago

The weird thing is that I couldn't replicate your issue. It shouldn't happen if your input geneList is sorted.

Can you save the object, res.GSEA.GO, to an .rda file and send it to me?
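
For reference, saving an object to an .rda file and restoring it can be sketched with base R's `save()` and `load()`. The placeholder object and the file name below are assumptions for illustration; in practice the object would be the result returned by gseGO.

```r
# Stand-in object; in the thread this would be the result returned by gseGO().
res.GSEA.GO <- list(note = "placeholder for the real result object")

# Hypothetical file name, written to the session's temporary directory.
rda_path <- file.path(tempdir(), "res.GSEA.GO.rda")
save(res.GSEA.GO, file = rda_path)

# The recipient restores the object under its original name.
restored <- new.env()
load(rda_path, envir = restored)
stopifnot(identical(restored$res.GSEA.GO, res.GSEA.GO))
```

Loading into a separate environment avoids overwriting an object of the same name in the recipient's workspace.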

mevers commented 8 years ago

Dear Guangchuang.

I did some more testing, and indeed the discontinuities in the phenotype plots have disappeared after sorting the ranked gene list. So all is well. Thanks for looking into this and for your help.

Best regards, Maurits