YuLab-SMU / DOSE

:mask: Disease Ontology Semantic and Enrichment analysis
https://yulab-smu.top/biomedical-knowledge-mining-book/
117 stars 36 forks source link

enrichGO() gives incorrect results due to different filtering of gene and universe set #11

Closed Gig77 closed 8 years ago

Gig77 commented 8 years ago

When I call enrichGO() from the clusterProfiler package with the exact same Entrez gene IDs for both the "gene" and "universe" parameter (~4000 Entrez IDs in my case), I get a long list of highly significant terms, although really nothing should be significant.

I looked into this issue and it seems that there is a bias introduced when filtering input genes to keep only genes with GO annotations. For the query genes, only genes with annotated GO terms in the specified sub-ontology (i.e. "BP", "CC" or "MF") enter the analysis (which is correct). For background genes, however, all genes with annotated GO terms in any of these three categories are kept, causing an imbalance in GO term frequencies between query and universe gene set.

Can you confirm this?

GuangchuangYu commented 8 years ago

enrichGO do only consider background genes in sub-categories of GO, see https://github.com/GuangchuangYu/clusterProfiler/blob/master/R/enrichGO.R#L126-L132

GuangchuangYu commented 8 years ago

I check the release version, which indeed has the problem you mentioned.

This issue was removed in devel branch which will be released at early of April.

Please test your data with devel branch of DOSE and clusterProfiler.

cb4github commented 8 years ago

Thank you for your contributions!

Q: Has said devel content been released yet? Otherwise, how do I obtain/load it, say, in RStudio.

My result for enrichDO from the published example is truncated to a single pathway - premalignant neoplasm - vice 8 pathways (thoracic cancer, etc.)

Best, CB

GuangchuangYu commented 8 years ago

I got 4 using latest version of DOSE:

> head(x)
                       ID            Description GeneRatio  BgRatio
DOID:1612       DOID:1612          breast cancer    26/147 480/6268
DOID:5093       DOID:5093        thoracic cancer    26/147 480/6268
DOID:0060071 DOID:0060071 pre-malignant neoplasm     5/147  22/6268
DOID:3459       DOID:3459       breast carcinoma    20/147 357/6268
                   pvalue   p.adjust     qvalue
DOID:1612    4.099574e-05 0.01164279 0.01089624
DOID:5093    4.099574e-05 0.01164279 0.01089624
DOID:0060071 1.264612e-04 0.02394332 0.02240803
DOID:3459    2.252906e-04 0.03199127 0.02993994
                                                                                                                                                                geneID
DOID:1612    MMP1/S100A9/FOXM1/S100A8/TOP2A/S100A7/NEK2/CCNA2/MAD2L1/BIRC5/S100P/EZH2/AURKA/CCNB1/PTTG1/CA12/SCGB2A2/WISP2/ERBB4/FOXA1/SCGB1D2/PIP/GATA3/NAT1/PGR/AGR2
DOID:5093    MMP1/S100A9/FOXM1/S100A8/TOP2A/S100A7/NEK2/CCNA2/MAD2L1/BIRC5/S100P/EZH2/AURKA/CCNB1/PTTG1/CA12/SCGB2A2/WISP2/ERBB4/FOXA1/SCGB1D2/PIP/GATA3/NAT1/PGR/AGR2
DOID:0060071                                                                                                                            S100A9/S100A7/MSLN/BIRC5/MMP12
DOID:3459                                       MMP1/S100A9/S100A8/TOP2A/NEK2/CCNA2/MAD2L1/BIRC5/S100P/AURKA/CCNB1/PTTG1/CA12/SCGB2A2/ERBB4/FOXA1/SCGB1D2/PIP/PGR/AGR2
             Count
DOID:1612       26
DOID:5093       26
DOID:0060071     5
DOID:3459       20
cb4github commented 8 years ago

Thanks. FYI, my above result - only 1 pathway - was obtained using RStudio MRO - curious though presumably to be expected.

Also, with R (vice MRO) and the latest DOSE, I do indeed get your result but only per the following.

> head(x)
Error in x[seq_len(n)] : object of type 'S4' is not subsettable
> head(x@result)
                       ID            Description GeneRatio  BgRatio       pvalue
DOID:1612       DOID:1612          breast cancer    26/147 480/6268 4.099574e-05
DOID:5093       DOID:5093        thoracic cancer    26/147 480/6268 4.099574e-05
DOID:0060071 DOID:0060071 pre-malignant neoplasm     5/147  22/6268 1.264612e-04
DOID:3459       DOID:3459       breast carcinoma    20/147 357/6268 2.252906e-04
               p.adjust     qvalue
DOID:1612    0.01164279 0.01089624
DOID:5093    0.01164279 0.01089624
DOID:0060071 0.02394332 0.02240803
DOID:3459    0.03199127 0.02993994
                                                                                                                                                                geneID
DOID:1612    MMP1/S100A9/FOXM1/S100A8/TOP2A/S100A7/NEK2/CCNA2/MAD2L1/BIRC5/S100P/EZH2/AURKA/CCNB1/PTTG1/CA12/SCGB2A2/WISP2/ERBB4/FOXA1/SCGB1D2/PIP/GATA3/NAT1/PGR/AGR2
DOID:5093    MMP1/S100A9/FOXM1/S100A8/TOP2A/S100A7/NEK2/CCNA2/MAD2L1/BIRC5/S100P/EZH2/AURKA/CCNB1/PTTG1/CA12/SCGB2A2/WISP2/ERBB4/FOXA1/SCGB1D2/PIP/GATA3/NAT1/PGR/AGR2
DOID:0060071                                                                                                                            S100A9/S100A7/MSLN/BIRC5/MMP12
DOID:3459                                       MMP1/S100A9/S100A8/TOP2A/NEK2/CCNA2/MAD2L1/BIRC5/S100P/AURKA/CCNB1/PTTG1/CA12/SCGB2A2/ERBB4/FOXA1/SCGB1D2/PIP/PGR/AGR2
             Count
DOID:1612       26
DOID:5093       26
DOID:0060071     5
DOID:3459       20
> 
GuangchuangYu commented 8 years ago
  1. use head directly to enrichResult is currently only available in devel branch.
  2. Do you mean Rstudio gives different result with R console?
cb4github commented 8 years ago

Re 1. Yes, thanks. Re 2. I suspect version control in MRO (Microsoft R Open) is such that I get a different result when I run the DOSE example.