YuLab-SMU / DOSE

:mask: Disease Ontology Semantic and Enrichment analysis
https://yulab-smu.top/biomedical-knowledge-mining-book/
117 stars 36 forks source link

Disrepancy GSEA results when using by="DOSE" or by="fgsea" #18

Open guidohooiveld opened 7 years ago

guidohooiveld commented 7 years ago

Hi, I don't know whether I should ask here, or at the maintainer of fgsea (@assaron), but after analyzing the example DOSE dataset using the GSEA algorithm implemented in either DOSE or fgsea, I noticed the results are IMO quite different. I realize that the results are permutation based, and that the results are therefore expected not to be 100% identical, but i find the observed differences quite large. Could you maybe comment on this whether this is expected? Thanks, Guido

Full code below, but main outcome: gseDO(geneList,by="DOSE") identified 162 significantly regulated gene sets gseDO(geneList,by="fgsea") identified 149 significantly regulated gene sets

But overlap is only 138! Unique for 'DOSE' are 24 gene sets, for 'fgsea' 11.

Also the top regulated Gene Sets are completely different... ??

As said before, I expected some differences, but NOT because 2 different implementations of the same algorithm were used, but rather because of the permutations (that are random).

> packageVersion("DOSE")
[1] ‘3.0.4’
> 
> library(DOSE)
> data(geneList)
> x.DOSE <- gseDO(geneList,by="DOSE")
[1] "preparing geneSet collections..."
[1] "calculating observed enrichment scores..."
[1] "calculating permutation scores..."
[1] "calculating p values..."
[1] "leading edge analysis..."
[1] "done..."
> head(x.DOSE, 5)
                       ID                               Description setSize
DOID:5679       DOID:5679                           retinal disease     299
DOID:0060040 DOID:0060040          pervasive developmental disorder     175
DOID:127         DOID:127                                 leiomyoma      89
DOID:9884       DOID:9884                        muscular dystrophy      90
DOID:0060086 DOID:0060086 female reproductive organ benign neoplasm      42
             enrichmentScore       NES      pvalue   p.adjust    qvalues rank
DOID:5679         -0.3676313 -1.573945 0.001308901 0.03271556 0.02193074 1768
DOID:0060040      -0.3909104 -1.590133 0.001418440 0.03271556 0.02193074 2308
DOID:127          -0.4491341 -1.663010 0.001506024 0.03271556 0.02193074 2237
DOID:9884         -0.4623732 -1.713722 0.001515152 0.03271556 0.02193074 1971
DOID:0060086      -0.5530173 -1.780920 0.001636661 0.03271556 0.02193074 1674

DOID:5679    tags=24%, list=14%, signal=21%
DOID:0060040 tags=30%, list=18%, signal=25%
DOID:127     tags=40%, list=18%, signal=33%
DOID:9884    tags=31%, list=16%, signal=26%
DOID:0060086 tags=45%, list=13%, signal=39%
                                                                                                                                                                                                                                                                                                                                                                  core_enrichment
DOID:5679    2878/3791/23247/80184/6750/7450/596/9187/2034/482/948/1490/1280/5737/4314/4881/3426/187/629/6403/6785/2934/5176/7078/5950/727/10516/4311/2247/1295/358/10203/582/10218/57125/585/1675/6310/2202/4313/2944/4254/3075/2099/3480/4653/6387/1471/857/4016/1909/4053/6678/1296/4915/55812/1191/5654/10631/2697/2952/6935/2200/3479/2006/10451/9370/771/652/4693/5346/1524
DOID:0060040                                                                                                 1760/9732/7337/5175/6532/54806/6326/1499/7157/221037/627/3399/2571/3082/3791/27347/596/22829/23426/324/5021/4885/7248/8604/3397/4208/3400/26470/64221/9037/3952/93664/2944/7102/2550/4915/4922/26960/1746/2697/6863/2891/367/4128/7166/6505/5348/18/9370/57502/79083
DOID:127                                                                                                                                                                                             5175/5595/7157/1288/1543/1027/1958/7450/596/947/595/2246/3912/7507/2308/1277/8991/2247/5468/6469/81029/4313/56034/2944/2099/3480/1462/1289/5744/7043/367/1287/4582/3479/5241
DOID:9884                                                                                                                                                                                                                              825/8106/1291/3643/7402/7450/596/1490/8082/2817/284119/54843/1293/23345/3952/4313/10468/6444/9499/2273/3339/1292/3908/81493/3679/3708/4137
DOID:0060086                                                                                                                                                                                                                                                                                1958/7450/596/947/595/3912/1277/8991/3952/4313/2099/3480/1462/1289/7043/367/3479/5241
> x.fgsea <- gseDO(geneList,by="fgsea")
[1] "preparing geneSet collections..."
[1] "GSEA analysis..."
[1] "leading edge analysis..."
[1] "done..."
> head(x.fgsea, 5)
                   ID            Description setSize enrichmentScore       NES
DOID:9884   DOID:9884     muscular dystrophy      90      -0.4623732 -1.709404
DOID:14221 DOID:14221   metabolic syndrome X      22      -0.7194112 -2.018608
DOID:13359 DOID:13359 Ehlers-Danlos syndrome      11      -0.7647954 -1.784188
DOID:3382   DOID:3382            liposarcoma      12       0.7050486  1.825362
DOID:3939   DOID:3939      lipomatous cancer      12       0.7050486  1.825362
                pvalue   p.adjust   qvalues rank                   leading_edge
DOID:9884  0.001494768 0.03715567 0.0247576 1971 tags=31%, list=16%, signal=26%
DOID:14221 0.001626016 0.03715567 0.0247576  894  tags=36%, list=7%, signal=34%
DOID:13359 0.001773050 0.03715567 0.0247576 1023  tags=55%, list=8%, signal=50%
DOID:3382  0.002304147 0.03715567 0.0247576 1022  tags=42%, list=8%, signal=38%
DOID:3939  0.002304147 0.03715567 0.0247576 1022  tags=42%, list=8%, signal=38%
                                                                                                                                      core_enrichment
DOID:9884  825/8106/1291/3643/7402/7450/596/1490/8082/2817/284119/54843/1293/23345/3952/4313/10468/6444/9499/2273/3339/1292/3908/81493/3679/3708/4137
DOID:14221                                                                                                          5468/3952/2152/185/1489/3479/9370
DOID:13359                                                                                                                   1277/1281/1290/1289/1634
DOID:3382                                                                                                                   7153/1029/1019/7545/10644
DOID:3939                                                                                                                   7153/1029/1019/7545/10644
> 

> DOSE.res <- rownames(as.data.frame(x.DOSE))
> fgsea.res <- rownames(as.data.frame(x.fgsea))
> 
> combined.ID <- union(DOSE.res,fgsea.res)
> C <- cbind(combined.ID %in% DOSE.res, combined.ID %in% fgsea.res)
> colnames(C) <- c("DOSE.res","fgsea.res")
> 
> library(limma)
> vennCounts(C)
  DOSE.res fgsea.res Counts
1        0         0      0
2        0         1     11
3        1         0     24
4        1         1    138
attr(,"class")
[1] "VennCounts"
> 
> vennDiagram(vennCounts(C))
> 
>
assaron commented 7 years ago

I would expect some difference due to randomness of the algorithms. You should get similar difference between different runs of the same algorithm. You'd better compare p-values of the same pathway.

The difference should be smaller if you increase number of permutations. I'd recommend to use at least 10000.

On Mon, Nov 14, 2016, 13:26 Guido Hooiveld notifications@github.com wrote:

Hi, I don't know whether I should ask here, or at the maintainer of fgsea (@assaron https://github.com/assaron), but after analyzing the example DOSE dataset using the GSEA algorithm implemented in either DOSE or fgsea, I noticed the results are IMO quite different. I realize that the results are permutation based, and that the results are therefore expected not to be 100% identical, but i find the observed differences quite large. Could you maybe comment on this whether this is expected? Thanks, Guido

Full code below, but main outcome: gseDO(geneList,by="DOSE") identified 162 different gene sets gseDO(geneList,by="fgsea") identified 149 different gene sets

Overlap is only 138! Unique for 'DOSE' are 24 gene sets, for 'fgsea' 11.

Also the top regulated Gene Sets are completely different... ??

As said before, I expected some differences, but NOT because 2 different implementations of the same algorithm were used, but rather because of the permutations (that are random).

packageVersion("DOSE") [1] ‘3.0.4’

library(DOSE) data(geneList) x.DOSE <- gseDO(geneList,by="DOSE") [1] "preparing geneSet collections..." [1] "calculating observed enrichment scores..." [1] "calculating permutation scores..." [1] "calculating p values..." [1] "leading edge analysis..." [1] "done..." head(x.DOSE, 5) ID Description setSize DOID:5679 DOID:5679 retinal disease 299 DOID:0060040 DOID:0060040 pervasive developmental disorder 175 DOID:127 DOID:127 leiomyoma 89 DOID:9884 DOID:9884 muscular dystrophy 90 DOID:0060086 DOID:0060086 female reproductive organ benign neoplasm 42 enrichmentScore NES pvalue p.adjust qvalues rank DOID:5679 -0.3676313 -1.573945 0.001308901 0.03271556 0.02193074 1768 DOID:0060040 -0.3909104 -1.590133 0.001418440 0.03271556 0.02193074 2308 DOID:127 -0.4491341 -1.663010 0.001506024 0.03271556 0.02193074 2237 DOID:9884 -0.4623732 -1.713722 0.001515152 0.03271556 0.02193074 1971 DOID:0060086 -0.5530173 -1.780920 0.001636661 0.03271556 0.02193074 1674

DOID:5679 tags=24%, list=14%, signal=21% DOID:0060040 tags=30%, list=18%, signal=25% DOID:127 tags=40%, list=18%, signal=33% DOID:9884 tags=31%, list=16%, signal=26% DOID:0060086 tags=45%, list=13%, signal=39% core_enrichment DOID:5679 2878/3791/23247/80184/6750/7450/596/9187/2034/482/948/1490/1280/5737/4314/4881/3426/187/629/6403/6785/2934/5176/7078/5950/727/10516/4311/2247/1295/358/10203/582/10218/57125/585/1675/6310/2202/4313/2944/4254/3075/2099/3480/4653/6387/1471/857/4016/1909/4053/6678/1296/4915/55812/1191/5654/10631/2697/2952/6935/2200/3479/2006/10451/9370/771/652/4693/5346/1524 DOID:0060040 1760/9732/7337/5175/6532/54806/6326/1499/7157/221037/627/3399/2571/3082/3791/27347/596/22829/23426/324/5021/4885/7248/8604/3397/4208/3400/26470/64221/9037/3952/93664/2944/7102/2550/4915/4922/26960/1746/2697/6863/2891/367/4128/7166/6505/5348/18/9370/57502/79083 DOID:127 5175/5595/7157/1288/1543/1027/1958/7450/596/947/595/2246/3912/7507/2308/1277/8991/2247/5468/6469/81029/4313/56034/2944/2099/3480/1462/1289/5744/7043/367/1287/4582/3479/5241 DOID:9884 825/8106/1291/3643/7402/7450/596/1490/8082/2817/284119/54843/1293/23345/3952/4313/10468/6444/9499/2273/3339/1292/3908/81493/3679/3708/4137 DOID:0060086 1958/7450/596/947/595/3912/1277/8991/3952/4313/2099/3480/1462/1289/7043/367/3479/5241

x.fgsea <- gseDO(geneList,by="fgsea") [1] "preparing geneSet collections..." [1] "GSEA analysis..." [1] "leading edge analysis..." [1] "done..." head(x.fgsea, 5) ID Description setSize enrichmentScore NES DOID:9884 DOID:9884 muscular dystrophy 90 -0.4623732 -1.709404 DOID:14221 DOID:14221 metabolic syndrome X 22 -0.7194112 -2.018608 DOID:13359 DOID:13359 Ehlers-Danlos syndrome 11 -0.7647954 -1.784188 DOID:3382 DOID:3382 liposarcoma 12 0.7050486 1.825362 DOID:3939 DOID:3939 lipomatous cancer 12 0.7050486 1.825362 pvalue p.adjust qvalues rank leading_edge DOID:9884 0.001494768 0.03715567 0.0247576 1971 tags=31%, list=16%, signal=26% DOID:14221 0.001626016 0.03715567 0.0247576 894 tags=36%, list=7%, signal=34% DOID:13359 0.001773050 0.03715567 0.0247576 1023 tags=55%, list=8%, signal=50% DOID:3382 0.002304147 0.03715567 0.0247576 1022 tags=42%, list=8%, signal=38% DOID:3939 0.002304147 0.03715567 0.0247576 1022 tags=42%, list=8%, signal=38% core_enrichment DOID:9884 825/8106/1291/3643/7402/7450/596/1490/8082/2817/284119/54843/1293/23345/3952/4313/10468/6444/9499/2273/3339/1292/3908/81493/3679/3708/4137 DOID:14221 5468/3952/2152/185/1489/3479/9370 DOID:13359 1277/1281/1290/1289/1634 DOID:3382 7153/1029/1019/7545/10644 DOID:3939 7153/1029/1019/7545/10644

DOSE.res <- rownames(as.data.frame(x.DOSE)) fgsea.res <- rownames(as.data.frame(x.fgsea))

combined.ID <- union(DOSE.res,fgsea.res) C <- cbind(combined.ID %in% DOSE.res, combined.ID %in% fgsea.res) colnames(C) <- c("DOSE.res","fgsea.res")

library(limma) vennCounts(C) DOSE.res fgsea.res Counts 1 0 0 0 2 0 1 11 3 1 0 24 4 1 1 138 attr(,"class") [1] "VennCounts"

vennDiagram(vennCounts(C))

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/GuangchuangYu/DOSE/issues/18, or mute the thread https://github.com/notifications/unsubscribe-auth/AA_fWaI33fSKVtQvxQSdQuutwCkLocSaks5q-DdSgaJpZM4KxKYS .

guidohooiveld commented 7 years ago

Thanks Alexey for your comments. If I repeatedly run GSEA either using the DOSE or fgsea algorithmic implementation, I noticed that for each of them the overlap of repeated analyses is quite OK. However, the "between-implementation" variation is (much) larger... I'll do some more studies and report back. G

AboubakarAhamada commented 5 years ago

I have a similar problem but for me the gsea results change run by run. The pathways who were on the top of list are not the same when I re-run . What did I wrong ? Thanks.