YuLab-SMU / DOSE

:mask: Disease Ontology Semantic and Enrichment analysis
https://yulab-smu.top/biomedical-knowledge-mining-book/
117 stars 36 forks source link

DOSE::gseaplot v3.4.0 does NOT plot Running Enrichment Score correctly #23

Closed Ni-Ar closed 6 years ago

Ni-Ar commented 6 years ago

Hi Guangchuang,

I just came across a strange behaviour of your gseaplot function. Please have a look at the attached plots: while walking the ranked list the the enrichment score increases instead of decreasing .

I paste here a long list of ENTREZ ID with fold changes to reproduce the same results I get.

I use the current latest version of these packages:

library(DOSE) # v3.4.0
library(clusterProfiler) #3.6.0
library(org.Mm.eg.db) #3.5.0
library(ggplot2) # 2.2.1

Here is some dummy data

df <- read.table(
  text = "
  IDs    fold
  11419  1.8114110
  241303 1.5163073
  69743  1.4280487
  24066  1.3860948
  13003  1.3602215
  329540  1.3439907
  59290  1.3331197
  268515  1.2874561
  13805  1.2488566
  15531  1.2064958
  18519  1.1876188
  20357  1.0839752
  432530  1.0789844
  268935  1.0473863
  242022  1.0408660
  74434  1.0057392
  75607  0.9963117
  231125  0.9871005
  328365  0.9797246
  13638  0.9638270
  28105  0.9485476
  14633  0.9351837
  21420  0.8928917
  65970  0.8404108
  239273  0.8394041
  20449  0.8318420
  27015  0.8285963
  64297  0.8237914
  217684  0.8118429
  242509  0.8114177
  14632  0.8091203
  72043  0.7655299
  240892  0.7579085
  26448  0.7491474
  435145  0.7460085
  14472  0.7449317
  66659  0.7399364
  17423  0.7394363
  228608  0.7385895
  18508  0.7330988
  17355  0.7245962
  320615  0.7227551
  20443  0.7143673
  213053  0.7094934
  20602  0.6932182
  80750  0.6842234
  57783  0.6794550
  106504  0.6765064
  57330  0.6757687
  11426  0.6711524
  71834  0.6636287
  11569  0.6605111
  26364  0.6572011
  17454  0.6539409
  83921  0.6433791
  319586  0.6433313
  77053  0.6390934
  14739  0.6219398
  20689  0.6212313
  12531  0.6190081
  20741  0.6110583
  67916  0.6014080
  102635415  0.6010881
  66860  0.5996304
  13663  0.5963491
  69674  0.5919567
  240505  0.5863152
  59049  0.5844473
  116904  0.5734434
  192287  0.5706026
  50781  0.5644217
  23808  0.5568801
  432582  0.5527246
  268510  0.5382260
  217430  0.5269599
  320632  0.5247087
  238317  0.5219475
  140630  0.5205014
  17690  0.5181031
  18590  0.5075945
  217203  0.4992899
  235047  0.4917431
  77521  0.4904518
  26914  0.4836161
  230793  0.4809664
  232975  0.4808352
  244152  0.4748497
  327799  0.4704922
  104383  0.4576220
  21848  0.4566733
  78586  0.4534367
  225115  0.4522587
  381694  0.4505329
  106248  0.4430157
  83768  0.4408037
  11776  0.4372013
  208691  0.4293725
  18973  0.4258840
  71529  0.4251805
  235907  0.4180101
  13039 -0.4091740
  26992 -0.4174514
  110175 -0.4221283
  67068 -0.4226254
  11702 -0.4257572
  54369 -0.4295766
  66048 -0.4300320
  110960 -0.4329602
  11974 -0.4335997
  12448 -0.4343772
  66965 -0.4352817
  14227 -0.4407184
  67732 -0.4426311
  56438 -0.4462447
  66169 -0.4499744
  14109 -0.4521151
  56360 -0.4526767
  12306 -0.4532370
  22143 -0.4543185
  16548 -0.4569409
  68592 -0.4582758
  67873 -0.4589291
  19106 -0.4594070
  18715 -0.4603868
  107047 -0.4622102
  227715 -0.4636694
  407819 -0.4641168
  66617 -0.4654829
  50850 -0.4658221
  50887 -0.4659978
  234023 -0.4668781
  223626 -0.4698801
  66151 -0.4711980
  66373 -0.4712438
  67533 -0.4720882
  66152 -0.4721300
  19981 -0.4722849
  105083 -0.4761409
  50918 -0.4768362
  66488 -0.4807836
  328092 -0.4822305
  18938 -0.4829784
  14635 -0.4840790
  11844 -0.4848694
  16782 -0.4857298
  72083 -0.4861101
  16765 -0.4873802
  66181 -0.4891551
  52575 -0.4938584
  53375 -0.4942472
  231887 -0.4962606
  66234 -0.4964732
  74178 -0.4984157
  474160 -0.4996522
  70737 -0.4996648
  14381 -0.5015904
  23802 -0.5023105
  68002 -0.5077146
  59032 -0.5086173
  19655 -0.5092183
  17164 -0.5092912
  29805 -0.5124795
  22278 -0.5125646
  17349 -0.5152144
  67669 -0.5207834
  103742 -0.5249725
  76483 -0.5268463
  59021 -0.5284042
  68151 -0.5306647
  69752 -0.5313681
  15473 -0.5318255
  67023 -0.5329806
  75234 -0.5331870
  11520 -0.5335651
  20539 -0.5337655
  72899 -0.5341078
  21371 -0.5342737
  382423 -0.5343708
  232334 -0.5344406
  13806 -0.5346213
  19053 -0.5361065
  72554 -0.5372592
  73158 -0.5373816
  28200 -0.5387190
  50911 -0.5403000
  67863 -0.5406494
  384009 -0.5414780
  22192 -0.5441624
  66462 -0.5463024
  13835 -0.5485490
  67199 -0.5487169
  22142 -0.5488152
  12048 -0.5504410
  100340 -0.5505587
  18991 -0.5509128
  68642 -0.5528110
  15260 -0.5529886
  68011 -0.5532082
  236539 -0.5536933
  20187 -0.5538852
  16188 -0.5545666
  67399 -0.5552210
  54636 -0.5565188
  66603 -0.5566726
  12010 -0.5567134
  18038 -0.5578204
  20195 -0.5621664
  66928 -0.5643606
  14923 -0.5670511
  21664 -0.5707177
  100503572 -0.5707706
  268822 -0.5720588
  104771 -0.5722843
  13144 -0.5745658
  15368 -0.5755798
  18950 -0.5756777
  50783 -0.5782966
  19291 -0.5787480
  14225 -0.5794338
  52428 -0.5816460
  12913 -0.5818330
  406217 -0.5833224
  20922 -0.5842841
  65113 -0.5865475
  66531 -0.5868339
  13014 -0.5874165
  74776 -0.5877347
  66497 -0.5899444
  17991 -0.5906154
  11951 -0.5910920
  21912 -0.5918880
  613254 -0.5920024
  70082 -0.5926376
  66493 -0.5941234
  15461 -0.5945450
  17117 -0.5950065
  66922 -0.5957754
  14062 -0.5975696
  68944 -0.6003287
  433926 -0.6052443
  28126 -0.6082019
  240514 -0.6107853
  66916 -0.6108003
  14066 -0.6112735
  69113 -0.6132038
  19240 -0.6132045
  66272 -0.6138644
  12412 -0.6154761
  15901 -0.6188316
  69260 -0.6206147
  102462 -0.6215826
  27279 -0.6226238
  230249 -0.6242083
  14232 -0.6248087
  66291 -0.6264662
  68202 -0.6269890
  108147 -0.6303419
  66357 -0.6328645
  73884 -0.6330426
  22004 -0.6341173
  12334 -0.6344510
  11739 -0.6352720
  59022 -0.6377582
  18643 -0.6495493
  51800 -0.6520316
  55948 -0.6543190
  56405 -0.6565285
  12033 -0.6568789
  218121 -0.6607492
  69550 -0.6653172
  20603 -0.6673787
  56075 -0.6710040
  109075 -0.6730239
  94065 -0.6730290
  17768 -0.6747015
  12406 -0.6767111
  19360 -0.6788345
  66096 -0.6794458
  50724 -0.6795209
  226591 -0.6871474
  16525 -0.6881457
  80907 -0.6903230
  13427 -0.6945638
  77006 -0.6948273
  15254 -0.6980982
  17750 -0.7072995
  66167 -0.7098551
  56726 -0.7112438
  15369 -0.7171666
  26611 -0.7195933
  66044 -0.7196741
  70358 -0.7226666
  53624 -0.7297050
  19294 -0.7312160
  12757 -0.7327428
  70546 -0.7330421
  110006 -0.7396354
  66506 -0.7452908
  83485 -0.7545162
  69185 -0.7584184
  107686 -0.7615980
  17748 -0.7635412
  17846 -0.7659095
  66421 -0.7663604
  13032 -0.7753756
  70257 -0.7817643
  56372 -0.7912612
  12759 -0.8047480
  170748 -0.8087696
  16400 -0.8124661
  26377 -0.8178300
  72938 -0.8201042
  76561 -0.8219949
  19188 -0.8452431
  19302 -0.8554460
  71354 -0.8712384
  104725 -0.8762798
  66124 -0.8776357
  93842 -0.8782938
  68041 -0.8872684
  20364 -0.8924618
  16691 -0.9124366
  72114 -0.9172226
  17698 -0.9188080
  100342 -0.9217681
  14702 -0.9308902
  56320 -0.9549968
  56496 -0.9682712
  74551 -0.9816405
  19186 -0.9840911
  382030 -0.9912939
  68026 -0.9937836
  21753 -1.0562861
  13435 -1.0660938
  67009 -1.0820322
  12332 -1.0860237
  19716 -1.0991556
  17751 -1.1183109
  230678 -1.1675092
  52480 -1.1952743
  20194 -1.2182910
  18769 -1.2474444
  229731 -1.2713356
  18793 -1.3165453
  104816 -1.4250049
  242408 -1.5389930
  16668 -1.5971463
  93761 -1.7489666
  232943 -1.7836340
  20200 -2.1023450
  16952 -3.4416559
  ",  header = TRUE)

and I create a ranked list like this

gene_list <- df$fold
names(gene_list) <- df$IDs

I checked if there are duplicates in my data that could possible cause this behaviour, but there were none.

table(duplicated(names(gene_list)))
# FALSE 
#  351 

The data itself looks like this when plotted

plot(gene_list)

image

So when I do a GSEA I call gseGO with these parameters. The IDs are mouse.

gse_obj <- gseGO(geneList     = gene_list,
                 OrgDb        = org.Mm.eg.db,
                 ont          = "ALL",
                 exponent     = 1,
                 nPerm        = 10000,
                 minGSSize    = 5,
                 maxGSSize    = 500,
                 pvalueCutoff = 0.1,
                 pAdjustMethod= "BH",
                 verbose      = TRUE,
                 seed         = TRUE)
# preparing geneSet collections...
# GSEA analysis...
# leading edge analysis...
# done...

With this threshold I find 94 (nrow(as.data.frame(gse_obj))) significant GO terms Until here I had no problems, but if I start plotting the data I feel something is going wrong.

For example:

gseaplot(gse_obj, "GO:0000975", by = "runningScore")

returns this plot: image

You can see that where the Running enrichment Score should be decreasing because there are no genes in such GO term the line remains flat and increases the more you progress along the ranked list instead of decreasing. Now I'm not an expert of GSEA but I think this behaviour is not correct, right?

In addition, to figure out what was going wrong I tried to re-create locally your help functions gseaScores that I copied-pasted from here and the gsInfo that I took from here that gseaplot is using.
If add those function to my local environment and make a copy version of gseaplot called local_gseaplot which are identical and differ only in the name, in such a case the plots looks correct to me.

Here are your functions that I used to try to fix this plotting problem:

gseaScores <- function(geneList, geneSet, exponent=1, fortify=FALSE) {

  ## genes defined in geneSet should appear in geneList.
  geneSet <- intersect(geneSet, names(geneList))

  N <- length(geneList)
  Nh <- length(geneSet)

  Phit <- Pmiss <- numeric(N)
  hits <- names(geneList) %in% geneSet ## logical

  Phit[hits] <- abs(geneList[hits])^exponent
  NR <- sum(Phit)
  Phit <- cumsum(Phit/NR)

  Pmiss[!hits] <-  1/(N-Nh)
  Pmiss <- cumsum(Pmiss)

  runningES <- Phit - Pmiss

  ## ES is the maximum deviation from zero of Phit-Pmiss
  max.ES <- max(runningES)
  min.ES <- min(runningES)
  if( abs(max.ES) > abs(min.ES) ) {
    ES <- max.ES
  } else {
    ES <- min.ES
  }

  if(fortify==TRUE) {
    df <- data.frame(x=seq_along(runningES),
                     runningScore=runningES,
                     position=as.integer(hits)
    )
    return(df)
  }
  return (ES)
}
gsInfo <- function(object, geneSetID) {
  geneList <- object@geneList

  if (is.numeric(geneSetID))
    geneSetID <- object@result[geneSetID, "ID"]

  geneSet <- object@geneSets[[geneSetID]]
  exponent <- object@params[["exponent"]]
  df <- gseaScores(geneList, geneSet, exponent, fortify=TRUE)
  df$ymin=0
  df$ymax=0
  pos <- df$position == 1
  h <- diff(range(df$runningScore))/20
  df$ymin[pos] <- -h
  df$ymax[pos] <- h
  df$geneList <- geneList

  return(df)
}

exactly identical copy of your gseaplot function

local_gseaplot <- function (gseaResult, geneSetID, by = "all", title = "", color = "black", 
                    color.line = "green", color.vline = "#FA5860") 
{
  by <- match.arg(by, c("runningScore", "preranked", "all"))
  x <- y <- ymin <- ymax <- xend <- yend <- runningScore <- es <- pos <- geneList <- NULL
  gsdata <- gsInfo(gseaResult, geneSetID)
  p <- ggplot(gsdata, aes(x = x)) + theme_dose() + xlab("Position in the Ranked List of Genes")
  if (by == "runningScore" || by == "all") {
    p.res <- p + geom_linerange(aes(ymin = ymin, ymax = ymax), 
                                color = color)
    p.res <- p.res + geom_line(aes(y = runningScore), color = color.line, 
                               size = 1)
    enrichmentScore <- gseaResult@result[geneSetID, "enrichmentScore"]
    es.df <- data.frame(es = which.min(abs(p$data$runningScore - 
                                             enrichmentScore)))
    p.res <- p.res + geom_vline(data = es.df, aes(xintercept = es), 
                                colour = color.vline, linetype = "dashed")
    p.res <- p.res + ylab("Running Enrichment Score")
    p.res <- p.res + geom_hline(aes(yintercept = 0))
  }
  if (by == "preranked" || by == "all") {
    df2 <- data.frame(x = which(p$data$position == 1))
    df2$y <- p$data$geneList[df2$x]
    p.pos <- p + geom_segment(data = df2, aes(x = x, xend = x, 
                                              y = y, yend = 0), color = color)
    p.pos <- p.pos + ylab("Ranked list metric") + xlim(0, 
                                                       length(p$data$geneList))
  }
  if (by == "runningScore") 
    return(p.res)
  if (by == "preranked") 
    return(p.pos)
  p.pos <- p.pos + xlab(NULL) + theme(axis.text.x = element_blank(), 
                                      axis.ticks.x = element_blank())
  gp1 <- ggplot_gtable(ggplot_build(p.res))
  gp2 <- ggplot_gtable(ggplot_build(p.pos))
  maxWidth = unit.pmax(gp1$widths[2:3], gp2$widths[2:3])
  gp1$widths[2:3] <- maxWidth
  gp2$widths[2:3] <- maxWidth
  text.params <- gpar(fontsize = 15, fontface = "bold", lineheight = 0.8)
  textgp <- textGrob(title, gp = text.params)
  if (dev.interactive()) 
    grid.newpage()
  pushViewport(viewport(layout = grid.layout(3, 1, heights = unit(c(0.1, 
                                                                    0.7, 0.7), "null"))))
  gp2$vp = viewport(layout.pos.row = 2, layout.pos.col = 1)
  grid.draw(gp2)
  gp1$vp = viewport(layout.pos.row = 3, layout.pos.col = 1)
  grid.draw(gp1)
  textgp$vp = viewport(layout.pos.row = 1, layout.pos.col = 1)
  grid.draw(textgp)
  invisible(list(runningScore = p.res, preranked = p.pos))
}

and in fact when I try to plot the same GO term with the same data:

local_gseaplot(gse_obj, "GO:0000975", by = "runningScore")

it gives me a correct plot! image

Do you think this could be a bug?

I hope this issue is not a duplicate, I googled and could not find any other similar issue.

This is my sessionInfo()

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_IE.UTF-8/en_IE.UTF-8/en_IE.UTF-8/C/en_IE.UTF-8/en_IE.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base  

other attached packages:
 [1] ggplot2_2.2.1              org.Mm.eg.db_3.5.0         AnnotationDbi_1.40.0       clusterProfiler_3.6.0     
 [5] DOSE_3.4.0                 DESeq2_1.18.1              SummarizedExperiment_1.8.1 DelayedArray_0.4.1        
 [9] matrixStats_0.53.1         Biobase_2.38.0             GenomicRanges_1.30.3       GenomeInfoDb_1.14.0       
[13] IRanges_2.12.0             S4Vectors_0.16.0           BiocGenerics_0.24.0   
GuangchuangYu commented 6 years ago

Thanks for your feedback. Yes, it is a bug and has been fixed.

Ni-Ar commented 6 years ago

Thanks for the support! I still have 2 more questions:

  1. Will there still be a gseaplot function in the future releases of DOSE?
  2. If I install DOSE via devtools::install_github("GuangchuangYu/DOSE") will I now be able to plot correctly with gseaplot?
GuangchuangYu commented 6 years ago

No. DOSE will not contain any plot function in future release, you should refer to the enrichplot package.

For update, pls use:

devtools::install_github(c("GuangchuangYu/DOSE", "GuangchuangYu/enrichplot", "GuangchuangYu/clusterProfiler"))
Ni-Ar commented 6 years ago

I see, good to know! Thank you very much for clarifying. :)

Ni-Ar commented 6 years ago

Sorry to bother again I just want to highlight that if one goes for the latest versions with devtools::install_github(c("GuangchuangYu/DOSE", "GuangchuangYu/enrichplot", "GuangchuangYu/clusterProfiler")) this might cause some conflicts when using ChIPSeeker package as well.

library("ChIPseeker")
# Error: object ‘upsetplot’ is not exported by 'namespace:DOSE'

This is not really a problem for me now. I guess this is just a temporary issue, I think it will be fixed with the next release of Bioconductor 3.7.

GuangchuangYu commented 6 years ago

Yes, it will be fine in next release.

sghoshuc commented 6 years ago

how do I increase the font size in gseaplot?

Ni-Ar commented 6 years ago

Here is how I would do it: gseaplot function returns a ggplot object so you could still add layers as you would do like this for example ggplot( ... ) + theme(text = element_text(size = 18) which should increase all the text of the plot to size 18.

I use this trick to add information in the title of plot as shown in the example below where I use the labs(title = ...) layer to add information to the title of the plot.

# Call GSEA function
obj_gse <- gseGO(...)
# Coerc GSEA object into dataframe
df_gse <- as.data.frame(obj_gse)

# Plot the "ID_GO_Term" you like and have the Description, NES and adjusted pvalue in the title
gseaplot(obj_gse, geneSetID = "ID_GO_Term", color.line = "springgreen3", 
           color = "gray89", by = "runningScore") + # or  by = preranked
    labs(title=paste0("GO term: ",
                      df_gse[df_gse$ID==ID_GO_Term, "Description"],
                      ". NES: ",
                      round(df_gse[df_gse$ID==ID_GO_Term,"NES"], 2),
                      ". -log10(padj): ",
                      round(-log10(df_gse[df_gse$ID==ID_GO_Term,"p.adjust"]), 2) ), 
         y = "")

So to address your question it really depends what you mean by "font", what layer are you interested in? You should be able to change specific elements of the plot using theme(...), have a look here.