ctlab / fgsea

Fast Gene Set Enrichment Analysis
Other
379 stars 67 forks source link

direction of pathway change in the results #96

Closed zhangguy closed 1 year ago

zhangguy commented 3 years ago

Hi, Thanks so much for developing this package. I'm struggling to understand the direction of change for the significant pathways in the results. According to the Broad GSEA website, ES means "A positive ES indicates gene set enrichment at the top of the ranked list; a negative ES indicates gene set enrichment at the bottom of the ranked list." So with this in mind I run the example in the vignette

> exampleRanks[1:10]
   170942    109711     18124     12775     72148     16010     11931     13849    241230    665113 
-63.33703 -49.74779 -43.63878 -41.51889 -33.26039 -32.77626 -29.42328 -28.83475 -28.65111 -27.81583 

> fgseaRes <- fgsea(pathways = examplePathways, 
+                   stats    = exampleRanks,
+                   minSize  = 15,
+                   maxSize  = 500)

> head(fgseaRes[order(pval), ])
                       pathway  pval         padj log2err        ES      NES size
1: 5990979_Cell_Cycle,_Mitotic 1e-10 3.906667e-09      NA 0.5594755 2.769070  317
2:          5990980_Cell_Cycle 1e-10 3.906667e-09      NA 0.5388497 2.705894  369
3:     5990981_DNA_Replication 1e-10 3.906667e-09      NA 0.6440006 2.639381   82
4:    5990987_Synthesis_of_DNA 1e-10 3.906667e-09      NA 0.6478555 2.642865   78
5:             5990988_S_Phase 1e-10 3.906667e-09      NA 0.6013092 2.529646   98
6:     5990990_G1_S_Transition 1e-10 3.906667e-09      NA 0.6232905 2.573846   84
                                leadingEdge
1: 66336,66977,12442,107995,66442,12571,...
2: 66336,66977,12442,107995,66442,19361,...
3:  57441,17219,69270,12575,69263,17215,...
4:  17219,69270,12575,69263,17215,68240,...
5:  67849,17219,69270,12575,69263,71988,...
6:  20135,13555,17219,12575,12448,17215,...

And assuming the values in exampleRanks are log2FC from RNA-Seq data, I would naively think "5990979_Cell_Cycle,_Mitotic" pathway is down-regulated: ES/NES >0 --> pathway genes enriched at the top/beginning of the ranked list + the list is ranked in ascending order and minus values (down-regulated genes) are at the beginning of the list. But as a matter of fact the leading edge genes are all up-regulated, which means "5990979_Cell_Cycle,_Mitotic" is perhaps up-regulated:

> fgseaRes[order(pval), ]$leadingEdge[[1]]
  [1] "66336"  "66977"  "12442"  "107995" "66442"  "12571"  "52276"  "54392"  "215387" "67629"  "72415"  "56150" 
 [13] "57441"  "20877"  "67121"  "12615"  "21973"  "11799"  "66468"  "20135"  "67849"  "73804"  "70454"  "76044" 
 [25] "20878"  "13555"  "60411"  "12580"  "17219"  "69270"  "12575"  "69263"  "12448"  "14211"  "20873"  "18005" 
 [37] "71988"  "17215"  "12534"  "208628" "22390"  "68240"  "228421" "68014"  "19348"  "12236"  "72151"  "18817" 
 [49] "21781"  "18968"  "66934"  "272551" "227613" "67141"  "68612"  "68298"  "13361"  "108000" "23834"  "106344"
 [61] "56452"  "242705" "18141"  "223921" "26886"  "13557"  "268697" "72155"  "56371"  "12531"  "327762" "12567" 
 [73] "229841" "67052"  "16319"  "66634"  "67203"  "17865"  "12235"  "19891"  "74470"  "381318" "66570"  "17216" 
 [85] "19687"  "17218"  "102920" "18973"  "16881"  "17463"  "75786"  "19645"  "19075"  "69736"  "19357"  "76816" 
 [97] "70385"  "22628"  "22627"  "52683"  "19076"  "18972"  "12544"  "17997"  "26440"  "68549"  "19088"  "269113"
[109] "26444"  "19324"  "103733" "19179"  "12579"  "232987" "17420"  "228769" "26445"  "105988" "69745"  "18538" 
[121] "69928"  "57296"  "14235"  "22171"  "19170"  "17220"  "50793"  "18392"  "236930" "67151"  "70024"  "59126" 
[133] "66296"  "109145" "71819"  "67733"  "12447"  "12532"  "14156"  "26442"  "19177" 

> exampleRanks[fgseaRes[order(pval), ]$leadingEdge[[1]]]
    66336     66977     12442    107995     66442     12571     52276     54392    215387     67629     72415 
39.099576 32.780539 28.883208 28.369136 26.080504 25.619366 22.362191 22.286145 21.850554 21.602498 20.342400 
    56150     57441     20877     67121     12615     21973     11799     66468     20135     67849     73804 
19.970759 19.690252 19.655502 19.422808 19.186244 19.168818 19.085438 19.044938 19.014049 18.952087 18.617079 
    70454     76044     20878     13555     60411     12580     17219     69270     12575     69263     12448 
18.372291 18.223929 18.146682 17.757605 17.483727 17.442589 17.207697 17.098920 17.090841 17.064031 16.605173 
    14211     20873     18005     71988     17215     12534    208628     22390     68240    228421     68014 
16.533431 16.316307 16.255122 15.981281 15.419719 15.162649 15.101509 14.671479 14.669090 14.484183 14.277380 
    19348     12236     72151     18817     21781     18968     66934    272551    227613     67141     68612 
13.856416 13.819854 13.664357 13.469345 13.326182 13.099133 12.844657 12.763747 12.742726 12.705035 12.093330 
    68298     13361    108000     23834    106344     56452    242705     18141    223921     26886     13557 
12.013490 11.816168 11.794317 11.595685 11.459378 11.256091 11.078580 11.063600 10.928233 10.905617 10.859834 
   268697     72155     56371     12531    327762     12567    229841     67052     16319     66634     67203 
10.675342 10.496719 10.465043 10.316958 10.316755 10.293335 10.012741  9.938188  9.800941  9.763723  9.407897 
    17865     12235     19891     74470    381318     66570     17216     19687     17218    102920     18973 
 9.390812  9.171506  8.935658  8.889337  8.886444  8.861748  8.744532  8.604433  8.601330  8.368943  8.242068 
    16881     17463     75786     19645     19075     69736     19357     76816     70385     22628     22627 
 8.216628  8.151901  8.080499  8.037318  8.034202  7.958835  7.924134  7.914065  7.849681  7.640995  7.587983 
    52683     19076     18972     12544     17997     26440     68549     19088    269113     26444     19324 
 7.545842  7.513623  7.467283  7.344200  7.324556  7.187322  7.126361  7.027971  6.930873  6.806526  6.767448 
   103733     19179     12579    232987     17420    228769     26445    105988     69745     18538     69928 
 6.696744  6.529158  6.525286  6.522285  6.478037  6.438622  6.303510  6.298807  6.294256  6.227249  6.218202 
    57296     14235     22171     19170     17220     50793     18392    236930     67151     70024     59126 
 6.028760  5.989858  5.930712  5.929341  5.699892  5.602002  5.556965  5.447543  5.422654  5.421143  5.337079 
    66296    109145     71819     67733     12447     12532     14156     26442     19177 
 5.294615  5.250755  5.208064  5.198138  5.073297  5.017712  4.999693  4.991733  4.977053 

Now I'm really confused. Furthermore if we shuffle the order of exampleRanks, the gsea results are the same:

> exampleRanks <- sample(exampleRanks)
> exampleRanks[1:10]
    66082    233490     76719     20534     68238     74838    234371     66423     74570    106861 
-6.511741 -4.938649 -6.188603 -3.380223  1.785945 -4.049832 -2.210378  2.155477  2.380274 -2.636506 
> fgseaRes <- fgsea(pathways = examplePathways, 
+                   stats    = exampleRanks,
+                   minSize  = 15,
+                   maxSize  = 500)
> head(fgseaRes[order(pval), ])
                       pathway  pval         padj log2err        ES      NES size
1: 5990979_Cell_Cycle,_Mitotic 1e-10 3.906667e-09      NA 0.5594755 2.714418  317
2:          5990980_Cell_Cycle 1e-10 3.906667e-09      NA 0.5388497 2.653469  369
3:     5990981_DNA_Replication 1e-10 3.906667e-09      NA 0.6440006 2.650686   82
4:    5990987_Synthesis_of_DNA 1e-10 3.906667e-09      NA 0.6478555 2.631610   78
5:             5990988_S_Phase 1e-10 3.906667e-09      NA 0.6013092 2.535833   98
6:     5990990_G1_S_Transition 1e-10 3.906667e-09      NA 0.6232905 2.573791   84
                                leadingEdge
1: 66336,66977,12442,107995,66442,12571,...
2: 66336,66977,12442,107995,66442,19361,...
3:  57441,17219,69270,12575,69263,17215,...
4:  17219,69270,12575,69263,17215,68240,...
5:  67849,17219,69270,12575,69263,71988,...
6:  20135,13555,17219,12575,12448,17215,...

It seems that gsea does an internal ranking based on the values of the vector provided, and it doesn't matter if we pre-rank the gene list or not as long as the values are provided? If so, it might be more intuitive to provide exampleRanks as a shuffled vector and document the internal ranking behavior (ascending / descending) some where in the manual or vignette. Or I'm completely wrong here.

Thanks in advance for your answer

assaron commented 3 years ago

Hi,

exampleRanks is moderated t-statistic, not log2FC, but has the same direction.

ES means "A positive ES indicates gene set enrichment at the top of the ranked list; a negative ES indicates gene set enrichment at the bottom of the ranked list."

In GSEA the list is ranked in descending order. I think the more intuitive description of ES is not based on the ranking directly: high positive ES means enrichment in genes with positive gene-level statistics.

It seems that gsea does an internal ranking based on the values of the vector provided, and it doesn't matter if we pre-rank the gene list or not as long as the values are provided?

Yes, the order of the genes in the input vector does not matter.

If so, it might be more intuitive to provide exampleRanks as a shuffled vector and document the internal ranking behavior (ascending / descending) some where in the manual or vignette.

Makes sense. We'll think about it.

Nelson-Gon commented 2 years ago

HI @assaron I have been looking at this from YuLab-SMU/DOSE# and wondering if there is any update on this bit

If so, it might be more intuitive to provide exampleRanks as a shuffled vector and document the internal ranking behavior (ascending / descending) some where in the manual or vignette.

If the order of the input genes does not matter, does this mean fgsea is always performing GSEAPreranked?

EDIT: Got it, it's always preranked as the docs state.

Thank you, Nelson