Implementation question

mauritsunkel commented 6 months ago

Hi,

When I reverse my DEG with L2FC input list for fgsea I expect that the outcome would be identical but mirrored.

I noticed that this mostly seems the case, however, for some terms there are actually different values for geneRatio for instance.

How can this be? Might this be because of how the step size upward is calculated differently than downward? And would that be a bug, or a feature (for efficiency)?

Best, Maus

assaron commented 6 months ago

Hi,

There could be a difference if there are ties in your ranking, then the order is not fully defined and there could be some differences. Not sure what you mean by gene ratio though.

mauritsunkel commented 6 months ago

Hi Assaron, thanks for your reply,

You were right about the duplicates:

However, these should fall in the middle of the ranked list, and therefore have little effect on the NES based on the running sum of GSEA and these genes are also not be part of the core enriched genes.

The geneRatio comes from the fgsea output table or is calculated in the clusterProfiler::dotplot, a measure of how many genes from your ranked list were matched with the geneset.

I still think the downward step size seems lower than the upward stepsize in the fgsea implementation, especially compared to whats theoretically noted by the GSEA authors: vs fgsea implementation code:

It seems like the upward stepsize is 'normalized' differently than the downward stepsize, or am I confused by a shortcut in the implementation?

Best, Maus

assaron commented 6 months ago

Maus, upward and downward steps are different and accordingly normalized differently. The excerpt you posted is the description of the preliminary version of GSEA (Mootha et al, 2003), which is essentially a Kolmogorov-Smirnov test. The actual GSEA has different steps, here is the excerpt from the main text: There are three key elements of the GSEA method:

Step 1: Calculation of an Enrichment Score. We calculate an enrichment score (ES) that reflects the degree to which a set S is overrepresented at the extremes (top or bottom) of the entire ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter genes not in S. The magnitude of the increment depends on the correlation of the gene with the phenotype. The enrichment score is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic (ref. 7 and Fig. 1B).

assaron commented 6 months ago

Also the problem with the gene ties is similarly present and the Broad's version, so we just follow that.

To check that the problem is indeed there you can add a small random noise to you input gene stats. That would restrict the order of the genes, but won't affect the weights much. Then the results should be the same, independent of the input ordering.

fgsea does the reordering of the genes in the beginning, but it uses a stable sort, that doesn't change the order of the tied genes, that's why you see the difference when you reverse the order. I believe that Broad's version does the same, but I'm not 100% sure.

mauritsunkel commented 6 months ago

Hi Assaron,

Thanks for clearing it up for me!

Even though I do think the statement: "magnitude of the increment depends on the correlation of the gene with the phenotype", is somewhat ambiguous, I'll accept the interpretation of having normalized differently and accordingly!

Nice suggestion for adding some random noise or something like that to control the ordering, I will look into that.

ctlab / fgsea

Implementation question #153