ctlab / fgsea

Fast Gene Set Enrichment Analysis

GSEA to discover mixed enriched terms #154

Closed mauritsunkel closed 3 months ago

mauritsunkel commented 4 months ago

Hi,

Currently, enriched terms are reported as either up- or down-regulated, likely because of the weighting in the running-sum implementation. However, there must be terms with mixed enrichment, where a combination of activation and inhibition makes a term strongly enriched, and the current implementation misses these. I realise this is more the domain of topology-based algorithms applied to terms that represent pathways, but I'm hopeful we could get some additional information out of just the ranked expression list.

To find terms with mixed enrichment, would it be proper, and statistically valid, to take the absolute value of the expression values before running GSEA? Or would further transformations such as scaling or normalization also be needed?

Best, Maus

assaron commented 4 months ago

Hi @mauritsunkel,

Yes, it's entirely appropriate to take the absolute value of the gene-level statistics before running GSEA. This shifts the focus to mixed changes, as opposed to the unidirectional changes that the default approach detects. In this case, also add the scoreType="pos" option, since you're interested in a one-tailed test.
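
To illustrate why the absolute value helps, here is a minimal Python sketch (not fgsea code). The gene names, the statistics, and the helper `positions_in_ranking` are all made up for the example. It shows that a gene set with both strongly up- and strongly down-regulated members is split across the two ends of a signed ranking, but is concentrated at the top once the statistics are made absolute.

```python
def positions_in_ranking(stats, gene_set):
    """Return 0-based ranks of gene_set members when genes are sorted by descending stat."""
    ranked = sorted(stats, key=stats.get, reverse=True)
    return [i for i, gene in enumerate(ranked) if gene in gene_set]


# Hypothetical gene-level statistics (for example t-statistics); values are invented.
signed = {"A": 5.0, "B": 4.0, "C": 1.0, "D": 0.5,
          "E": -0.2, "F": -1.5, "G": -4.5, "H": -6.0}

# A hypothetical "mixed" gene set: two up-regulated and two down-regulated members.
mixed_set = {"A", "B", "G", "H"}

# Absolute statistics: direction is discarded, only the magnitude of change remains.
absolute = {gene: abs(value) for gene, value in signed.items()}

signed_positions = positions_in_ranking(signed, mixed_set)
absolute_positions = positions_in_ranking(absolute, mixed_set)

print(signed_positions)    # members sit at both ends of the signed ranking
print(absolute_positions)  # members sit together at the top of the absolute ranking
```

With every member at the top of the list, only positive enrichment is meaningful, which is why a one-tailed test fits this setup.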

mauritsunkel commented 4 months ago

Hi @assaron,

Thank you! I am using clusterProfiler::GSEA with by = "fgsea" (the fgsea implementation). Could I pass scoreType="pos" there through the ellipsis to the fgsea function? Or is there a specific fgsea function you would recommend?

Also, does scoreType="pos" take into account the positively skewed distribution of abs(expression values) in the sorted ranked list? Or should abs(expression values) be transformed so that the distribution better suits GSEA's running-sum algorithm? If so, what transformation would you recommend?

assaron commented 4 months ago

> I am using clusterProfiler::GSEA with by = "fgsea" (the fgsea implementation). Could I pass scoreType="pos" there through the ellipsis to the fgsea function? Or is there a specific fgsea function you would recommend?

I think so, yes; it should work just fine. In fact, if you don't provide the scoreType argument and all your stats are positive, you should get a warning suggesting scoreType="pos".

> Also, does scoreType="pos" take into account the positively skewed distribution of abs(expression values) in the sorted ranked list? Or should abs(expression values) be transformed so that the distribution better suits GSEA's running-sum algorithm? If so, what transformation would you recommend?

I think it's better to think of (preranked) GSEA as a fancy Kolmogorov-Smirnov test that checks whether the genes of the gene set are distributed non-randomly along the ranking, while giving more weight to more differentially expressed genes based on the absolute values of the stat vector you provide. It doesn't model anything and will work with any distribution. The downside is that this non-randomness could come from sources unrelated to the question you are trying to answer, so you should always be careful when interpreting the results. Otherwise, use whatever works for your case. In our hands, the absolute value of the default t-statistic from limma or the Wald statistic from DESeq2 worked reasonably well.
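
The "weighted Kolmogorov-Smirnov" view can be sketched in a few lines of Python. This is a simplified illustration of a weighted running sum over a preranked list, not the fgsea implementation; the function name `running_sum_extremes`, the `weight` parameter, and the toy data are assumptions made for the example.

```python
def running_sum_extremes(stats, gene_set, weight=1.0):
    """Walk down the ranked list and return (max, min) of a weighted running sum.

    Members of gene_set add |stat|**weight (normalised to sum to 1);
    every other gene subtracts 1 / (number of non-members).
    """
    ranked = sorted(stats, key=stats.get, reverse=True)
    hits = [gene for gene in ranked if gene in gene_set]
    n_miss = len(ranked) - len(hits)
    hit_total = sum(abs(stats[gene]) ** weight for gene in hits)
    if n_miss == 0 or hit_total == 0:
        raise ValueError("need at least one non-member and a non-zero hit weight")

    running, top, bottom = 0.0, 0.0, 0.0
    for gene in ranked:
        if gene in gene_set:
            running += abs(stats[gene]) ** weight / hit_total  # weighted step up
        else:
            running -= 1.0 / n_miss                            # uniform step down
        top = max(top, running)
        bottom = min(bottom, running)
    return top, bottom


# Invented toy data: a gene set with two strongly up and two strongly down members.
signed = {"A": 5.0, "B": 4.0, "C": 1.0, "D": 0.5,
          "E": -0.2, "F": -1.5, "G": -4.5, "H": -6.0}
mixed_set = {"A", "B", "G", "H"}
absolute = {gene: abs(value) for gene, value in signed.items()}

top_signed, bottom_signed = running_sum_extremes(signed, mixed_set)
top_abs, bottom_abs = running_sum_extremes(absolute, mixed_set)
print(top_signed, bottom_signed)  # moderate deviations in both directions
print(top_abs, bottom_abs)        # one large positive deviation
```

Nothing here assumes a particular distribution of the statistics: the values only determine the ordering and the size of the upward steps.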

mauritsunkel commented 3 months ago

Awesome, thanks. Where can I find the implementation code for using scoreType="pos"?

I agree that the (absolute) expression values determine the weights. However, I was thinking of the distribution independently of that, in terms of the (pre)ranking the algorithm uses. I will take a look at the default t-statistic from limma and the Wald statistic from DESeq2. Cheers again!

assaron commented 3 months ago

> Where can I find the implementation code for using scoreType="pos"?

It's more or less the same code as for the standard score type, with just a small switch at the end: https://github.com/ctlab/fgsea/blob/master/R/fgsea.R#L162
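
As a rough illustration of what such a switch could look like, here is a Python sketch. It is not a copy of the linked R code. It assumes the running sum has already produced its largest positive deviation (`top`) and largest negative deviation (`bottom`), and it follows the three documented score types "std", "pos" and "neg". The function name and the tie-breaking rule are assumptions.

```python
def select_enrichment_score(top, bottom, score_type="std"):
    """Pick the enrichment score from the running-sum extremes.

    top    -- largest positive deviation of the running sum (>= 0)
    bottom -- largest negative deviation of the running sum (<= 0)
    """
    if score_type == "pos":
        return top        # one-tailed: only positive enrichment counts
    if score_type == "neg":
        return bottom     # one-tailed: only negative enrichment counts
    if score_type == "std":
        # Two-sided: keep whichever deviation is larger in magnitude.
        # Ties go to the positive side here, which is an arbitrary choice.
        return top if top >= -bottom else bottom
    raise ValueError(f"unknown score_type: {score_type!r}")


# Example extremes for a gene set whose running sum deviates in both directions.
print(select_enrichment_score(0.46, -0.54, "std"))  # -0.54
print(select_enrichment_score(0.46, -0.54, "pos"))  # 0.46
print(select_enrichment_score(0.46, -0.54, "neg"))  # -0.54
```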