Understanding the scoreType

alkurowska commented 5 months ago

Hello,

I am running fgsea on my custom_pathways. I have 445 pathways with rather large gene sets (see the histogram plot).

[Image: custom_pathways_distribution]

Additionally, the logFC values that I use for ranking the genes are skewed to the negative side (see the logFC distribution plot).

[Image: DEA_logFC]

For some of the pathways I get an error message indicating that p-values were not calculated properly due to unbalanced gene-level statistic values. This can result in NA values for pval, padj, NES, and log2err, and the message suggests increasing the number of permutations. However, after re-running the analysis with a higher number of permutations, the results did not change and I got the same error.

After reading more about your tool, I decided to use the "pos" or "neg" scoreType. This removed the error for all pathways, even when I used the "pos" scoreType on my data, which is rather skewed towards the negative. The ES values from the initial run with default parameters were mostly negative. After switching to the "pos" scoreType, those pathways ended up with a very low ES, close to zero, whereas the pathways whose initial ES values were positive ended up with high positive values. As I understand it, when using "pos", the tool takes the maximum positive enrichment for each pathway, regardless of the absolute maximum? So the question I am investigating now is the degree of overrepresentation of the pathways in my data, rather than the maximum enrichment in general?

Could you tell me if I understand this correctly? And could you also explain to me why the error doesn't appear anymore? If my data is unbalanced, as in skewed towards the negative, why does the "pos" scoreType work well here?

Thanks!

alkurowska commented 5 months ago

At the same time I ran fgsea for another ranking of the genes (see the logFC distribution plot).

[Image: DEA_logFC]

Here, the data is clearly skewed towards the positive. But I did not get the error, even though the data is "unbalanced". And the results of the "std" and "pos" scoreType are the same.

assaron commented 5 months ago

Hi,

First of all, your gene sets seem to be too big. Preranked GSEA tests the non-randomness of a gene set, and when the gene set is large, it is likely to contain a detectable non-random part that is not biologically significant.

About your questions:

> As I understand, while using "pos", the tool is taking max positive enrichment for each pathway, regardless of the absolute maximum?

Yes.
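To make the difference between the score types concrete, here is a minimal Python sketch of a weighted running-sum enrichment score. It is an illustration only, not fgsea's actual implementation: the function name `enrichment_score`, the `weight` argument, and the toy data are all hypothetical. It assumes the genes are already sorted by decreasing statistic.

```python
def enrichment_score(stats, in_set, score_type="std", weight=1.0):
    """Running-sum enrichment score (illustrative sketch, not fgsea's code).

    stats  -- gene-level statistics, sorted in decreasing order
    in_set -- booleans of the same length, True if the gene is in the pathway
    """
    n, k = len(stats), sum(in_set)
    if k == 0 or k == n:
        raise ValueError("pathway must contain some, but not all, genes")
    # Each hit moves the walk up in proportion to |stat|; each miss moves it down.
    hit_total = sum(abs(s) ** weight for s, m in zip(stats, in_set) if m)
    miss_step = 1.0 / (n - k)
    running = max_dev = min_dev = 0.0
    for s, m in zip(stats, in_set):
        running += abs(s) ** weight / hit_total if m else -miss_step
        max_dev = max(max_dev, running)
        min_dev = min(min_dev, running)
    if score_type == "pos":   # largest positive deviation only
        return max_dev
    if score_type == "neg":   # largest negative deviation only
        return min_dev
    if score_type == "std":   # whichever deviation is larger in absolute value
        return max_dev if max_dev >= -min_dev else min_dev
    raise ValueError("score_type must be 'std', 'pos' or 'neg'")


# Toy ranking (hypothetical data): the pathway genes sit mostly at the bottom.
stats = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0, -4.0, -5.0]
in_set = [False, False, True, False, False, True, True, True]

es_std = enrichment_score(stats, in_set, "std")
es_pos = enrichment_score(stats, in_set, "pos")
es_neg = enrichment_score(stats, in_set, "neg")
print(es_std, es_pos, es_neg)
```

In this toy case the walk never rises above zero, so the "std" and "neg" scores coincide at a strongly negative value while the "pos" score is essentially zero. That matches the behaviour described in the question: pathways with a negative ES under the default setting end up close to zero under "pos".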

> And could you also explain to me why the error doesn't appear anymore?

The GSEA two-sided p-value, as proposed by Subramanian et al., is calculated as the probability of obtaining a score at least as extreme, divided by the probability of obtaining a score of the same sign. This divisor normalizes for unbalanced gene statistics, so that positive and negative enrichments are comparable. However, when the gene statistic is highly skewed, the normalization probability becomes very low and hard to estimate, which results in the unbalanced statistic warning. For one-sided scores ("pos" or "neg"), there is no normalization factor, so there is no such problem.
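The role of the divisor can be sketched with plain counting over a set of null scores. This is a simplified illustration in Python under stated assumptions: the helper names `two_sided_pval` and `one_sided_pval` and the null scores are hypothetical, and fgsea itself uses a more elaborate estimation scheme than this direct empirical ratio.

```python
def two_sided_pval(observed, null_scores):
    """Empirical two-sided p-value: P(at least as extreme) / P(same sign).

    Returns (p_value, n_same_sign); p_value is None when no null score
    shares the sign of the observed score (an NA-like result).
    """
    same_sign = [s for s in null_scores if (s >= 0) == (observed >= 0)]
    if not same_sign:
        return None, 0
    n_extreme = sum(abs(s) >= abs(observed) for s in same_sign)
    return n_extreme / len(same_sign), len(same_sign)


def one_sided_pval(observed, null_scores):
    """Empirical one-sided p-value: no same-sign divisor, all samples are used."""
    n_extreme = sum(s >= observed for s in null_scores)
    return n_extreme / len(null_scores), len(null_scores)


# Hypothetical null scores for a heavily skewed ranking:
# 998 of 1000 random gene sets score negative, only 2 score positive.
null_std = [-0.5 + 0.0003 * i for i in range(998)] + [0.1, 0.4]
p_two, n_two = two_sided_pval(0.3, null_std)
print(p_two, n_two)   # the estimate rests on only n_two same-sign samples

# With no positive null score at all, the ratio cannot be formed.
p_na, n_na = two_sided_pval(0.3, [-0.2, -0.3, -0.4])

# Hypothetical null of one-sided ("pos") scores: every sample contributes.
null_pos = [i / 1000 for i in range(1000)]
p_one, n_one = one_sided_pval(0.3, null_pos)
print(p_one, n_one)
```

The point of the sketch is the second value in each returned pair: for a positive observed score against a negatively skewed null, the two-sided estimate is built from a handful of samples, whereas the one-sided estimate always uses the full set.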

> Here, the data is clearly skewed towards the positive. But I did not get the error, even though the data is "unbalanced". And the results of the "std" and "pos" scoreType are the same.

The error only happens if you have pathways with enrichment scores of the sign opposite to the shift in the gene statistics. I would guess that in your case there are no gene sets with negative enrichment.
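A small simulation can illustrate why only the opposite-sign pathways are affected. The Python sketch below is an assumption-laden toy, not fgsea's procedure: the ranking (200 genes, all with positive statistics), the gene set size of 20, and the helper `std_es` are hypothetical. It scores random gene sets with a two-sided running-sum score and counts how many null scores fall on each side of zero.

```python
import random


def std_es(stats, members):
    """Two-sided running-sum score; stats sorted in decreasing order,
    members is a set of gene indices (illustrative helper)."""
    hit_total = sum(abs(stats[i]) for i in members)
    miss_step = 1.0 / (len(stats) - len(members))
    running = max_dev = min_dev = 0.0
    for i, s in enumerate(stats):
        running += abs(s) / hit_total if i in members else -miss_step
        max_dev = max(max_dev, running)
        min_dev = min(min_dev, running)
    return max_dev if max_dev >= -min_dev else min_dev


rng = random.Random(0)                         # fixed seed for reproducibility
stats = [float(200 - i) for i in range(200)]   # hypothetical ranking, shifted positive

# Null distribution: scores of 1000 random gene sets of size 20.
null = [std_es(stats, set(rng.sample(range(200), 20))) for _ in range(1000)]

n_pos = sum(s >= 0 for s in null)
n_neg = len(null) - n_pos
print("positive null scores:", n_pos)
print("negative null scores:", n_neg)
```

With a ranking shifted this way, most random sets score positive, so a pathway with a positive ES has plenty of same-sign null samples to normalize against. A pathway with a negative ES would have very few, which is the situation that triggers the warning.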

ctlab / fgsea

Understanding the scoreType #155