federicogiorgi / corto

corto (Correlation Tool): an R package to generate correlation-based DPI networks

NES and ssgsea #11

Closed chunxuan-hs closed 1 year ago

chunxuan-hs commented 1 year ago

The implementation of ssgsea is very interesting. Could you explain how the NES is calculated? Typically that involves permutations, but I don't understand how this is achieved here.

federicogiorgi commented 1 year ago

Hi! ssGSEA uses analytical rank-based enrichment analysis (aREA), as described here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5541669/

It is a fast implementation of GSEA for large scale data that extracts p-values in an analytical, not empirical (i.e. permutation-based) way.
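
To illustrate what "analytical" means here, a minimal sketch, assuming the NES behaves like a z-score under a standard normal null: a p-value can then be read from the normal distribution without running any permutations. The function name `nes_to_pvalue` is hypothetical and is not part of corto; the sketch is in Python rather than R and uses only the standard library.

```python
from statistics import NormalDist


def nes_to_pvalue(nes: float) -> float:
    """Two-sided p-value for a z-like score under a standard normal null."""
    # Assumption: the NES is approximately N(0, 1) when there is no enrichment.
    return 2.0 * (1.0 - NormalDist().cdf(abs(nes)))


# Hypothetical NES values, only to show the mapping from score to p-value.
for nes in (0.0, 1.96, 3.0):
    print(nes, nes_to_pvalue(nes))
```

An empirical approach would instead recompute the score on many permuted datasets and count how often the permuted score exceeds the observed one.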

chunxuan-hs commented 1 year ago

Many thanks for the quick reply!

My understanding is that the rank of gene expression is converted to a normally distributed statistic, right?

But what is the purpose of this: `nrichmentScore <- relativematches %*% gaussian`? I am missing the point of the most important part.

In the paper "Functional characterization of somatic mutations in cancer using network-based inference of protein activity", it is not clear why "Regarding this last point, given the linear nature of the mean-based enrichment score, its computation across the elevated number of permutations required to generate the null model can be performed very efficiently by matrix operations". Would you mind explaining this in a few words or providing some references?

Many thanks!

federicogiorgi commented 1 year ago

You are correct about the Gaussian-distributed statistic.

This `relativematches %*% gaussian` is a matrix multiplication that sits at the core of aREA's speed. You have two matrices: one containing the network weights (centroids x targets), and one containing the normalized expression (targets x samples). Multiplying the two yields a centroids x samples matrix that provides the NES of each centroid in each sample.

Please note that "pathway" can be used instead of "centroid". The only difference is that pathways have gene members with a binary weight of 1 or 0 (the gene is either in the pathway or not), while centroids have weights proportional to the Pearson correlation score.
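
The two steps discussed above (rank-to-Gaussian transform, then one matrix product) can be sketched as follows. This is a minimal illustration in Python with the standard library only, not corto's R code: the toy data, the helper names `rank_to_gaussian` and `matmul`, the tie handling, and the division of ranks by `n + 1` are all assumptions. The final scaling by the square root of the pathway size follows the description given later in this thread.

```python
from statistics import NormalDist


def rank_to_gaussian(values):
    """Map one sample's values to N(0, 1) quantiles of their fractional ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for position, index in enumerate(order, start=1):
        ranks[index] = position  # ties are broken by input order (a simplification)
    # Dividing by n + 1 keeps every fractional rank strictly inside (0, 1).
    return [NormalDist().inv_cdf(r / (n + 1)) for r in ranks]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical toy expression matrix: 4 targets (genes) x 2 samples.
expression = [
    [5.0, 1.0],
    [3.0, 2.0],
    [9.0, 0.5],
    [1.0, 7.0],
]

# Transform each sample (column) independently, then restore targets x samples.
columns = [rank_to_gaussian(list(col)) for col in zip(*expression)]
gaussian = [list(row) for row in zip(*columns)]

# Binary membership, pathways x targets; each row is divided by the pathway
# size so that the matrix product gives the mean Gaussian value per pathway.
membership = [
    [1, 0, 1, 0],  # pathway A contains targets 1 and 3
    [0, 1, 0, 1],  # pathway B contains targets 2 and 4
]
relativematches = [[m / sum(row) for m in row] for row in membership]

# One matrix product scores every pathway in every sample at once.
enrichment = matmul(relativematches, gaussian)  # pathways x samples

# Scale each mean by sqrt(pathway size) to put it on a z-score scale.
sizes = [sum(row) for row in membership]
nes = [[es * size ** 0.5 for es in row] for row, size in zip(enrichment, sizes)]

for name, row in zip(("A", "B"), nes):
    print(name, row)
```

Replacing the 1/0 membership rows with correlation-derived weights would give the centroid case; the matrix product itself stays the same.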

The inventor of the method (to whom all credit should go) is actually Mariano J. Alvarez https://scholar.google.com/citations?user=ZIf1dycAAAAJ&hl=en

chunxuan-hs commented 1 year ago

Thanks for your time! Do you know of a reference explaining why the NES can be calculated that way without the need for permutation? That is the most difficult part for me.

To avoid confusion, I would like to elaborate on the NES I was talking about. In the ssGSEA paper, the Enrichment Score (ES) is calculated by integrating the difference in ranks, much like GSEA. However, in GSEA, the Normalized Enrichment Score (NES) is calculated by dividing the ES by the mean of a null distribution of ES values, obtained by permuting samples to generate new statistics.

In ssgsea2.0, permutation is used to get the NES. And if nperm = 1, the ES is just the NES.

In corto, the rank is converted to a quantile, and further converted to a value in the range of [-1, 1] following an N(0, 1) distribution. Then a statistic is calculated by integrating all values and is reported as the NES after taking the square root. So without permutation, does the NES score actually mean the ES (Enrichment Score)?

Update: I think I understand why no permutation is needed. Here the statistic, ES, is the mean of n normally distributed values, each following N(0, 1). The ES therefore follows N(0, 1/sqrt(n)), i.e. it has standard deviation 1/sqrt(n), and the NES is the corresponding z-score, which allows comparison across datasets. Here the NES is different from how ssGSEA 2.0 calculates it (ES/mean(NULL_ES)).
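
The reasoning in the update can be checked numerically. Below is a minimal simulation sketch in Python (standard library only), assuming the n values in a gene set behave like independent N(0, 1) draws; rank-derived values within one sample are not strictly independent, so this is an approximation. The set size, trial count, and seed are arbitrary choices.

```python
import random
from statistics import pstdev

random.seed(0)  # fixed seed so the run is reproducible

n = 25         # hypothetical gene set size
trials = 5000  # number of simulated null gene sets

# ES under the null: the mean of n independent N(0, 1) values.
es = [sum(random.gauss(0.0, 1.0) for _ in range(n)) / n for _ in range(trials)]

# NES: the ES rescaled by sqrt(n), which should give unit standard deviation.
nes = [e * n ** 0.5 for e in es]

sd_es = pstdev(es)    # expected to be close to 1 / sqrt(n)
sd_nes = pstdev(nes)  # expected to be close to 1

print("sd(ES): ", sd_es, " expected about", 1 / n ** 0.5)
print("sd(NES):", sd_nes, " expected about 1")
```

Because the null distribution of the scaled score is known in closed form, its standard deviation does not have to be estimated from permutations.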