differences in # leading edge genes returned: EM vs GSEA. Due to filtering of the GMT file (?)

guidohooiveld commented 4 years ago

Hi, when using the EM plugin to select the leading edge genes for a gene set, I noticed that they do NOT match those returned in the GSEA results files. For many gene sets, the number of leading edge genes returned by EM is smaller than the number found in the GSEA output. See screenshot.

[Screenshot: EM_leading_edge]

A larger version of the image is uploaded here: https://ibb.co/QfCJ2mV

I think this may be because EM considers, for each gene set, all genes present in the GMT file, whereas GSEA first removes from the GMT file the genes that are not present in the input data set, and only then calculates the GSEA metrics. This also explains the presence of the 'grey' genes. See: https://github.com/BaderLab/EnrichmentMapApp/issues/250

Question: [if the above is correct] could this filtering out of genes that are missing from the dataset (but present in the GMT) also be implemented in the EM plugin? Alternatively, could these 'grey' genes be ignored when calculating the leading edge genes (which is done, I believe, using the percentages provided by GSEA: the TAGS, LIST and SIGNAL values in the 'LEADING EDGE' column)?
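To make the proposed filtering concrete, here is a minimal Python sketch. It assumes the usual tab-separated GMT layout (set name, description, then gene identifiers). The function name `filter_gmt_lines` and the gene symbols are hypothetical, and this is not a description of how EM or GSEA implement the step.

```python
def filter_gmt_lines(gmt_lines, dataset_genes):
    """Keep only the genes of each gene set that occur in the dataset."""
    dataset_genes = set(dataset_genes)
    filtered = {}
    for line in gmt_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, genes = fields[0], fields[2:]
        # Genes absent from the dataset (the 'grey' genes) are dropped here.
        filtered[name] = [g for g in genes if g and g in dataset_genes]
    return filtered


# Hypothetical toy input: one gene set of four genes, two of which were measured.
gmt = ["SET_A\tdescription\tG1\tG2\tG3\tG4"]
measured = {"G1", "G3", "G5"}
result = filter_gmt_lines(gmt, measured)
print(result)
```

With such a filter applied up front, the gene set size used for any later percentage-based calculation would be the filtered size rather than the size in the GMT file.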

risserlin commented 4 years ago

We have always had issues with the leading edge. Unfortunately, the GSEA results record it only as the position in the ranked list up to which the leading edge extends, and the ranked list is numbered from one end for one condition and from the other end for the opposite condition. This is an issue that we have dealt with multiple times.
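For reference, GSEA defines the leading edge as the gene set members ranked at or before the running-sum peak for a positive enrichment score, and those ranked from the peak onward for a negative one. A minimal Python sketch of that definition follows. The function `leading_edge`, the parameter `peak_index` (assumed here to be zero-based and counted from the top of the ranked list) and the toy data are hypothetical, and none of this reflects the actual EM code.

```python
def leading_edge(ranked_genes, gene_set, peak_index, positive_es):
    """Return gene set members on the leading-edge side of the peak.

    peak_index is assumed to be zero-based, counted from the top of the list.
    """
    members = set(gene_set)
    if positive_es:
        window = ranked_genes[: peak_index + 1]  # top of the list down to the peak
    else:
        window = ranked_genes[peak_index:]  # the peak down to the bottom of the list
    return [g for g in window if g in members]


# Hypothetical ranked list and gene set.
ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
gene_set = ["G1", "G3", "G6"]
up = leading_edge(ranked, gene_set, peak_index=2, positive_es=True)
down = leading_edge(ranked, gene_set, peak_index=4, positive_es=False)
print(up, down)
```

If the recorded position were instead counted from the bottom of the list for one of the two conditions, it would first have to be converted, and an off-by-one slip in that conversion would change the count by a single gene.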

There is already a way to filter the datasets so that they only include the genes the analysis was done with. In the EM dialog, select "filter by expression" and that will filter all the gene sets by the expression data set used for the analysis.

Can you try doing that to see if it fixes your leading edge issue? I would be very interested in knowing whether we introduced this issue when we added the ability to use the entire GMT file instead of the filtered version.

We introduced that feature so that your EM can remain consistent between runs, because when you filter to the expression set the connections between gene sets can change.

guidohooiveld commented 4 years ago

Thanks for replying so quickly!

Although I saw it, I did not understand the exact meaning of the option 'filter by expression' [I thought all genes below a certain expression level/signal in the 'expressions' file would be removed]. Your remark clarified that! (Apologies if it is already there, but since I could not find it, I would suggest adding it to the documentation as well.)

After re-running the EM plugin with the option 'filter by expression' checked, the sizes of the gene sets indeed match those reported in GSEA (so all 'grey' genes are absent now). Nice!

However, the leading edge genes are (still) not always identical. I checked multiple gene sets, and the EM plugin often reports 1, and sometimes 2, fewer leading edge genes than GSEA. But in one case I checked, I could not reproduce the GSEA leading edge either...??? For the example I took the screenshot from: the total size indeed became 70 (no longer 79), but still 'only' (the same) 16 genes are reported as leading edge (GSEA reports 18).

If I manually calculate the number of leading edge genes by multiplying, for each gene set, the SIZE by %TAGS and then rounding, I would expect to reproduce the number of leading edge genes. For my example: 70 genes x 23% = 16.1 (rounded = 16). EM thus reported 16, but GSEA somehow 18...??? However, for another gene set: 60 genes x 32% = 19.2. EM reported 18, but GSEA 19...??? For a third gene set: 158 x 33% = 52.14. EM reported 51, but GSEA 52. A fourth gene set: 136 x 39% = 53.04. EM reports 52, but GSEA 53. A last gene set: 43 x 51% = 21.93. EM reports 21, but GSEA 22.
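The manual check above can be written down as a small Python sketch. The numbers are the five examples from this comment; the helper name `expected_from_tags` is hypothetical, and the rounding rule is my assumption about how SIZE and %TAGS relate to the leading edge count, not a confirmed description of what GSEA or EM do.

```python
# (SIZE, %TAGS, count reported by EM, count reported by GSEA), from the examples above
examples = [
    (70, 23, 16, 18),
    (60, 32, 18, 19),
    (158, 33, 51, 52),
    (136, 39, 52, 53),
    (43, 51, 21, 22),
]


def expected_from_tags(size, tags_percent):
    """Assumed rule: leading edge count = SIZE x %TAGS, rounded to the nearest integer."""
    # None of the products here lies exactly on a .5 boundary,
    # so Python's round-half-to-even behaviour does not matter.
    return round(size * tags_percent / 100)


expected = [expected_from_tags(size, tags) for size, tags, _, _ in examples]

for (size, tags, em, gsea), exp in zip(examples, expected):
    print(f"SIZE={size:>3} TAGS={tags}% expected={exp:>2} EM={em:>2} GSEA={gsea:>2}")
```

With these numbers, the rounded product equals the GSEA count for the last four gene sets, and equals the EM count only for the first one.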

So I agree that there are issues with calculating the leading edge. For now I don't have an explanation for the (small) discrepancies...

BaderLab / EnrichmentMapApp, issue #391