BaderLab / EnrichmentMapApp

The EnrichmentMap Cytoscape App allows you to visualize the results of gene-set enrichment as a network.
http://apps.cytoscape.org/apps/enrichmentmap
GNU Lesser General Public License v2.1

Is it the expectation of EM that the ranks file is sorted? #500

Closed risserlin closed 1 year ago

risserlin commented 1 year ago

Just had weird results from EM. When I created an EM with a ranks file that was not sorted by rank value, the ranks in the created EM did not correspond to the expected ranks.
Went back and sorted the ranks file before creating the EM and it corresponded.

mikekucera commented 1 year ago

Hi Ruth, I just tested this with some fake data and it seems to be working as expected. Can you please send me the original non-sorted ranks file you tried this with? Also where do you see the "weird results", is it just wrong ranks in the heatmap or something else?

risserlin commented 1 year ago

here is the sorted ranks file (I changed the file ending to txt so I could upload it here) - TCGA-61-2088fakeranks_sorted.txt

here is the unsorted rank file - TCGA-61-2088fakeranks_notsorted.txt

GSEA enrichment results file from sorted analysis - TCGA-61-2088_fgsea_enr_results_sorted_seed42.txt

GSEA enrichment results file from not sorted analysis - TCGA-61-2088_fgsea_enr_resultsnotsorted_seed42.txt

expression file - TCGA-61-2088fakeexpression.txt

GMT file can be found here - https://download.baderlab.org/EM_Genesets/August_01_2019/Human/symbol/Human_GOBP_AllPathways_with_GO_iea_August_01_2019_symbol.gmt

mikekucera commented 1 year ago

Thanks! I'll take a look.


risserlin commented 1 year ago

The EM results are the same for both analyses. If you click on any gene set and try to sort the heatmap by ranks from the sorted file or the unsorted file, you will see the issue.

Not sorted ranks file -

Screen Shot 2022-11-29 at 12 09 55 PM

Sorted ranks file

Screen Shot 2022-11-29 at 12 10 41 PM

mikekucera commented 1 year ago

Looks like you're using FGSEA. The presence of the ES and NES columns in the enrichment file is tricking the data set resolver into thinking it's from GSEA. I'll have to add a check for the padj column so EM knows it's from FGSEA and not GSEA.
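To illustrate the kind of check described above, here is a minimal Python sketch. It is not EM's resolver code: the function name `guess_enrichment_format` is hypothetical, and it assumes a tab-separated header in which FGSEA output carries a `padj` column while both tools carry `ES` and `NES`.

```python
def guess_enrichment_format(header_line):
    """Guess the source tool from the tab-separated header of an enrichment file.

    Assumption: a padj column marks FGSEA output; ES and NES alone suggest GSEA.
    """
    columns = {name.strip().lower() for name in header_line.split("\t")}
    # Check padj first, because FGSEA files also contain ES and NES.
    if "padj" in columns:
        return "FGSEA"
    if {"es", "nes"} <= columns:
        return "GSEA"
    return "UNKNOWN"


fgsea_header = "pathway\tpval\tpadj\tES\tNES\tsize"
gsea_header = "NAME\tSIZE\tES\tNES\tNOM p-val\tFDR q-val\tRANK AT MAX"
print(guess_enrichment_format(fgsea_header))
print(guess_enrichment_format(gsea_header))
```

The order of the two tests matters: testing for ES and NES first would reproduce the misclassification described in this comment.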

risserlin commented 1 year ago

But I want EM to think that it is GSEA. I modified the fgsea files and computed the rank at max so I could tap into the GSEA features in EM. I only realized after submitting that the requirement for the rank file to be sorted might be GSEA specific. If that is the case, then we just need to specify it somewhere. That is why I marked it as a question and not as a bug.

mikekucera commented 1 year ago

Ok, but I assume you had to enter the files manually using the "..." buttons in the dialog?

risserlin commented 1 year ago

No. I did it through RCy3 build command.

# Build the EnrichmentMap command string; sep=" " puts a space between every pasted piece.
em_command = paste('enrichmentmap build analysisType="GSEA" ',
                   "gmtFile=", file.path(output_filepath, data_directory, basename(gmt_file)),
                   'pvalue=', pvalue_threshold,
                   'qvalue=', qvalue_threshold,
                   'similaritycutoff=', 0.375,
                   'coefficients=', "COMBINED",
                   'enrichmentsDataset1=', fakeenr_filename_host,
                   'expressionDataset1=', fakeexp_name_host,
                   'ranksDataset1=', fakernk_name_host,
                   'filterByExpressions=false',
                   sep=" ")

mikekucera commented 1 year ago

This looks like a bug. When the ranks file is parsed, each gene is assigned a "score", which is the actual value from the rank file, and a "rank", which is basically the position (line number) of the gene in the rank file. I'm guessing this is done because sometimes an EM network is created without a rank file, so sometimes the scores are not available? The heat map sorts the ranks column based on the "rank" but displays the "score", which is why it looks broken.

But shouldn't I be able to compute the "rank" by just sorting the genes by "score" and then assigning an index to it?

I can't just fix the heat map because this mismatch of rank and score could affect other things. I think this needs to be fixed in the parser.
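A small Python sketch of the mismatch, under stated assumptions: it is not the EM parser, the helper names `parse_ranks` and `ranks_from_scores` are hypothetical, the file is assumed to be two tab-separated columns (gene, score), and "sorted" is assumed to mean descending score.

```python
def parse_ranks(lines):
    """Return {gene: (score, rank)} where rank is the 1-based line position."""
    parsed = {}
    for position, line in enumerate(lines, start=1):
        gene, score = line.split("\t")
        parsed[gene] = (float(score), position)
    return parsed


def ranks_from_scores(parsed):
    """Recompute ranks by sorting genes by descending score (assumed order)."""
    ordered = sorted(parsed, key=lambda gene: parsed[gene][0], reverse=True)
    return {gene: index for index, gene in enumerate(ordered, start=1)}


unsorted_lines = ["GENE_B\t0.5", "GENE_A\t2.0", "GENE_C\t-1.3"]
parsed = parse_ranks(unsorted_lines)

# Sorting by line-position rank but displaying the score gives 0.5, 2.0, -1.3.
by_position = sorted(parsed, key=lambda gene: parsed[gene][1])
displayed_scores = [parsed[gene][0] for gene in by_position]

# Ranks derived from the scores instead of from the line positions.
by_score = ranks_from_scores(parsed)
```

With the unsorted input, `displayed_scores` comes out in file order rather than score order, which is the symptom seen in the heat map. Note that `sorted` is stable, so tied scores keep their file order in `ranks_from_scores`.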

risserlin commented 1 year ago

Ok. Now I remember all the intricacies with GSEA ranks files. The expectation is the rank file is sorted in the order GSEA used it to calculate the enrichments.

The reason for the score and the rank is linked to GSEA's leading edge. The column "Rank at max" gives us the rank of the gene where the ES is at its maximum, and any genes with lower rank are part of the leading edge. The reason why we take the rank file from GSEA as it is and don't re-rank it is that, if there are ties in the data, changing the order could change the composition of the leading edge (even if the ranks we calculated differed only slightly). I think that there were bugs where one or two genes were missing from the leading edge and it came down to slightly different rank files.
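The tie problem can be shown with a toy Python example. This is only a sketch: the function `leading_edge` is hypothetical, it assumes a positive ES and a 1-based, inclusive "Rank at max", and the genes and scores are made up.

```python
def leading_edge(ranked_genes, gene_set, rank_at_max):
    """Gene-set members at or before rank_at_max (1-based) in the ranked list."""
    return [gene for gene in ranked_genes[:rank_at_max] if gene in gene_set]


# Made-up scores; G2 and G3 are tied.
scores = {"G1": 3.0, "G3": 1.5, "G2": 1.5, "G4": -0.5}
gene_set = {"G1", "G3"}
rank_at_max = 2

# Order assumed to be the one GSEA used: G3 comes before the tied G2.
gsea_order = ["G1", "G3", "G2", "G4"]

# Re-sorting by descending score and breaking ties by name puts G2 before G3.
resorted = sorted(scores, key=lambda gene: (-scores[gene], gene))

print(leading_edge(gsea_order, gene_set, rank_at_max))
print(leading_edge(resorted, gene_set, rank_at_max))
```

Both orderings are valid sorts of the same scores, yet G3 drops out of the leading edge after re-sorting, because the tie is broken differently from the order GSEA used.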

Instead of sorting the unsorted rank file, maybe it is better to put in an alert: "Your rank file is not sorted". We can give the user the option to have EM sort it for them or keep it as is.
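A possible shape for such a check, sketched in Python. The function names are hypothetical, it assumes "sorted" means non-increasing score, and it leaves the choice of what to do next to the caller.

```python
def is_sorted_descending(scores):
    """True if each score is >= the next one; ties are allowed in any order."""
    return all(a >= b for a, b in zip(scores, scores[1:]))


def check_ranks_order(scores):
    """Return a warning message for an unsorted rank file, or None if sorted."""
    if is_sorted_descending(scores):
        return None
    return "Your rank file is not sorted"


print(check_ranks_order([2.0, 1.5, 1.5, -0.5]))
print(check_ranks_order([0.5, 2.0, -1.3]))
```

Because the comparison is `>=`, a file with ties passes the check in whatever order the ties appear, so an order produced by GSEA would not trigger the warning or be touched.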

mikekucera commented 1 year ago

I like the idea of just warning the user. My worry about changing the way we compute ranks/scores is that it could have other effects that we aren't aware of. Basically I'm worried it could cause other bugs.

risserlin commented 1 year ago

Agreed. GSEA ranks and leading edge calculations are messy. Best not to tamper.