Closed risserlin closed 1 year ago
Hi Ruth, I just tested this with some fake data and it seems to be working as expected. Can you please send me the original non-sorted ranks file you tried this with? Also where do you see the "weird results", is it just wrong ranks in the heatmap or something else?
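Here is the sorted ranks file (I changed the file extension to .txt so I could upload it here) - TCGA-61-2088fakeranks_sorted.txt https://github.com/BaderLab/EnrichmentMapApp/files/10115353/TCGA-61-2088fakeranks_sorted.txt
Here is the unsorted ranks file - TCGA-61-2088fakeranks_notsorted.txt https://github.com/BaderLab/EnrichmentMapApp/files/10115357/TCGA-61-2088fakeranks_notsorted.txt
GSEA enrichment results file from the sorted analysis - TCGA-61-2088_fgsea_enr_results_sorted_seed42.txt https://github.com/BaderLab/EnrichmentMapApp/files/10115365/TCGA-61-2088_fgsea_enr_results_sorted_seed42.txt
GSEA enrichment results file from the not sorted analysis - TCGA-61-2088_fgsea_enr_resultsnotsorted_seed42.txt https://github.com/BaderLab/EnrichmentMapApp/files/10115370/TCGA-61-2088_fgsea_enr_resultsnotsorted_seed42.txt
Expression file - TCGA-61-2088fakeexpression.txt https://github.com/BaderLab/EnrichmentMapApp/files/10115375/TCGA-61-2088fakeexpression.txt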
GMT file can be found here - https://download.baderlab.org/EM_Genesets/August_01_2019/Human/symbol/Human_GOBP_AllPathways_with_GO_iea_August_01_2019_symbol.gmt
Thanks! I'll take a look.
The EM results are the same for both analyses. If you click on any gene set and try to sort the heatmap by ranks from the sorted file or the unsorted file, you will see the issue.
Not sorted ranks file -
Sorted ranks file -
Looks like you're using FGSEA. The presence of the ES and NES columns in the enrichment file is tricking the data set resolver into thinking it's from GSEA. I'll have to add a check for the padj column so EM knows it's from FGSEA and not GSEA.
But I want EM to think that it is GSEA. I modified the fgsea files and computed the rank at max so I could tap into the GSEA features in EM. I only realized after submitting that the requirement for the rank file to be sorted might be GSEA specific. If that is the case, then we just need to specify it somewhere. That is why I marked it as a question and not as a bug.
Ok, but I assume you had to enter the files manually using the "..." buttons in the dialog?
No. I did it through RCy3 build command.
em_command = paste('enrichmentmap build analysisType="GSEA" ',
                   "gmtFile=", file.path(output_filepath, data_directory, basename(gmt_file)),
                   'pvalue=', pvalue_threshold,
                   'qvalue=', qvalue_threshold,
                   'similaritycutoff=', 0.375,
                   'coefficients=', "COMBINED",
                   'enrichmentsDataset1=', fakeenr_filename_host,
                   'expressionDataset1=', fakeexp_name_host,
                   'ranksDataset1=', fakernk_name_host,
                   'filterByExpressions=false',
                   sep=" ")
This looks like a bug. When the ranks file is parsed, each gene is assigned a "score", which is the actual value from the rank file, and a "rank", which is basically the position (line number) of the gene in the rank file. I'm guessing this is done because sometimes an EM network is created without a rank file, so sometimes the scores are not available? The heat map is sorting the ranks column based on the "rank", but it's showing the "score", which is why it looks broken.
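To make the mismatch concrete, here is a minimal R sketch of the behaviour described above. It is only an illustration, not EnrichmentMap's actual parser code: the gene names, the scores, and the `ranks` data frame are made up for the example.

```r
# Hypothetical illustration only; this is not EnrichmentMap's parser.
# An unsorted rank file: gene name plus the score from the file.
ranks <- data.frame(
  gene  = c("TP53", "BRCA1", "MYC", "EGFR"),
  score = c(0.5, 2.1, -1.3, 1.0),
  stringsAsFactors = FALSE
)

# "rank" as described in the comment above: the line position in the file,
# independent of the score value.
ranks$rank <- seq_len(nrow(ranks))

# Sorting by the positional rank while displaying the score leaves the
# scores in file order (0.5, 2.1, -1.3, 1.0), which looks unsorted.
by_rank <- ranks[order(ranks$rank), ]
print(by_rank[, c("gene", "score")])

# What a user would expect to see: genes ordered by descending score.
expected <- ranks[order(-ranks$score), ]
print(expected[, c("gene", "score")])
```

When the file is already sorted by score, the two orderings coincide, which would explain why the sorted file looked correct.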
But shouldn't I be able to compute the "rank" by just sorting the genes by "score" and then assigning an index to it?
I can't just fix the heat map because this mismatch of rank and score could affect other things. I think this needs to be fixed in the parser.
Ok. Now I remember all the intricacies with GSEA ranks files. The expectation is that the rank file is sorted in the order GSEA used to calculate the enrichments.
The reason for the score and the rank is linked to GSEA's leading edge. The column "Rank at max" gives us the rank of the gene where the ES score is at its maximum, and any genes with lower rank are part of the leading edge. The reason why we take the rank file from GSEA as it is and don't re-rank it is that, if there are ties in the data, changing the order could change the composition of the leading edge (even if the ranks we calculated differed only slightly). I think there were bugs where one or two genes were missing from the leading edge, and it came down to slightly different rank files.
Instead of sorting the unsorted rank file, maybe it is better to put in an alert: "Your rank file is not sorted". We can give the user the option to have EM sort it for them or keep it as is.
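A small R sketch of why tie order matters for the leading edge. Everything here is hypothetical: the `leading_edge` helper, the gene labels, and the choice to treat "Rank at max" as a 1-based, inclusive position are assumptions made for illustration, not GSEA's or EM's actual implementation.

```r
# Hypothetical helper: genes of the set that sit at or before rank_at_max
# in the ranked list (assumes a 1-based, inclusive rank and a positive ES).
leading_edge <- function(ordered_genes, gene_set, rank_at_max) {
  top <- ordered_genes[seq_len(rank_at_max)]
  top[top %in% gene_set]
}

# Suppose B and C have tied scores. GSEA happened to place B before C,
# while re-sorting the file ourselves places C before B.
gsea_order   <- c("A", "B", "C", "D", "E")
rerank_order <- c("A", "C", "B", "D", "E")

gene_set    <- c("A", "B", "E")
rank_at_max <- 2

# With GSEA's own order, both A and B fall inside the leading edge.
print(leading_edge(gsea_order, gene_set, rank_at_max))

# With the re-ranked order, B drops out although its score is unchanged.
print(leading_edge(rerank_order, gene_set, rank_at_max))
```

This is the kind of "one or two genes missing" difference that can come from a rank file ordered slightly differently from the one GSEA used.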
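On the R side, a check along these lines could flag an unsorted rank file before it is handed to EM. This is a sketch under assumptions: `check_ranks_sorted` is a hypothetical helper, "sorted" is taken to mean non-increasing score (ties allowed), and the scores are assumed to contain no missing values.

```r
# Hypothetical helper: returns TRUE when scores are in non-increasing order
# and emits a warning otherwise. Assumes there are no NA scores.
check_ranks_sorted <- function(scores) {
  # is.unsorted() tests for non-decreasing order, so negate the scores.
  sorted <- !is.unsorted(-scores)
  if (!sorted) {
    warning("Your rank file is not sorted")
  }
  sorted
}

check_ranks_sorted(c(3, 2, 2, -1))                       # sorted, no warning
suppressWarnings(check_ranks_sorted(c(0.5, 2.1, -1.3)))  # not sorted
```

The check only reports the problem and leaves the decision about re-sorting to the user, in line with the suggestion above.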
I like the idea of just warning the user. My worry about changing the way we compute ranks/scores is that it could have other effects that we aren't aware of. Basically I'm worried it could cause other bugs.
agreed. GSEA ranks and leading edge calculations are messy. Best to not tamper.
Just had weird results from EM. When I created an EM with the ranks file not sorted by the rank value, the ranks shown in the created EM did not correspond to the expected ranks.
Went back and sorted the ranks file before creating the EM and it corresponded.
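For anyone hitting the same thing, the workaround can be sketched in R as below. The assumptions are mine: `sort_rank_file` is a hypothetical helper, the rank file is tab-delimited with a header line, the gene name is in column 1 and the score in column 2, and descending score is the desired order. Ties keep their order from the input file, which may or may not match the order GSEA used.

```r
# Hypothetical helper: read a rank file, sort it by descending score,
# and write the sorted copy. Assumes a tab-delimited file with a header,
# gene in column 1 and numeric score in column 2.
sort_rank_file <- function(in_path, out_path) {
  ranks <- read.table(in_path, header = TRUE, sep = "\t", quote = "",
                      comment.char = "", stringsAsFactors = FALSE,
                      check.names = FALSE)
  # order() leaves ties in their original (file) order.
  ranks <- ranks[order(-ranks[[2]]), ]
  write.table(ranks, out_path, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(ranks)
}

# Usage with made-up data written to temporary files.
in_path  <- tempfile(fileext = ".rnk")
out_path <- tempfile(fileext = ".rnk")
writeLines(c("gene\tscore", "TP53\t0.5", "BRCA1\t2.1", "MYC\t-1.3", "EGFR\t1.0"),
           in_path)
sorted <- sort_rank_file(in_path, out_path)
print(sorted)
```

The sorted copy at `out_path` is what would then be passed as the ranks file when building the EM.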