Mastermap with enrichmentmap of Great results: gs_size should be changed to be TotalGenes instead of ObsGenes TotalGenes

BaderLab / EnrichmentMapApp

The EnrichmentMap Cytoscape App allows you to visualize the results of gene-set enrichment as a network.

http://apps.cytoscape.org/apps/enrichmentmap

GNU Lesser General Public License v2.1

Mastermap with enrichmentmap of Great results: gs_size should be changed to be TotalGenes instead of ObsGenes TotalGenes #495

Closed by veroniquevoisin 1 year ago

veroniquevoisin commented 1 year ago

If we create a master map in EnrichmentMap from 3 GREAT result sets, gs_size appears to come from the ObsGenes value of dataset1; if dataset1 has no result for a pathway, it comes from dataset2, and so on. ObsGenes can differ between datasets because it is the number of genes that overlap between the sample and the pathway. gs_size should instead use TotalGenes, which is the total number of genes in the original pathway (not the overlap). It would then match gs_size in GSEA and, more importantly, be the same in every dataset used to build the enrichment map.

veroniquevoisin commented 1 year ago

On the same note, the node column 'Enrichmentmap Genes' is not fully accurate: it sometimes lists the genes from dataset1, and sometimes those from dataset2 if the pathway is not enriched in dataset1. Ideally, we should have a Genes column for each dataset (as we do for FDR, for example), but I don't know how that would affect other features such as post-analysis.

risserlin commented 1 year ago

Post-analysis has an option to specify which dataset to use (although it is specific to the rank file), so it should be able to work with a per-dataset set of genes if that column were added.

mikekucera commented 1 year ago

The ObsGenes and TotalGenes columns are not being used for the gs_size value. The gs_size value is calculated by EM to be the size of the union of the genesets with the same name across all datasets. That's why it appears as if it uses the size from dataset2 when there are no results from dataset1: it's actually the size of the union of the empty geneset from dataset1 with the geneset from dataset2.
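The union behaviour described above can be illustrated with a minimal Python sketch. This is not EnrichmentMap's actual code; the function name `union_gs_size`, the dataset names, and the gene lists are all hypothetical and exist only to show why the value can look like it was taken from dataset2.

```python
def union_gs_size(genesets_by_dataset):
    """Return the size of the union of one gene set's members across datasets."""
    union = set()
    for genes in genesets_by_dataset.values():
        union |= set(genes)
    return len(union)


# Hypothetical pathway as it appears in three result sets (made-up genes).
pathway = {
    "dataset1": set(),                    # pathway has no result in dataset1
    "dataset2": {"TP53", "MDM2", "ATM"},
    "dataset3": {"TP53", "CHEK2"},
}

# Per-dataset sizes differ, but a single union-based size is reported.
per_dataset_sizes = {name: len(genes) for name, genes in pathway.items()}
gs_size = union_gs_size(pathway)

print(per_dataset_sizes)
print(gs_size)
```

With only dataset1 (empty) and dataset2 present, the union has exactly the size of the dataset2 gene set, which matches the behaviour reported in this issue.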

It seems the only way to address this would be to have separate gs_size and Genes columns for each dataset. That would add a lot of extra columns and increase the size of the session file. I would hesitate to do that unless it's really needed. Please let me know.

veroniquevoisin commented 1 year ago

OK, I understand now. There are definitely pros and cons, and I totally understand the need to keep things simple when possible. As it stands, gs_size would only serve as a visual aid for identifying which nodes contain the largest number of genes. It is not an accurate gene-set size, though, and it can't be used for further analysis. Is this stated in the user manual? I don't think we need to make the changes now, as we can actually get the accurate information from the heatmap if 2 changes can be made: change dummy with the dataset name and change 0.25 to 1 when the heatmap is exported. This way we can export the file and get the numbers and genes if needed for our analysis. We could test that and see whether further changes are needed for the master maps in the future. Let me know what you think.
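As a rough sketch of how such an export could be used downstream, the Python snippet below recovers per-dataset gene lists and counts. The file layout is an assumption, not the documented EnrichmentMap export format: it supposes a tab-separated table with a `Gene` column and one column per dataset, where a value of 1 marks a gene present in that dataset. The function name `genes_per_dataset` and the example rows are hypothetical.

```python
import csv
import io


def genes_per_dataset(tsv_text, gene_column="Gene"):
    """Collect, for each dataset column, the genes whose cell value equals 1."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    dataset_columns = [c for c in reader.fieldnames if c != gene_column]
    genes = {name: [] for name in dataset_columns}
    for row in reader:
        for name in dataset_columns:
            cell = (row[name] or "").strip()
            # Assumed convention: 1 means the gene belongs to this dataset.
            if cell and float(cell) == 1.0:
                genes[name].append(row[gene_column])
    return genes


# Made-up export content using the assumed layout.
example_export = (
    "Gene\tdataset1\tdataset2\n"
    "TP53\t1\t1\n"
    "MDM2\t\t1\n"
    "ATM\t1\t\n"
)

result = genes_per_dataset(example_export)
counts = {name: len(members) for name, members in result.items()}
print(result)
print(counts)
```

If the real export uses different column names or markers, only the `gene_column` argument and the comparison against 1 would need adjusting.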

mikekucera commented 1 year ago

"change dummy with the dataset name and change 0.25 to 1 when the heatmap is exported. This way we can export the file and get the numbers and genes if needed for our analysis"

I think this is a great idea, and very easy to implement. Only catch is you would have to re-create any networks you need this change for, because the dummy expressions are created when the network is created.

veroniquevoisin commented 1 year ago

"Only catch is you would have to re-create any networks you need this change for, because the dummy expressions are created when the network is created.": ok, I understand. I think that's ok.