churchmanlab / genewalk

GeneWalk identifies relevant gene functions for a biological context using network representation learning
https://churchman.med.harvard.edu/genewalk
BSD 2-Clause "Simplified" License
127 stars 14 forks

Ensembl IDs with dots cause problems #12

Closed chrarnold closed 4 years ago

chrarnold commented 4 years ago

Hi, another Ensembl ID related issue: when the IDs contain the ".X" version notation, like ".3", the mapping fails for all of them. The pipeline still runs through but produces an empty file at the end. I think this should be improved like this: 1) If none of the IDs could be mapped, abort right away. 2) For Ensembl IDs, if an ID ends with ".X", X being any integer, remove that suffix from the ID and then do the mapping.

We can of course also remove the suffixes ourselves, but then this requirement should be stated somewhere. The nicest option, of course, is to do it automatically for the user :)
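The two suggestions above can be sketched as follows. This is only an illustration, not GeneWalk's actual code: `strip_ensembl_version`, `map_ids`, and the `lookup` dictionary are hypothetical names introduced here, and the mapping table is assumed to be something the caller already has.

```python
import re

# Matches a trailing version suffix such as ".3" or ".12".
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_ensembl_version(ensembl_id):
    """Return the ID without a trailing ".X" suffix (X being any integer)."""
    return _VERSION_SUFFIX.sub("", ensembl_id.strip())


def map_ids(ensembl_ids, lookup):
    """Map Ensembl IDs via `lookup`, a hypothetical dict {Ensembl ID: target ID}.

    Returns (mapped, unmapped). Raises ValueError when there is input but
    not a single ID could be mapped, instead of silently continuing.
    """
    ensembl_ids = list(ensembl_ids)
    mapped, unmapped = {}, []
    for raw_id in ensembl_ids:
        target = lookup.get(strip_ensembl_version(raw_id))
        if target is None:
            unmapped.append(raw_id)
        else:
            mapped[raw_id] = target
    if ensembl_ids and not mapped:
        raise ValueError(
            "None of the %d input IDs could be mapped" % len(ensembl_ids)
        )
    return mapped, unmapped
```

Failing early in this way would surface the problem before the rest of the pipeline runs, rather than leaving an empty result file as the only symptom.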

chrarnold commented 4 years ago

In addition, a question: we had a list of mouse Ensembl IDs as input, and none of the 422+ IDs could be mapped, which seems a bit strange...

Examples:
WARNING: [2019-09-24 13:33:23] genewalk.gene_lists - Could not get HGNC ID for ENSEMBL ID ENSMUSG00000032501
WARNING: [2019-09-24 13:33:23] genewalk.gene_lists - Could not get HGNC ID for ENSEMBL ID ENSMUSG00000072568
WARNING: [2019-09-24 13:33:23] genewalk.gene_lists - Could not get HGNC ID for ENSEMBL ID ENSMUSG00000022346
WARNING: [2019-09-24 13:33:23] genewalk.gene_lists - Could not get HGNC ID for ENSEMBL ID ENSMUSG00000047003
WARNING: [2019-09-24 13:33:23] genewalk.gene_lists - Could not get HGNC ID for ENSEMBL ID ENSMUSG00000115388
WARNING: [2019-09-24 13:33:23] genewalk.gene_lists - Could not get HGNC ID for ENSEMBL ID ENSMUSG00000022619
WARNING: [2019-09-24 13:33:23] genewalk.gene_lists - Could not get HGNC ID for ENSEMBL ID ENSMUSG00000052560

bgyori commented 4 years ago

Ensembl ID mappings are currently only available for human genes, not mouse. I think it's a good idea for GeneWalk to implement mappings for a limited number of standard ID types but its primary role is not ID mapping. So I wouldn't want to proliferate resources and dependencies for this. It might be a better idea to first map the mouse Ensembl IDs to MGI (or to corresponding human genes) and run them through GeneWalk that way.
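As a rough illustration of the preprocessing suggested above, the sketch below separates mouse Ensembl gene IDs (prefix `ENSMUSG`) from other IDs and translates them to MGI IDs before anything is handed to GeneWalk. The function names and the `mouse_to_mgi` dictionary are hypothetical. The translation table itself is assumed to come from a source of your choice and is not something GeneWalk provides.

```python
import re

# Ensembl gene ID prefixes: "ENSG" for human, "ENSMUSG" for mouse.
_HUMAN_GENE = re.compile(r"^ENSG\d+")
_MOUSE_GENE = re.compile(r"^ENSMUSG\d+")


def species_of(ensembl_id):
    """Guess the species of an Ensembl gene ID from its prefix."""
    if _MOUSE_GENE.match(ensembl_id):
        return "mouse"
    if _HUMAN_GENE.match(ensembl_id):
        return "human"
    return "unknown"


def premap_mouse_ids(ensembl_ids, mouse_to_mgi):
    """Translate mouse Ensembl IDs to MGI IDs using a user-supplied table.

    `mouse_to_mgi` is a hypothetical dict {mouse Ensembl ID: MGI ID}.
    Returns (mgi_ids, unmapped); non-mouse IDs and mouse IDs missing from
    the table both end up in `unmapped`.
    """
    mgi_ids, unmapped = [], []
    for ensembl_id in ensembl_ids:
        mgi_id = None
        if species_of(ensembl_id) == "mouse":
            mgi_id = mouse_to_mgi.get(ensembl_id)
        if mgi_id is None:
            unmapped.append(ensembl_id)
        else:
            mgi_ids.append(mgi_id)
    return mgi_ids, unmapped
```

Checking the prefix first also makes it easy to warn when a list of mouse IDs is about to be run through a human-only mapping.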

bgyori commented 4 years ago

Stripping the .X suffix from Ensembl IDs is simple to implement and I'll add that. But again, we can't possibly account for all the shapes and forms IDs come in, so it is reasonable to expect the user to do some amount of preprocessing before passing IDs to GeneWalk.

chrarnold commented 4 years ago

Hi, Ensembl IDs are usually simple: they either carry the ".X" notation or they don't, and I cannot think of any further complications. So adding that would be a benefit. If mouse cannot be supported at the moment, I understand, but it might be good to state in the documentation that Ensembl IDs currently work only for human. Thanks for responding!