Extend ENSEMBL identifiers so they work without a version identifier

TranslatorSRI / Babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.

MIT License


Extend ENSEMBL identifiers so they work without a version identifier #72

Open gaurav opened 2 years ago

gaurav commented 2 years ago

On NodeNorm, ENSEMBL:ENSP00000263368 exists as a single-identifier clique. However, version 3 of this identifier, ENSEMBL:ENSP00000263368.3, is correctly a part of NCBIGene:645. We should probably map ENSEMBL identifiers to their most recent version. This seems to be okay according to ENSEMBL's stable ID information.

[ ] Make sure we get ENSEMBL:ENSP00000331748 to work on NodeNorm dev (where it is currently missing).
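A minimal sketch of the mapping proposed above, assuming versioned identifiers take the form `<stable ID>.<integer version>`. The function names (`split_version`, `latest_version_index`, `resolve`) are hypothetical and are not taken from Babel; this only illustrates the idea of resolving an unversioned identifier to its most recent known version.

```python
import re

# Assumption: a versioned CURIE is the stable ID followed by ".<digits>".
VERSIONED_ID = re.compile(r"^(?P<stable>[^.]+)(?:\.(?P<version>\d+))?$")


def split_version(curie):
    """Split a CURIE into (unversioned CURIE, version or None)."""
    match = VERSIONED_ID.match(curie)
    if match is None:
        raise ValueError(f"Unexpected identifier format: {curie!r}")
    version = match.group("version")
    return match.group("stable"), (int(version) if version is not None else None)


def latest_version_index(curies):
    """Map each unversioned CURIE to the highest-versioned CURIE seen."""
    latest = {}
    for curie in curies:
        stable, version = split_version(curie)
        if version is None:
            continue  # nothing to index for an already-unversioned ID
        if stable not in latest or version > latest[stable][0]:
            latest[stable] = (version, curie)
    return {stable: curie for stable, (_, curie) in latest.items()}


def resolve(curie, index):
    """Resolve an unversioned CURIE to its latest version, if one is known."""
    stable, version = split_version(curie)
    if version is None:
        return index.get(stable, curie)
    return curie
```

With an index built from the versioned identifiers already present in a clique, `resolve("ENSEMBL:ENSP00000263368", index)` would return the versioned form, while identifiers with no known version are passed through unchanged.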

sandrine-m commented 2 years ago

Thank you Gaurav! Computational scientists will usually provide ENSEMBL:ENSP00000263368 as input rather than ENSEMBL:ENSP00000263368.3. Currently, we cannot query NodeNorm with the unversioned CURIE ENSEMBL:ENSP00000263368. The ENSEMBL version and build seem up to date to me; it is just the query input that should be allowed (only gene and transcript IDs are supported, to my understanding). I am thinking of allowing inputs such as: ENSEMBL.GENE:ENSG0000009013, ENSEMBL.GENE:ENSG0000009013.11, ENSEMBL.TRANSCRIPT:ENST00000263368, ENSEMBL.TRANSCRIPT:ENST00000263368.9, ENSEMBL.PROTEIN:ENSP00000263368, ENSEMBL.PROTEIN:ENSP00000263368.3

Thanks!

cbizon commented 2 years ago

Note that this is also hurting us in terms of bringing stringdb into yeast robokop @beasleyjonm

sandrine-m commented 2 years ago

Yes, this issue actually comes from our STRING onboarding update, where we are trying to refine our mapping. STRING keys are Ensembl protein IDs without versioning (IDs of the form ENSP00000263368). The aliases they provide in their interactions table use mixed CURIEs (usually the gene name, but sometimes replaced by the ENSG ID when they could not map the gene). They also provide a large alias mapping file that contains their chosen ID and the corresponding names for each data source. For Node Normalizer, a quick fix for now might be to use the Ensembl protein IDs that UniProt provides, if you already have UniProt onboarded.

gaurav commented 2 years ago

I'm not sure if this is related, but ENSEMBL:ENSP00000331748 is missing from NodeNorm dev but is correctly normalized in NodeNorm prod. I'm guessing this is because of version issues, but there may of course be other reasons.

gaurav commented 2 years ago

I've set up a new NodeNorm at https://nodenormalization-dev.apps.renci.org/docs based on Babel 2022oct13, which contains the fixes I've added in PR #79. ENSEMBL:ENSP00000263368 is now correctly included with NCBIGene:645 "BLVRB" with conflation and UniProtKB:P30043 "BLVRB_HUMAN Flavin reductase (NADPH) (sprot)" without conflation. Please try out this service and see if other ENSEMBL-related identifiers are resolved as you expect!
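For anyone who wants to script this check, here is a rough sketch that builds a lookup URL for the dev instance. It assumes the service exposes a `get_normalized_nodes` endpoint accepting repeated `curie` parameters and a `conflate` flag; please confirm the exact parameters against the `/docs` page linked above. The helper name `normalized_nodes_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base URL taken from the comment above; the endpoint path and the
# parameter names below are assumptions to verify against /docs.
BASE_URL = "https://nodenormalization-dev.apps.renci.org"


def normalized_nodes_url(curies, conflate=True):
    """Build a GET URL that looks up one or more CURIEs."""
    params = [("curie", curie) for curie in curies]
    params.append(("conflate", "true" if conflate else "false"))
    return f"{BASE_URL}/get_normalized_nodes?{urlencode(params)}"


if __name__ == "__main__":
    # Print the URLs for the conflated and non-conflated lookups.
    print(normalized_nodes_url(["ENSEMBL:ENSP00000263368"], conflate=True))
    print(normalized_nodes_url(["ENSEMBL:ENSP00000263368"], conflate=False))
```

Fetching the two printed URLs should make it easy to compare the conflated and non-conflated results described above.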

It looks like there is a need for additional ENSEMBL identifiers: I'll track that in https://github.com/TranslatorSRI/Babel/issues/84

@sandrine-m Adding additional ENSEMBL prefixes (e.g. ENSEMBL.GENE, ENSEMBL.TRANSCRIPT, etc.) is a decision that will need to be made by the Biolink model maintainers (https://github.com/biolink/biolink-model). Since ENSEMBL gene, transcript, and protein identifiers are already distinct from one another, I'd be inclined to use ENSEMBL: as the common prefix for all ENSEMBL identifiers.
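To illustrate why a common prefix can be enough: the feature type can be read from the identifier itself. This is only a sketch, assuming human-style identifiers beginning with ENSG, ENST, or ENSP as in the examples in this thread; other species and feature types are not handled, and the function name `feature_type` is hypothetical.

```python
import re

# Assumption: only the human-style ENSG/ENST/ENSP forms seen in this
# thread are covered; an optional ".<version>" suffix is tolerated.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein"}
ENSEMBL_CURIE = re.compile(r"^ENSEMBL:ENS(?P<feature>[GTP])\d+(?:\.\d+)?$")


def feature_type(curie):
    """Return 'gene', 'transcript', or 'protein', or None if unrecognized."""
    match = ENSEMBL_CURIE.match(curie)
    if match is None:
        return None
    return FEATURE_TYPES[match.group("feature")]
```

For example, `feature_type("ENSEMBL:ENSP00000263368.3")` gives `"protein"` without needing a separate ENSEMBL.PROTEIN prefix.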

sandrine-muller-research commented 2 years ago

Thank you @gaurav!!