Open gaurav opened 2 years ago
Thank you Gaurav! Usually, computational scientists will use ENSEMBL:ENSP00000263368 as an input rather than ENSEMBL:ENSP00000263368.3. Currently, we cannot query NodeNorm with the CURIE ENSEMBL:ENSP00000263368. The ENSEMBL version and build seem up to date to me; it is just the query input that should be allowed (only gene and transcript IDs are supported, to my understanding). I am thinking of allowing inputs such as: ENSEMBL.GENE:ENSG0000009013 ENSEMBL.GENE:ENSG0000009013.11 ENSEMBL.TRANSCRIPT:ENST00000263368 ENSEMBL.TRANSCRIPT:ENST00000263368.9 ENSEMBL.PROTEIN:ENSP00000263368 ENSEMBL.PROTEIN:ENSP00000263368.3
Thanks!
Note that this is also hurting us in terms of bringing stringdb into yeast robokop @beasleyjonm
Yes, this issue actually comes from our STRING onboarding update, where we are trying to refine our mapping. STRING keys are Ensembl protein IDs without versioning (IDs of the form ENSP00000263368). The aliases they provide in their interactions table have mixed CURIEs (usually the gene name, but sometimes replaced by the ENSG ID when they could not map the gene). They also provide a big alias mapping file that contains their ID choice and the corresponding names for each data source. For Node Normalizer, a quick fix for now might be to use the Ensembl protein IDs that UniProt provides, if you already have UniProt onboarded.
I'm not sure if this is related, but ENSEMBL:ENSP00000331748 is missing from NodeNorm dev but is correctly normalized in NodeNorm prod. I'm guessing this is because of version issues, but there may of course be other reasons.
I've set up a new NodeNorm at https://nodenormalization-dev.apps.renci.org/docs based on Babel 2022oct13, which contains the fixes I've added in PR #79. ENSEMBL:ENSP00000263368 is now correctly included with NCBIGene:645 "BLVRB" with conflation and UniProtKB:P30043 "BLVRB_HUMAN Flavin reductase (NADPH) (sprot)" without conflation. Please try out this service and see if other ENSEMBL-related identifiers are resolved as you expect!
It looks like there is a need for additional ENSEMBL identifiers: I'll track that in https://github.com/TranslatorSRI/Babel/issues/84
@sandrine-m Adding additional ENSEMBL prefixes (i.e. ENSEMBL.GENE, ENSEMBL.TRANSCRIPT, etc.) is a decision that will need to be made by the Biolink model maintainers (https://github.com/biolink/biolink-model). Since the ENSEMBL identifiers are distinct between gene/transcript/protein identifiers, I'd be inclined to use ENSEMBL: as the common prefix for all ENSEMBL identifiers.
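To illustrate why a single ENSEMBL: prefix can still distinguish genes, transcripts and proteins, here is a minimal Python sketch that reads the feature type off the identifier itself. It assumes the Ensembl stable ID layout (ENS, an optional species prefix that is empty for human, a feature-type letter, an eleven-digit number, and an optional .version suffix). The function name `ensembl_feature_type` is hypothetical and is not part of NodeNorm, Babel or Biolink.

```python
import re

# Assumed Ensembl stable ID layout: "ENS" + optional species prefix
# (empty for human, e.g. "MUS" for mouse) + feature letter + 11 digits
# + optional ".version". Only gene/transcript/protein are handled here.
_STABLE_ID = re.compile(
    r"^ENS(?P<species>[A-Z]*?)(?P<feature>[GTP])(?P<number>\d{11})"
    r"(?:\.(?P<version>\d+))?$"
)
_FEATURE_NAMES = {"G": "gene", "T": "transcript", "P": "protein"}


def ensembl_feature_type(curie):
    """Hypothetical helper: return 'gene', 'transcript' or 'protein'
    for an ENSEMBL CURIE, or None if the CURIE is not recognised."""
    prefix, _, local_id = curie.partition(":")
    if prefix != "ENSEMBL":
        return None
    match = _STABLE_ID.match(local_id)
    if match is None:
        return None
    return _FEATURE_NAMES[match.group("feature")]


if __name__ == "__main__":
    for curie in ("ENSEMBL:ENSP00000263368", "ENSEMBL:ENST00000263368.9"):
        print(curie, ensembl_feature_type(curie))
```

Since the feature type is recoverable this way, separate ENSEMBL.GENE / ENSEMBL.TRANSCRIPT / ENSEMBL.PROTEIN prefixes would mostly add redundancy, though that remains a call for the Biolink maintainers.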
Thank you @gaurav!!
On NodeNorm, ENSEMBL:ENSP00000263368 exists as a single-identifier clique. However, version 3 of this identifier, ENSEMBL:ENSP00000263368.3, is correctly a part of NCBIGene:645. We should probably map ENSEMBL identifiers to their most recent version. This seems to be okay according to ENSEMBL's stable ID information.
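As a rough sketch of what mapping to the most recent version could look like, the Python below splits a versioned ENSEMBL CURIE into its unversioned form and version number, then builds a lookup from each unversioned CURIE to the highest version seen. The helper names `split_ensembl_version` and `latest_versions` are hypothetical; this is not how Babel currently does it, only one possible approach.

```python
import re

# Matches a versioned ENSEMBL CURIE such as ENSEMBL:ENSP00000263368.3.
_VERSIONED = re.compile(r"^(?P<stable>ENSEMBL:ENS[A-Z]*\d+)\.(?P<version>\d+)$")


def split_ensembl_version(curie):
    """Hypothetical helper: return (unversioned_curie, version),
    where version is None if the CURIE has no version suffix."""
    match = _VERSIONED.match(curie)
    if match is None:
        return curie, None
    return match.group("stable"), int(match.group("version"))


def latest_versions(curies):
    """Hypothetical helper: map each unversioned CURIE to the versioned
    CURIE with the highest version number among the inputs."""
    latest = {}
    for curie in curies:
        stable, version = split_ensembl_version(curie)
        if version is None:
            continue  # nothing to map for unversioned input
        if stable not in latest or version > latest[stable][0]:
            latest[stable] = (version, curie)
    return {stable: curie for stable, (_, curie) in latest.items()}


if __name__ == "__main__":
    print(split_ensembl_version("ENSEMBL:ENSP00000263368.3"))
```

With such a lookup, a query for ENSEMBL:ENSP00000263368 could be redirected to the clique that contains ENSEMBL:ENSP00000263368.3 instead of landing in a single-identifier clique.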