Be forgiving on prefixes? was: Missing CUI's (UMLS)

dkoslicki commented 4 years ago

It appears that the node normalizer cannot currently handle CUIs such as CUI:C0017601 (Glaucoma).

For example:

curl -X GET "https://nodenormalization-sri.renci.org/get_normalized_nodes?curie=CUI%3AC0017601" -H "accept: application/json"
No matches found for the specified curie(s).

While our own internal "normalizer" finds this equivalent to: ['MESH:D005901', 'LNC:MTHU020819', 'MEDDRA:10018304', 'DOID:1686', 'NCIT:C26782', 'MEDLINEPLUS:42', 'NCI_CTCAE:E10392', 'CCS:88', 'CUI:C1962986', 'NCI_NICHD:C26782', 'ICD10:H40', 'SNOMEDCT:23986001', 'HPO:HP%3A0000501', 'OMIM:MTHU004639', 'ICD10CM:H40', 'NCI_NCI-GLOSS:CDR0000534224', 'NCI_CTCAE_5:C55842', 'ICD10:H40-H42.9', 'CHV:0000005514', 'NCI_CTCAE_3:C55842', 'HP:0000501', 'NCI_FDA:1875', 'EFO:0000516', 'MEDCIN:30746', 'LNC:LA16302-4', 'CUI:C0017601', 'MONDO:0005041']

cbizon commented 4 years ago

Hi @dkoslicki ,

It does, but it uses the prefix "UMLS":

curl -X GET "https://nodenormalization-sri.renci.org/get_normalized_nodes?curie=UMLS%3AC0017601" -H "accept: application/json"

{"UMLS:C0017601":{"id":{"identifier":"MONDO:0005041","label":"glaucoma (disease)"},"equivalent_identifiers":[{"identifier":"MONDO:0005041","label":"glaucoma (disease)"},{"identifier":"DOID:1686"},{"identifier":"EFO:0000516","label":"glaucoma"},{"identifier":"UMLS:C0017601"},{"identifier":"MESH:D005901"},{"identifier":"NCIT:C26782"},{"identifier":"SNOMEDCT:23986001"},{"identifier":"HP:0000501","label":"Glaucoma"}],"type":["disease","named_thing","biological_entity","disease_or_phenotypic_feature"]}}
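For readers comparing this with their own synonym lists, here is a minimal Python sketch of reading that response. It only parses the JSON body shown above (pasted in as a string, so no network call is made); the `summarize` helper is a hypothetical name, not part of the normalizer.

```python
import json

# Response body copied verbatim from the call above.
RESPONSE = '''{"UMLS:C0017601":{"id":{"identifier":"MONDO:0005041","label":"glaucoma (disease)"},"equivalent_identifiers":[{"identifier":"MONDO:0005041","label":"glaucoma (disease)"},{"identifier":"DOID:1686"},{"identifier":"EFO:0000516","label":"glaucoma"},{"identifier":"UMLS:C0017601"},{"identifier":"MESH:D005901"},{"identifier":"NCIT:C26782"},{"identifier":"SNOMEDCT:23986001"},{"identifier":"HP:0000501","label":"Glaucoma"}],"type":["disease","named_thing","biological_entity","disease_or_phenotypic_feature"]}}'''


def summarize(response_text, curie):
    """Hypothetical helper: return the preferred identifier and the sorted
    set of prefixes among the equivalent identifiers for one queried CURIE."""
    entry = json.loads(response_text)[curie]
    preferred = entry["id"]["identifier"]
    prefixes = sorted(
        {e["identifier"].split(":", 1)[0] for e in entry["equivalent_identifiers"]}
    )
    return preferred, prefixes


preferred, prefixes = summarize(RESPONSE, "UMLS:C0017601")
print(preferred)  # MONDO:0005041
print(prefixes)   # ['DOID', 'EFO', 'HP', 'MESH', 'MONDO', 'NCIT', 'SNOMEDCT', 'UMLS']
```

The prefix list makes it easy to see which of the identifiers in the original report (OMIM, ICD10, and so on) are not yet covered.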

I think that's the biolink-preferred prefix? https://biolink.github.io/biolink-model/docs/Disease.html

That said, there are a number of other identifiers in your list that the normalizer isn't using right now (OMIM, ICD, others).

Is there a most important thing to add from your perspective?

cbizon commented 4 years ago

Oh forgot to mention, you can see which prefixes have been folded in with this call:

curl -X GET "https://nodenormalization-sri.renci.org/get_curie_prefixes?semantictype=disease" -H "accept: application/json"

dkoslicki commented 4 years ago

@cbizon

> I think that's the biolink-preferred prefix? https://biolink.github.io/biolink-model/docs/Disease.html

Is the intent for the normalizer to handle only Biolink CURIEs? It seems useful for it to normalize not just by presenting preferred Biolink CURIEs, but also by mapping other CURIEs to Biolink ones. E.g., many teams voted "should" and "can" for KPs using Biolink CURIE prefixes, so I get the sense that not all CURIEs we will be seeing are from Biolink.

I also ask since I was about to hit the normalizer with an example curie from each of our 186 curie prefix types in ARAX/KG2 to test the coverage of the normalizer.

> Is there a most important thing to add from your perspective?

Not particularly from my perspective, as we are in the process of integrating the SRI normalizer into our system with the goal of relying on that instead of our own normalizer/synonymizer (known as KGNodeIndex in previous link).

> Oh forgot to mention, you can see which prefixes have been folded in with this call

Here again, teams voted mostly "should" and "can" for KPs returning node semantic types from Biolink, so while it's nice that you can pull all prefixes in the normalizer for Biolink semantic types with that curl call, KPs may not be returning anything that maps directly to a given Biolink semantic type.

cbizon commented 4 years ago

Fair points - it's built now to handle only biolink prefixes, but I think it makes a lot of sense to do some mapping of inputs to be more forgiving for the reasons you mention.

That said, I'm not 100% sure how to generate / maintain such a list. We could do something like pull identifiers.org, but (for instance) I don't think it recognizes CUI as a prefix either.

We could require a context file from the caller for non-biolink prefixes providing the full expansion (this would be the 'right' semantic-web way to do this I think) but practically I doubt that many API writers are providing this.

I think that puts us in the realm of hand curating that list. I wonder if it's part of biolink model or a separate thing...

Anyway, can I ask that as you find prefixes that you are missing that you add them to this or other issues on this repo? And any info you can provide on what that prefix expands to will be super-helpful in generating the initial version of this translation layer.

saramsey commented 4 years ago

Yes, I am working on aligning all of the CURIE prefixes in KG2 with Biolink (see issues 520, 747, and 777 on the RTX repo).

cbizon commented 4 years ago

Thinking a touch more about this... If KPs agree to provide biolink-model prefixes only, then is this still a good idea?

If it is, then I think there are two approaches. The first is to create a mapping file ("KEGG.COMPOUND->KEGG", "PUBCHEM->PUBCHEM.COMPOUND"). I'm not excited about this, partly because of the maintenance burden, but also because we might get it wrong.
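To make the mapping-file idea concrete, here is a rough Python sketch, not a proposal for the actual implementation. The alias table and the `translate_prefix` name are hypothetical; the CUI entry comes from this issue and the other two from the example pairs just mentioned.

```python
# Hypothetical alias table (incoming prefix -> prefix the normalizer expects).
# A real table would need hand curation and could easily contain mistakes.
PREFIX_ALIASES = {
    "CUI": "UMLS",
    "KEGG.COMPOUND": "KEGG",
    "PUBCHEM": "PUBCHEM.COMPOUND",
}


def translate_prefix(curie, aliases=PREFIX_ALIASES):
    """Rewrite a CURIE's prefix if it has a known alias; otherwise return it unchanged."""
    # Split on the first colon only, so local IDs containing ':' survive intact.
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        raise ValueError(f"not a CURIE: {curie!r}")
    return f"{aliases.get(prefix, prefix)}:{local_id}"


print(translate_prefix("CUI:C0017601"))   # UMLS:C0017601
print(translate_prefix("MONDO:0005041"))  # MONDO:0005041
```

Unknown prefixes pass through untouched, so the layer would only ever make the service more forgiving, at the cost of keeping the table correct.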

Would it make sense to have a parameter that takes a jsonld context file defining what each prefix maps to? It would default to the biolink jsonld if no context is given, and also for prefixes that are not part of the provided context.
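As a sketch of how the context idea might work, assuming the context is reduced to a plain prefix-to-namespace dictionary: the caller's context is consulted first, then the default. The namespace IRIs below are placeholders for illustration only, not the real Biolink context, and both function names are hypothetical.

```python
# Placeholder namespaces, NOT the real Biolink JSON-LD context.
DEFAULT_CONTEXT = {
    "UMLS": "https://example.org/umls/",
    "MONDO": "https://example.org/mondo/",
}


def expand_curie(curie, caller_context=None, default_context=DEFAULT_CONTEXT):
    """Expand a CURIE to a full IRI, trying the caller's context first and
    falling back to the default context for prefixes the caller did not define."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        raise ValueError(f"not a CURIE: {curie!r}")
    for context in (caller_context or {}, default_context):
        if prefix in context:
            return context[prefix] + local_id
    return None  # prefix unknown to both contexts


def contract_iri(iri, context=DEFAULT_CONTEXT):
    """Contract a full IRI back to a CURIE using the default context's prefixes."""
    for prefix, namespace in context.items():
        if iri.startswith(namespace):
            return f"{prefix}:{iri[len(namespace):]}"
    return None


# A caller that uses "CUI" declares what it expands to in its own context.
caller_context = {"CUI": "https://example.org/umls/"}
iri = expand_curie("CUI:C0017601", caller_context)
print(contract_iri(iri))  # UMLS:C0017601
```

Because the caller states what its prefix expands to, nothing has to be guessed on the server side; the trade-off is that API writers would actually have to supply the context.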

@saramsey @dkoslicki interested in your thoughts...

TranslatorSRI / NodeNormalization

Be forgiving on prefixes? was: Missing CUI's (UMLS) #21