Closed dkoslicki closed 4 months ago
Just an FYI, the CUI CURIE prefix is going away (in favor of UMLS) in the new KG2, as CUI: is not biolink-standard. See #777
This issue seems similar to #862; some disambiguation may be helpful.
We should definitely follow up on CHEBI:5921 though. I do not find CHEBI:5921 in the NodeNamesDescriptions file dumps, and so that's why it's not in the NodeSynonymizer. Is it in KG2? Is it one of these mysterious no-name nodes? It's definitely an entity: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:5921
This issue is for problems with the new ARAX NodeSynonymizer. The other issue #862 is an effort to improve the SRI Node Normalizer.
Some more returned from BTE for you:
2020-07-09 04:54:11.845196 WARNING: NodeSynonymizer did not return info for: {'DRUGBANK:DB14236', 'CUI:C0307890', 'CHEMBL.COMPOUND:CHEMBL4298065', 'CUI:C0597152', 'CHEBI:145810', 'CUI:C0599786', 'CHEMBL.COMPOUND:CHEMBL1200915', 'CUI:C0087096', 'CHEMBL.COMPOUND:CHEMBL1200864', 'CHEMBL.COMPOUND:CHEMBL4303209'}
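To make a list like this easier to triage, the missing curies can be grouped by prefix. The following is only a minimal sketch: it assumes the warning always ends with a Python set literal after the phrase "did not return info for: ", and the helper name `group_missing_by_prefix` is hypothetical, not part of ARAX.

```python
import ast
from collections import defaultdict

# The warning line quoted above, copied verbatim.
WARNING_LINE = (
    "2020-07-09 04:54:11.845196 WARNING: NodeSynonymizer did not return info for: "
    "{'DRUGBANK:DB14236', 'CUI:C0307890', 'CHEMBL.COMPOUND:CHEMBL4298065', "
    "'CUI:C0597152', 'CHEBI:145810', 'CUI:C0599786', "
    "'CHEMBL.COMPOUND:CHEMBL1200915', 'CUI:C0087096', "
    "'CHEMBL.COMPOUND:CHEMBL1200864', 'CHEMBL.COMPOUND:CHEMBL4303209'}"
)

MARKER = "did not return info for: "


def group_missing_by_prefix(line):
    """Hypothetical helper: parse the set of curies and bucket them by prefix."""
    _, found, payload = line.partition(MARKER)
    if not found:
        return {}
    # Assumption: the payload is a plain Python set literal of strings.
    curies = ast.literal_eval(payload.strip())
    groups = defaultdict(list)
    for curie in sorted(curies):
        prefix, _, _ = curie.partition(":")
        groups[prefix].append(curie)
    return dict(groups)


if __name__ == "__main__":
    for prefix, curies in sorted(group_missing_by_prefix(WARNING_LINE).items()):
        print(f"{prefix}: {len(curies)} missing -> {', '.join(curies)}")
```

Grouping this way shows at a glance how many of the misses are CUI: curies, which are expected to change anyway once the prefix moves to UMLS.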
Thank you, a useful list. Let's work through one example: CHEBI:145810. Why isn't it in the NodeSynonymizer? For the simple reason that it is not in NodeNamesDescriptions_KG2.tsv. Why is it not in NodeNamesDescriptions_KG2.tsv? I'll need help with that one. @saramsey? Interestingly, the SRI Node Normalizer does know about it! It returns:
{
  "CHEBI:145810": {
    "equivalent_identifiers": [
      {
        "identifier": "CHEBI:145810",
        "label": "insulin"
      }
    ],
    "id": {
      "identifier": "CHEBI:145810",
      "label": "insulin"
    },
    "type": [
      "chemical_substance",
      "molecular_entity",
      "biological_entity",
      "named_thing"
    ]
  }
}
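A quick way to spot such single-member clusters is to count the equivalent identifiers per curie in a response shaped like the one above. This is only a sketch over the JSON exactly as pasted here: the function name `cluster_sizes` is hypothetical, and treating a null entry as "unknown curie" is an assumption about the service, not something shown in this thread.

```python
import json

# The response quoted above, embedded as a string so the sketch is self-contained.
RESPONSE_TEXT = """
{
  "CHEBI:145810": {
    "equivalent_identifiers": [
      {"identifier": "CHEBI:145810", "label": "insulin"}
    ],
    "id": {"identifier": "CHEBI:145810", "label": "insulin"},
    "type": [
      "chemical_substance",
      "molecular_entity",
      "biological_entity",
      "named_thing"
    ]
  }
}
"""


def cluster_sizes(response):
    """Hypothetical helper: map each curie to the size of its synonym cluster.

    Assumption: a curie the service does not know maps to null (None);
    such entries are reported with a size of 0.
    """
    sizes = {}
    for curie, entry in response.items():
        if entry is None:
            sizes[curie] = 0
        else:
            sizes[curie] = len(entry.get("equivalent_identifiers", []))
    return sizes


if __name__ == "__main__":
    response = json.loads(RESPONSE_TEXT)
    for curie, size in cluster_sizes(response).items():
        note = " (no synonyms beyond itself)" if size == 1 else ""
        print(f"{curie}: cluster of {size}{note}")
```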
Interesting. The SRI normalizer has it all by itself in a cluster of 1; apparently it is the only curie for something called "insulin". The NodeSynonymizer has a group of 70 (!) curies that are synonyms of insulin (including CHEBI:5931). At ChEBI (https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:145810) it appears that 145810 (insulin) is the parent term of 5931 (insulin human), and KG2 only has CHEBI:5931.
So, the SRI Node Normalizer knows about it but has no synonyms. The NodeSynonymizer doesn't know about it because it's not in NodeNamesDescriptions_KG2.tsv and is not linked by the SRI Node Normalizer.
Possible remedies are:
A stickler for ontologies might claim that the parent term "insulin" and its child "insulin human" should not be lumped together. But many of the 70 curies in the current cluster probably conflate insulin in general and human insulin, so that's a potential problem with such an automated system summarizing messy data.
The next step is unclear to me.
The NodeSynonymizer was rebuilt long ago. This is probably no longer relevant; closing.
This issue will be a running record of curies that the node normalizer doesn't know about. Leaving it to @edeutsch to assess if they should be known about by the node normalizer, or if it's something to let the SRI handle after they get their normalizer more robust.