RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

CURIES that node synonymizer can't handle #898

Closed dkoslicki closed 4 months ago

dkoslicki commented 4 years ago

This issue will be a running record of curies that the node normalizer doesn't know about. Leaving it to @edeutsch to assess if they should be known about by the node normalizer, or if it's something to let the SRI handle after they get their normalizer more robust.

dkoslicki commented 4 years ago
saramsey commented 4 years ago

Just an FYI, the CUI CURIE prefix is going away (in favor of UMLS) in the new KG2, as CUI: is not biolink-standard. See #777

saramsey commented 4 years ago

This issue seems similar to #862; some disambiguation may be helpful.

edeutsch commented 4 years ago

We should definitely follow up on CHEBI:5921 though. I do not find CHEBI:5921 in the NodeNamesDescriptions file dumps, and so that's why it's not in the NodeSynonymizer. Is it in KG2? Is it one of these mysterious no-name nodes? It's definitely an entity: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:5921

edeutsch commented 4 years ago

This issue is for problems with the new ARAX NodeSynonymizer. The other issue #862 is an effort to improve the SRI Node Normalizer.

amykglen commented 4 years ago

some more returned from BTE for you:

2020-07-09 04:54:11.845196 WARNING: NodeSynonymizer did not return info for: {'DRUGBANK:DB14236', 'CUI:C0307890', 'CHEMBL.COMPOUND:CHEMBL4298065', 'CUI:C0597152', 'CHEBI:145810', 'CUI:C0599786', 'CHEMBL.COMPOUND:CHEMBL1200915', 'CUI:C0087096', 'CHEMBL.COMPOUND:CHEMBL1200864', 'CHEMBL.COMPOUND:CHEMBL4303209'}
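For triage, a warning like the one above can be broken down by CURIE prefix. The following is only a rough sketch: the helper name `missing_curies_by_prefix` is made up for illustration, and it assumes the warning always ends with a Python set literal, as it does in this log line.

```python
import ast
from collections import defaultdict


def missing_curies_by_prefix(log_line: str) -> dict:
    """Group the CURIEs listed in a 'did not return info for' warning by prefix.

    Hypothetical helper: assumes the warning ends with a Python set literal.
    """
    # Everything from the first '{' onward is the set literal of CURIEs.
    curies = ast.literal_eval(log_line[log_line.index("{"):])
    grouped = defaultdict(list)
    for curie in sorted(curies):
        # The prefix is the part before the first colon, e.g. 'CHEBI'.
        prefix, _, _local_id = curie.partition(":")
        grouped[prefix].append(curie)
    return dict(grouped)


line = (
    "2020-07-09 04:54:11.845196 WARNING: NodeSynonymizer did not return info for: "
    "{'DRUGBANK:DB14236', 'CUI:C0307890', 'CHEMBL.COMPOUND:CHEMBL4298065', "
    "'CUI:C0597152', 'CHEBI:145810', 'CUI:C0599786', 'CHEMBL.COMPOUND:CHEMBL1200915', "
    "'CUI:C0087096', 'CHEMBL.COMPOUND:CHEMBL1200864', 'CHEMBL.COMPOUND:CHEMBL4303209'}"
)

grouped = missing_curies_by_prefix(line)
for prefix, members in sorted(grouped.items()):
    print(prefix, len(members), members)
```

For this particular line, the breakdown is four CUI: and four CHEMBL.COMPOUND: curies, plus one each for DRUGBANK: and CHEBI:.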

edeutsch commented 4 years ago

Thank you, a useful list. Let's work through one example: CHEBI:145810.

Why isn't it in the NodeSynonymizer? For the simple reason that it is not in NodeNamesDescriptions_KG2.tsv.

Why is it not in NodeNamesDescriptions_KG2.tsv? I'll need help with that one. @saramsey?

Interestingly, the SRI Node Normalizer does know about it! It returns:

{
  "CHEBI:145810": {
    "equivalent_identifiers": [
      {
        "identifier": "CHEBI:145810",
        "label": "insulin"
      }
    ],
    "id": {
      "identifier": "CHEBI:145810",
      "label": "insulin"
    },
    "type": [
      "chemical_substance",
      "molecular_entity",
      "biological_entity",
      "named_thing"
    ]
  }
}

Interesting. The SRI normalizer has it all by itself in a cluster of 1; apparently it is the only curie for something called "insulin". The NodeSynonymizer has a group of 70 (!) curies that are synonyms of insulin (including CHEBI:5931).

At ChEBI (https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:145810) it appears that 145810 (insulin) is the parent term of 5931 (insulin human), and KG2 only has CHEBI:5931.
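To make the "cluster of 1" observation concrete, here is a small sketch that reads a response shaped like the one quoted above and reports the label and the number of distinct equivalent identifiers. The function name `cluster_summary` is hypothetical, and the code works only on the JSON structure shown in this comment; it does not call the SRI service.

```python
import json

# The response quoted above, pasted in as-is.
sri_response = json.loads("""
{
  "CHEBI:145810": {
    "equivalent_identifiers": [
      {"identifier": "CHEBI:145810", "label": "insulin"}
    ],
    "id": {"identifier": "CHEBI:145810", "label": "insulin"},
    "type": ["chemical_substance", "molecular_entity",
             "biological_entity", "named_thing"]
  }
}
""")


def cluster_summary(response: dict, curie: str) -> tuple:
    """Return (label, cluster_size) for a curie, or (None, 0) if it is unknown.

    Hypothetical helper; assumes the response layout shown in this thread.
    """
    entry = response.get(curie)
    if entry is None:
        return (None, 0)
    # Count distinct identifiers in case the same curie is listed twice.
    members = {e["identifier"] for e in entry["equivalent_identifiers"]}
    return (entry["id"].get("label"), len(members))


label, size = cluster_summary(sri_response, "CHEBI:145810")
print(label, size)  # insulin 1
```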

So, SRI Node Normalizer knows about it, but has no synonyms. NodeSynonymizer doesn't know about it because it's not in NodeNamesDescriptions_KG2.tsv and is not linked by SRI Node Normalizer.
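The first half of that diagnosis (is the curie in the node names dump at all?) could be checked with something like the sketch below. It assumes, without confirmation from this thread, that the dump is tab-separated with the curie in the first column. The function name is hypothetical, and the sample row is illustrative rather than copied from the real NodeNamesDescriptions_KG2.tsv.

```python
import csv
import io


def curies_missing_from_tsv(tsv_file, curies, curie_column=0):
    """Return the given curies that never appear in the chosen TSV column.

    Hypothetical helper; the column position is an assumption about the dump.
    """
    present = set()
    # QUOTE_NONE keeps stray quote characters in descriptions from confusing the parser.
    for row in csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) > curie_column:
            present.add(row[curie_column])
    return sorted(set(curies) - present)


# Illustrative stand-in for the real dump: only the child term is present.
sample = io.StringIO("CHEBI:5931\tinsulin human\tplaceholder description\n")
missing = curies_missing_from_tsv(sample, ["CHEBI:145810", "CHEBI:5931"])
print(missing)  # ['CHEBI:145810']
```

With the real file, `tsv_file` would be an open file handle instead of the in-memory sample.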

Possible remedies are:

A stickler for ontologies might claim that the parent term 'insulin' and its child 'insulin human' should not be lumped together. But many of the 70 curies in the current cluster probably conflate insulin in general and human insulin, so that's a potential problem with such an automated system summarizing messy data.

The next step is unclear to me.

edeutsch commented 4 months ago

NodeSynonymizer was rebuilt long ago. Probably no longer relevant, closing.