biothings / pending.api

Set of standalone APIs built with the BioThings SDK for the Translator Project
https://biothings.ncats.io
Apache License 2.0

Modify SuppKG parser to better deal with fake UMLS IDs #220

Open andrewsu opened 1 month ago

andrewsu commented 1 month ago

We created an API for SuppKG in https://github.com/biothings/pending.api/issues/55 and https://github.com/biothings/biothings_explorer/issues/706. We previously noted that SuppKG created UMLS-like identifiers (which have the format "DCXXXXXXX" instead of "CXXXXXXX"). At the time, we decided to treat them as if they were UMLS IDs, but now that is resulting in some confusing results (e.g., https://github.com/NCATSTranslator/Feedback/issues/836), so it's time to adjust this behavior.

Vlado helped map these fake UMLS "DC" IDs to more common identifiers; the results are in supp_kg_chem_nodes.txt. To summarize, there were 56636 IDs for SuppKG nodes, 53707 of which start with "C" -- we assume these are valid UMLS IDs. Of the remaining 2928 IDs, which start with "DC", Vlado mapped 841 to CHEBI, CID, UNII, MESH, etc. In our parser script, let's replace these "DC" IDs with their mapped identifiers in our API. For the remaining 2087 nodes, for which Vlado could not find mappings, let's delete the records that use those IDs from our API.
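The replace-or-delete rule described above could be sketched roughly as follows. This is only an illustration, not the actual parser: the file layout (node ID in column 1, mapping count in column 3, pipe-delimited CURIEs in the last column) is inferred from the shell one-liner in this issue, the record shape with "subject" and "object" keys is hypothetical, and picking the first mapped identifier when several exist is an arbitrary placeholder choice.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def load_dc_mappings(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Build {DC id: [mapped CURIEs]} from tab-separated mapping rows.

    Assumed layout: column 1 = node ID, column 3 = number of mappings,
    last column = pipe-delimited CURIEs.
    """
    mappings: Dict[str, List[str]] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not fields[0].startswith("DC"):
            continue
        try:
            count = int(fields[2])
        except ValueError:
            continue  # header row or malformed count
        if count > 0:
            mappings[fields[0]] = [c for c in fields[-1].split("|") if c]
    return mappings


def resolve_node_id(node_id: str, mappings: Dict[str, List[str]]) -> Optional[str]:
    """Return the ID to use for a node, or None if the node should be dropped."""
    if not node_id.startswith("DC"):
        return node_id  # assumed to be a valid UMLS ID, kept unchanged
    mapped = mappings.get(node_id)
    # Placeholder policy: take the first mapping when there are several.
    return mapped[0] if mapped else None


def remap_records(records: Iterable[dict],
                  mappings: Dict[str, List[str]]) -> Iterator[dict]:
    """Rewrite mapped DC IDs and skip records that use an unmapped DC ID."""
    for record in records:
        subject = resolve_node_id(record["subject"], mappings)
        obj = resolve_node_id(record["object"], mappings)
        if subject is None or obj is None:
            continue  # unmapped "DC" ID: delete the record
        yield {**record, "subject": subject, "object": obj}
```

Which identifier to prefer for the nodes with multiple mappings is left open here and would need a real decision (for example, a namespace priority order).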

An analysis of the namespaces used for the 841 mapped IDs (262 of which are mapped to multiple identifiers):

$ grep '^D' supp_kg_chem_nodes.tsv  | gawkt '$3>0{print $NF}' | tr '|' '\n' | sed 's/:.*//' | sort | uniq -c | sort -k1nr
    626 CHEBI
    298 CID
    181 UNII
     78 MESH
     38 ChEMBL
     19 PHARMGKB.CHEMICAL
      6 CHEMBL.TARGET
      3 HMDB
      2 CAS
      2 DrugBank
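
For anyone who wants to reproduce this count from inside the parser, here is a rough Python equivalent of the one-liner above. It assumes that gawkt is a tab-delimited gawk wrapper and that the file has the mapping count in column 3 and pipe-delimited CURIEs in the last column; the function name is made up for illustration. Unlike the shell pipeline, it skips empty entries.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_namespaces(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Count CURIE prefixes for mapped "D..." rows, most frequent first."""
    counts: Counter = Counter()
    for line in lines:
        if not line.startswith("D"):          # grep '^D'
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        try:
            n_mapped = float(fields[2])       # $3 > 0
        except ValueError:
            continue
        if n_mapped <= 0:
            continue
        for curie in fields[-1].split("|"):   # tr '|' '\n'
            if curie:
                counts[curie.split(":", 1)[0]] += 1  # sed 's/:.*//'
    return counts.most_common()               # sort | uniq -c | sort -k1nr
```

Calling it as count_namespaces(open("supp_kg_chem_nodes.tsv")) should give the same kind of (namespace, count) listing as shown above, if the layout assumptions hold.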

colleenXu commented 1 month ago

Can we map the 6 CHEMBL.TARGET entities to a different ID namespace? Or remove them? It's an odd identifier for a chemical and NodeNorm doesn't really support that ID namespace (example automated test issue).

I also wonder about adjusting some ID-prefixes to the Translator format:

andrewsu commented 1 month ago

great points, thanks @colleenXu. Yes, let's revisit these details when we identify someone to work on this issue.