TranslatorSRI / Babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.
MIT License

2024oct24 defaults UMLS IDs to proteins rather than chemical entities #370

Open gaurav opened 2 weeks ago

gaurav commented 2 weeks ago

In 2024oct24, we have compendia like so:

$ grep "UMLS:C0949788" *.txt
ChemicalEntity.txt:{"type": "biolink:ChemicalEntity", "ic": null, "identifiers": [{"i": "MESH:D027301", "l": "Fusion Regulatory Protein 1, Light Chains", "d": [], "t": []}, {"i": "UMLS:C0949788", "l": "Antigens, CD98 Light Chains", "d": [], "t": []}], "preferred_name": "Fusion Regulatory Protein 1, Light Chains", "taxa": []}
Protein.txt:{"type": "biolink:Protein", "ic": null, "identifiers": [{"i": "UMLS:C0949788", "l": "Antigens, CD98 Light Chains", "d": [], "t": []}], "preferred_name": "Antigens, CD98 Light Chains", "taxa": []}
^C
$ grep "UMLS:C0019630" *.txt
ChemicalEntity.txt:{"type": "biolink:ChemicalEntity", "ic": null, "identifiers": [{"i": "MESH:D000949", "l": "Histocompatibility Antigens Class II", "d": [], "t": []}, {"i": "UMLS:C0019630", "l": "Histocompatibility Antigens Class II", "d": [], "t": []}], "preferred_name": "Histocompatibility Antigens Class II", "taxa": []}
Protein.txt:{"type": "biolink:Protein", "ic": null, "identifiers": [{"i": "UMLS:C0019630", "l": "Histocompatibility Antigens Class II", "d": [], "t": []}], "preferred_name": "Histocompatibility Antigens Class II", "taxa": []}
^C

These compendium entries are identical to what we see in 2024oct1:

$ grep "UMLS:C0019630" *.txt
ChemicalEntity.txt:{"type": "biolink:ChemicalEntity", "ic": null, "identifiers": [{"i": "MESH:D000949", "l": "Histocompatibility Antigens Class II", "d": [], "t": []}, {"i": "UMLS:C0019630", "l": "Histocompatibility Antigens Class II", "d": [], "t": []}], "preferred_name": "Histocompatibility Antigens Class II", "taxa": []}
Protein.txt:{"type": "biolink:Protein", "ic": null, "identifiers": [{"i": "UMLS:C0019630", "l": "Histocompatibility Antigens Class II", "d": [], "t": []}], "preferred_name": "Histocompatibility Antigens Class II", "taxa": []}
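To check how widespread this is beyond the two identifiers above, something like the following could list every CURIE that appears in more than one compendium. This is a minimal sketch, not part of Babel: the function name `find_shared_curies` is hypothetical, and it assumes each compendium line is a JSON object with an `identifiers` list whose entries carry the CURIE under `i`, as in the grep output above.

```python
import json
from collections import defaultdict


def find_shared_curies(compendia):
    """Return {curie: [compendium names]} for CURIEs found in more than one compendium.

    `compendia` maps a compendium name (e.g. "Protein.txt") to an iterable of
    JSON lines. Hypothetical helper; assumes the line format shown above.
    """
    seen = defaultdict(set)
    for name, lines in compendia.items():
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            clique = json.loads(line)
            for ident in clique.get("identifiers", []):
                seen[ident["i"]].add(name)
    # Keep only identifiers that occur in two or more compendia.
    return {curie: sorted(names) for curie, names in seen.items() if len(names) > 1}
```

Open file handles can be passed directly as the iterables, e.g. `find_shared_curies({"ChemicalEntity.txt": open("ChemicalEntity.txt"), "Protein.txt": open("Protein.txt")})`.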

I'm guessing the change in behavior is caused by a difference in how we loaded this data (e.g. maybe we loaded proteins before chemical entities in 2024oct24). For 2024oct24, I'm going to see if I can figure out a way of preferring chemical entities in the NodeNorm frontend. Otherwise, we'll need to decide whether to try to crack the protein/chemical entity duplication issue (#276) before our next release.
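As a rough illustration of what "preferring chemical entities" could look like, here is a sketch that picks one clique when an identifier resolves to several. This is not NodeNorm's actual code: `pick_preferred_clique` and `TYPE_PRIORITY` are hypothetical names, and the sketch assumes the candidate cliques are already available as dicts with a `type` field like the compendium entries above.

```python
# Hypothetical priority order: earlier types win when an identifier is in several cliques.
TYPE_PRIORITY = ("biolink:ChemicalEntity", "biolink:Protein")


def pick_preferred_clique(cliques, type_priority=TYPE_PRIORITY):
    """Return the clique whose type ranks first in `type_priority`.

    Types not listed rank after all listed types; ties keep the earliest
    clique in the input. Returns None if `cliques` is empty.
    """
    if not cliques:
        return None

    def rank(clique):
        try:
            return type_priority.index(clique.get("type"))
        except ValueError:
            return len(type_priority)  # unknown or missing type sorts last

    return min(cliques, key=rank)
```

With this ordering, the result no longer depends on whether proteins or chemical entities were loaded first, since the choice is made by type rather than by load order.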