TranslatorSRI / Babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.

Mesh Proteins as chemicals #200

Open cbizon opened 8 months ago

cbizon commented 8 months ago

See https://github.com/NCATSTranslator/Feedback/issues/613, https://github.com/NCATSTranslator/Feedback/issues/614, and https://github.com/NCATSTranslator/Feedback/issues/615.

These are all proteins, which under Biolink are biological entities, but we're calling them chemicals. I think this was probably just never cleaned up after protein moved over into the biological entity branch.

cbizon commented 8 months ago

I took a look at the first one: https://id.nlm.nih.gov/mesh/D011972.html (Insulin receptor). According to the MESH code, this MESH ID should not be included as a Chemical. As that URL shows, the Tree values are D12.776 and D08, both of which are excluded in the chemical.py MESH filter. I'm not sure at this point whether the MESH ID is somehow getting into the chemical ID list, whether we're looking at an old result, or something else.

cbizon commented 8 months ago

OK, what I think is going on is that the MESH terms are correctly being put under Protein, but the UMLS terms are still getting called ChemicalEntities. Then the MESH terms are getting dragged along via a mapping. And I think the reason the UMLS terms are not working correctly is that our list of UMLS tree IDs doesn't use excludes. So Insulin Receptor has three listings in MRSTY:

C0034818|T116|A1.4.1.2.1.7|Amino Acid, Peptide, or Protein|AT17641609|256|
C0034818|T126|A1.4.1.1.3.3|Enzyme|AT17738045|256|
C0034818|T192|A1.4.1.1.3.6|Receptor|AT17615610|256|

So even though we don't let in Receptor, we do let in Enzyme. We need to instead say "if you are a receptor, you don't go here, no matter what your other listings say."
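To make the rule concrete, here is a minimal sketch of an exclusion-aware filter over the MRSTY rows above. It is not Babel's actual code: the function names and the `CHEMICAL_INCLUDE` / `CHEMICAL_EXCLUDE` lists are hypothetical, chosen only to reproduce the Insulin Receptor case.

```python
from collections import defaultdict

# Rows copied from the MRSTY excerpt above; the first three fields are
# the CUI, the semantic type ID, and the semantic type tree number.
MRSTY_ROWS = [
    "C0034818|T116|A1.4.1.2.1.7|Amino Acid, Peptide, or Protein|AT17641609|256|",
    "C0034818|T126|A1.4.1.1.3.3|Enzyme|AT17738045|256|",
    "C0034818|T192|A1.4.1.1.3.6|Receptor|AT17615610|256|",
]

# Hypothetical lists for illustration only (not the real chemical.py config).
CHEMICAL_INCLUDE = {"A1.4.1.1.3.3"}  # Enzyme
CHEMICAL_EXCLUDE = {"A1.4.1.1.3.6"}  # Receptor


def tree_numbers_by_cui(rows):
    """Group every tree number listed for a CUI into one set."""
    grouped = defaultdict(set)
    for row in rows:
        cui, _tui, tree_number = row.split("|")[:3]
        grouped[cui].add(tree_number)
    return grouped


def is_under(tree_number, prefix):
    """True if tree_number equals prefix or lies below it in the tree."""
    return tree_number == prefix or tree_number.startswith(prefix + ".")


def chemical_cuis(rows, include, exclude=()):
    """Return CUIs with an included type and no excluded type.

    Exclusions are checked first, so a single excluded listing vetoes the
    CUI regardless of its other listings.
    """
    selected = set()
    for cui, tree_numbers in tree_numbers_by_cui(rows).items():
        if any(is_under(t, p) for t in tree_numbers for p in exclude):
            continue
        if any(is_under(t, p) for t in tree_numbers for p in include):
            selected.add(cui)
    return selected


include_only = chemical_cuis(MRSTY_ROWS, CHEMICAL_INCLUDE)
with_excludes = chemical_cuis(MRSTY_ROWS, CHEMICAL_INCLUDE, CHEMICAL_EXCLUDE)
```

With include-only matching, C0034818 is selected because of its Enzyme listing; once the exclude list is passed, the Receptor listing vetoes it. The key design point is that the decision is made per CUI over all of its listings, not row by row.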

cbizon commented 8 months ago

It also looks like A1.4.1.2.1.7 is being grabbed by protein. So basically we need to:

  1. Make sure that this branch of UMLS is correctly divided between chemical.py and protein.py.
  2. Correctly handle exclusions at the UMLS ID level.
  3. Ensure that the UMLS/MESH versions of proteins somehow merge with the UNIPROT/PR/HGNC versions.