TranslatorSRI / Babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.
MIT License

Add support for better combination of child UMLS identifiers #368

Open gaurav opened 2 weeks ago

gaurav commented 2 weeks ago

There are a bunch of entries in UMLS, such as UMLS:C1847200 "Alzheimer Disease 4", that are noted as having a broader concept (here, UMLS:C0002395 "Alzheimer's Disease"). Yaphet has been running into issues where MedMentions uses the more specific ID while NodeNorm can only normalize the broader ID. So it would be useful if NodeNorm had some connection between them, based on either the direct broader/narrower relationships from UMLS, all broader/narrower relationships from UMLS, or some sort of threshold.

Some options, arranged from easiest to hardest:

  1. Leave it as-is, and let downstream users use MRREL from UMLS to figure out these broader/narrower relationships.
  2. Make the leftover UMLS generator much more sophisticated, so that for every UMLS ID it tries to add, it first walks up the hierarchy and tries to find a broader UMLS ID that has already been normalized as part of Babel. If it finds one, it adds the more specific ID to that existing clique, perhaps with a flag to indicate that this should be treated as an imperfect match or something.
    • Downside: we've currently implemented the leftover UMLS output as dependent on the compendia files, so we would need to reprocess those files after they've already been generated, which would be at a minimum inelegant and probably also quite hairy.
  3. If there is some kind of threshold we could use (some sort of indicator from UMLS that a particular ID is a good place to stop; for example, UMLS:C1847200 has only a single broader concept, while UMLS:C0002395 has a ton of broader concepts), then we could turn this into a conflation and make it optional.
  4. ???
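
To make options 1 to 3 a bit more concrete, here is a rough sketch of what the hierarchy walk might look like. It is not Babel code: `load_broader`, `nearest_normalized_ancestor`, `max_depth` and `max_broader` are all hypothetical names, and the rows in the usage example are toy data in the pipe-delimited MRREL.RRF layout (CUI1 in column 1, REL in column 4, CUI2 in column 5). It also assumes that a row with REL `RB` or `PAR` means CUI2 is broader than CUI1, which should be double-checked against the UMLS documentation before relying on it.

```python
from collections import defaultdict, deque


def load_broader(mrrel_lines, rels=("RB", "PAR")):
    """Map each CUI to the set of CUIs recorded as broader than it."""
    broader = defaultdict(set)
    for line in mrrel_lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 5:
            continue  # skip malformed or truncated rows
        cui1, rel, cui2 = fields[0], fields[3], fields[4]
        # ASSUMPTION: REL describes CUI2's relationship to CUI1,
        # so RB/PAR means CUI2 is the broader concept.
        if rel in rels and cui1 != cui2:
            broader[cui1].add(cui2)
    return broader


def nearest_normalized_ancestor(cui, broader, normalized,
                                max_depth=3, max_broader=None):
    """Breadth-first walk up the hierarchy from `cui`.

    Returns (ancestor, depth) for the closest broader CUI found in
    `normalized`, or None. `depth` could feed an "imperfect match" flag.
    If `max_broader` is set, concepts with more broader concepts than
    that are treated as stopping points and are not expanded further.
    """
    seen = {cui}
    queue = deque([(cui, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth > 0 and current in normalized:
            return current, depth
        if depth >= max_depth:
            continue
        parents = broader.get(current, set())
        if max_broader is not None and len(parents) > max_broader:
            continue
        for parent in sorted(parents):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return None


# Toy rows: only the first relationship comes from this issue;
# the CTOY* identifiers are made up for illustration.
toy_rows = [
    "C1847200|||RB|C0002395|",
    "C0002395|||RB|CTOY0001|",
    "C0002395|||RB|CTOY0002|",
    "C0002395|||RO|CTOY0003|",
]
toy_broader = load_broader(toy_rows)
match = nearest_normalized_ancestor("C1847200", toy_broader, {"C0002395"})
# match == ("C0002395", 1)
```

With `max_broader=None` this corresponds to "all broader relationships", `max_depth=1` gives only direct ones, and setting `max_broader` is one possible reading of the threshold idea in option 3. None of this addresses the downside noted under option 2 about reprocessing the compendia files.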