Add "preferred" identifiers to bulk download

amykglen commented 1 year ago

Hello - we've started using your KGX bulk download for the NodeNormalizer, and it's been great - thanks for making that!

However, we still have to query your API as well, because the bulk download doesn't seem to indicate the "preferred" identifier for each group of equivalent nodes.

For example, the "preferred" identifier for the concept 'water' seems to be PUBCHEM.COMPOUND:962, which is reported under [input_curie] --> "id" --> "identifier" in the NodeNormalizer REST API /get_normalized_nodes response:

curl -X 'GET' \
  'https://nodenormalization-sri.renci.org/1.3/get_normalized_nodes?curie=MESH%3AD014867&conflate=true' \
  -H 'accept: application/json'

{
  "MESH:D014867": {
    "id": {
      "identifier": "PUBCHEM.COMPOUND:962",
      "label": "Water"
    },
    "equivalent_identifiers": [
      {
        "identifier": "PUBCHEM.COMPOUND:962",
        "label": "Water"
      },
      {
        "identifier": "CHEMBL.COMPOUND:CHEMBL1098659",
        "label": "WATER"
      },
      {
        "identifier": "UNII:059QF0KO0R",
        "label": "WATER"
      },
      ...
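
A minimal Python sketch of reading that field out of a response, assuming the response has the shape shown above. The helper name preferred_identifier and the SAMPLE_RESPONSE constant are illustrative only and are not part of the NodeNormalizer API; the sample is a trimmed copy of the response above so that the snippet runs without a network call.

```python
import json

# Trimmed copy of the response shown above (only the fields used here).
SAMPLE_RESPONSE = json.loads("""
{
  "MESH:D014867": {
    "id": {"identifier": "PUBCHEM.COMPOUND:962", "label": "Water"},
    "equivalent_identifiers": [
      {"identifier": "PUBCHEM.COMPOUND:962", "label": "Water"},
      {"identifier": "CHEMBL.COMPOUND:CHEMBL1098659", "label": "WATER"},
      {"identifier": "UNII:059QF0KO0R", "label": "WATER"}
    ]
  }
}
""")


def preferred_identifier(response, input_curie):
    """Return response[input_curie]["id"]["identifier"], or None if missing.

    Hypothetical helper: it only walks the dictionary structure shown in
    the example response and makes no request itself.
    """
    entry = response.get(input_curie)
    if not entry:
        # The CURIE is absent from the response (or mapped to an empty value).
        return None
    return entry.get("id", {}).get("identifier")


print(preferred_identifier(SAMPLE_RESPONSE, "MESH:D014867"))
# prints: PUBCHEM.COMPOUND:962
```

This is the lookup that currently requires an API call per CURIE, since the same information does not appear to be available in the bulk file.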

In the bulk JSON Lines KGX nodes file, however, I see equivalent_identifiers but no indication of the "preferred" identifier for each cluster:

{"id": "UMLS:C0448837", "name": "Skin structure of female perineum", "category": "biolink:AnatomicalEntity", "equivalent_identifiers": ["UMLS:C0448837"]}
{"id": "UMLS:C0447166", "name": "Lower jugular lymph node", "category": "biolink:AnatomicalEntity", "equivalent_identifiers": ["UMLS:C0447166", "NCIT:C132512"]}
{"id": "NCIT:C132512", "name": "Lower Jugular Lymph Node Group (Level IV)", "category": "biolink:AnatomicalEntity", "equivalent_identifiers": ["UMLS:C0447166", "NCIT:C132512"]}
{"id": "UMLS:C4020414", "name": "subcentral operculum", "category": "biolink:AnatomicalEntity", "equivalent_identifiers": ["UMLS:C4020414"]}
{"id": "UMLS:C0832304", "name": "Bone of head of phalanx of right middle finger", "category": "biolink:AnatomicalEntity", "equivalent_identifiers": ["UMLS:C0832304"]}
{"id": "UBERON:4300154", "name": "procurrent spur", "category": "biolink:AnatomicalEntity", "equivalent_identifiers": ["UBERON:4300154"]}
{"id": "UMLS:C0821524", "name": "Trabecular bone of left pedicle of twelfth thoracic vertebra", "category": "biolink:AnatomicalEntity", "equivalent_identifiers": ["UMLS:C0821524"]}
{"id": "UMLS:C0924689", "name": "Trabecular bone of proximal phalanx of left third toe", "category": "biolink:AnatomicalEntity", "equivalent_identifiers": ["UMLS:C0924689"]}
{"id": "UMLS:C1283383", "name": "Spinothalamic tract of medulla", "category": "biolink:AnatomicalEntity", "equivalent_identifiers": ["UMLS:C1283383"]}
{"id": "UMLS:C0224428", "name": "Entire tensor fascia lata", "category": "biolink:AnatomicalEntity", "equivalent_identifiers": ["UMLS:C0224428"]}
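
To illustrate the gap, here is a rough Python sketch that groups nodes from the file by their equivalent_identifiers list. It assumes each line is a standalone JSON object with the "id" and "equivalent_identifiers" keys seen in the excerpt above; the function name group_by_clique and the SAMPLE_LINES constant are made up for this example, and the three lines are copied from the excerpt.

```python
import json
from collections import defaultdict

# Three lines copied from the bulk-file excerpt above.
SAMPLE_LINES = [
    '{"id": "UMLS:C0448837", "name": "Skin structure of female perineum", "category": "biolink:AnatomicalEntity", "equivalent_identifiers": ["UMLS:C0448837"]}',
    '{"id": "UMLS:C0447166", "name": "Lower jugular lymph node", "category": "biolink:AnatomicalEntity", "equivalent_identifiers": ["UMLS:C0447166", "NCIT:C132512"]}',
    '{"id": "NCIT:C132512", "name": "Lower Jugular Lymph Node Group (Level IV)", "category": "biolink:AnatomicalEntity", "equivalent_identifiers": ["UMLS:C0447166", "NCIT:C132512"]}',
]


def group_by_clique(lines):
    """Map each set of equivalent identifiers to the node ids that carry it."""
    cliques = defaultdict(list)
    for line in lines:
        node = json.loads(line)
        # frozenset makes the identifier list hashable and order-independent.
        key = frozenset(node["equivalent_identifiers"])
        cliques[key].append(node["id"])
    return cliques


# Report every clique that appears under more than one node id.
for members, node_ids in group_by_clique(SAMPLE_LINES).items():
    if len(node_ids) > 1:
        print(sorted(members), "->", node_ids)
# prints: ['NCIT:C132512', 'UMLS:C0447166'] -> ['UMLS:C0447166', 'NCIT:C132512']
```

In this excerpt, both UMLS:C0447166 and NCIT:C132512 appear as the "id" of their own line while sharing the same equivalent_identifiers, so nothing in either line marks which of the two is the preferred one.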

It would really help us out if you could add this. Thanks!

gaurav commented 7 months ago

@cbizon Would it be okay if I added a preferred_id node property to store the preferred id for any clique? Or is there an existing KGX property that would be better suited for this?

cbizon commented 7 months ago

Isn't "id" the preferred id?

TranslatorSRI / NodeNormalization

Add "preferred" identifiers to bulk download #187