biothings / mydisease.info


UMLS parser and dumper improvements #48

Closed ravila4 closed 2 years ago

ravila4 commented 2 years ago

Implement automatic version checking, but keep manual file dumping

(Closes #46) Automated dumping is not possible because the files are behind an authorization portal. Summary of workflow:

Revise document merging strategy

Previously, documents were merged with on_duplicates set to "ignore". I don't think this is the best merging strategy.

Duplicate ids occur because we query mydisease.info to fetch the _id of each document, and UMLS CUIs can have a many-to-many relationship with the primary key.

For example: MONDO:0005160 is mapped to multiple CUIs: ['C0003486', 'C0265010', 'C0265012', 'C0741160', 'C1305122'], which results in multiple UMLS documents with _id = MONDO:0005160
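To make the collision concrete, here is a minimal sketch that builds one document per CUI and counts how many share an _id. The `cui_to_id` mapping stands in for the result of the mydisease.info lookup, and the per-CUI document shape is an assumption made for illustration, not the parser's actual output.

```python
from collections import Counter

# Hypothetical lookup result: each CUI resolved to an _id via mydisease.info.
# The CUIs and the MONDO id are the ones quoted in the example above.
cui_to_id = {
    "C0003486": "MONDO:0005160",
    "C0265010": "MONDO:0005160",
    "C0265012": "MONDO:0005160",
    "C0741160": "MONDO:0005160",
    "C1305122": "MONDO:0005160",
}

# One document per CUI (assumed shape), so the same _id is emitted repeatedly.
docs = [{"_id": _id, "umls": {"umls": cui}} for cui, _id in cui_to_id.items()]

# Count how often each _id occurs and keep only the repeated ones.
id_counts = Counter(doc["_id"] for doc in docs)
duplicates = {_id: n for _id, n in id_counts.items() if n > 1}
print(duplicates)
```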

Implemented solution

In the above case, the merged document looks like this:

[
  {
    "_id": "MONDO:0005160",
    "umls": {
      "icd10": {
        "preferred": [
          "I71.3",
          "I71.1",
          "I71.5"
        ],
        "non-preferred": "I71.8"
      },
      "icd10am": {
        "preferred": [
          "I71.3",
          "I71.1",
          "I71.5"
        ],
        "non-preferred": "I71.8"
      },
      "icd10cm": {
        "preferred": [
          "I71.3",
          "I71.1",
          "I71.5"
        ],
        "non-preferred": [
          "I71.8",
          "I71.9"
        ]
      },
      "snomed": {
        "preferred": [
          "155423003",
          "73067008",
          "195264004",
          "195258006",
          "34365005",
          "155419006",
          "195269009",
          "67362008",
          "155424009",
          "14336007",
          "195265003"
        ],
        "non-preferred": [
          "155423003",
          "195615002",
          "73067008",
          "195264004",
          "195258006",
          "34365005",
          "67362008",
          "195269009",
          "155419006",
          "155424009",
          "14336007",
          "195265003"
        ]
      },
      "icd9cm": {
        "non-preferred": [
          "441.3",
          "441.1",
          "441.5",
          "441.6"
        ]
      },
      "umls": [
        "C0265010",
        "C0003486",
        "C0265012",
        "C0741160",
        "C1305122"
      ],
      "mesh": {
        "preferred": [
          "D001019",
          "D001014"
        ]
      },
      "nci": {
        "preferred": [
          "C27046",
          "C26697",
          "C27198"
        ],
        "non-preferred": "C27299"
      }
    }
  }
]
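A merge along these lines could be sketched as follows. This is only an illustration of the idea, not the code in this PR: the helper names `merge_values` and `merge_docs` are hypothetical, and the split of codes across the two input documents is invented for the example. Values found under the same key are collected into a de-duplicated list, and a key with a single value stays a plain string, which matches the shape of "non-preferred": "I71.8" above.

```python
def _as_list(value):
    """Wrap a scalar in a list; return lists unchanged."""
    return value if isinstance(value, list) else [value]


def merge_values(a, b):
    """Recursively merge two values without mutating either input."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, val in b.items():
            merged[key] = merge_values(merged[key], val) if key in merged else val
        return merged
    # Leaf values: concatenate, drop duplicates, keep first-seen order.
    combined = []
    for item in _as_list(a) + _as_list(b):
        if item not in combined:
            combined.append(item)
    # A single surviving value is kept as a scalar rather than a 1-item list.
    return combined[0] if len(combined) == 1 else combined


def merge_docs(docs):
    """Merge all documents that share an _id into one document per _id."""
    merged = {}
    for doc in docs:
        _id = doc["_id"]
        merged[_id] = merge_values(merged[_id], doc) if _id in merged else doc
    return list(merged.values())


# Illustrative input only: which code belongs to which CUI is made up here.
docs = [
    {"_id": "MONDO:0005160",
     "umls": {"umls": "C0003486", "icd10": {"preferred": "I71.3"}}},
    {"_id": "MONDO:0005160",
     "umls": {"umls": "C0265010",
              "icd10": {"preferred": "I71.1", "non-preferred": "I71.8"}}},
]
result = merge_docs(docs)
```

With this input, `result` holds a single document whose "umls" and "preferred" fields are two-item lists, while "non-preferred" remains the string "I71.8".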

Additional fixes: