allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0
1.72k stars, 229 forks

Duplicate aliases in UMLS concepts dictionary #482

Closed: rxk2rxk closed this issue 1 year ago

rxk2rxk commented 1 year ago

Hi,

The scispaCy UMLS concepts dictionary (concept_details) in umls_utils.py has many cases of duplicate aliases for the same concept_id (i.e., duplicate strings in the "aliases" list). It has 5,341,734 aliases but only 4,358,081 are unique.

Attached is the most egregious example (concept_id "C0979217"). The dictionary entry for this CUI has only 17 unique aliases out of 506. For example, "OXYGEN 99 L in 100 L RESPIRATORY (INHALATION) GAS" appears 357 times.
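For anyone who wants to reproduce counts like these, here is a minimal sketch of how the duplication could be measured. It assumes that concept_details is a dict keyed by CUI and that each value is a dict with an "aliases" list; that shape is an assumption for illustration, and the toy data stands in for the real dictionary. The helper names (alias_duplicate_stats, most_repeated_alias) are hypothetical and not part of scispaCy. "Unique" is counted per concept here, which may differ from how the figures above were computed.

```python
from collections import Counter


def alias_duplicate_stats(concept_details):
    """Return (total aliases, unique aliases counted per concept, offenders).

    `offenders` maps each CUI that has repeated aliases to a
    (total, unique) pair for that concept.
    """
    total = 0
    unique = 0
    offenders = {}
    for cui, concept in concept_details.items():
        aliases = concept.get("aliases", [])
        n_total = len(aliases)
        n_unique = len(set(aliases))
        total += n_total
        unique += n_unique
        if n_total > n_unique:
            offenders[cui] = (n_total, n_unique)
    return total, unique, offenders


def most_repeated_alias(aliases):
    """Return (alias, count) for the most frequent alias, or None if empty."""
    if not aliases:
        return None
    return Counter(aliases).most_common(1)[0]


# Toy stand-in for the real dictionary (assumed shape, made-up CUIs).
toy_concepts = {
    "C0000001": {"concept_id": "C0000001", "aliases": ["a", "a", "b"]},
    "C0000002": {"concept_id": "C0000002", "aliases": ["x", "y"]},
}

total, unique, offenders = alias_duplicate_stats(toy_concepts)
print(total, unique, offenders)
print(most_repeated_alias(toy_concepts["C0000001"]["aliases"]))
```

Sorting the offenders by the gap between total and unique counts would surface the worst cases first.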

Thanks, Ron

concept_details.CUI=C0979217.txt

rxk2rxk commented 1 year ago

Upon further research, this duplication is not reflected in the UMLS Knowledge Base (umls_2022_ab_cat0129.jsonl) because export_umls_json.py deduplicates the aliases, so it should not affect most users. I noticed it only because I call read_umls_concepts directly from my code. The issue could be fixed at the source (umls_utils.py), but it is probably not a high priority.
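In the meantime, anyone else calling read_umls_concepts directly could deduplicate the aliases after loading. The sketch below is one possible workaround, not the code used in export_umls_json.py. It assumes concept_details is a dict keyed by CUI whose values are dicts with an "aliases" list; the helper name dedupe_aliases and the toy data are hypothetical.

```python
def dedupe_aliases(concept_details):
    """Remove repeated aliases in place, keeping first-seen order.

    dict.fromkeys keeps the first occurrence of each key and preserves
    insertion order, so the relative order of aliases is unchanged.
    """
    for concept in concept_details.values():
        concept["aliases"] = list(dict.fromkeys(concept.get("aliases", [])))
    return concept_details


# Toy stand-in for a loaded dictionary (assumed shape, made-up CUI).
loaded = {
    "C0000001": {
        "concept_id": "C0000001",
        "aliases": ["gas", "GAS", "gas", "oxygen", "gas"],
    },
}

dedupe_aliases(loaded)
print(loaded["C0000001"]["aliases"])
```

The comparison is exact, so strings that differ only in case are kept as separate aliases. Whether case-insensitive matching is desirable depends on how the aliases are used downstream.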