rxk2rxk closed this issue 1 year ago
Upon further research, this duplication is not reflected in the UMLS Knowledge Base (umls_2022_ab_cat0129.jsonl) because export_umls_json.py deduplicates the aliases, so it should not impact most users. I only noticed it because I am calling read_umls_concepts directly from my code. The issue could be fixed at the source (umls_utils.py), but it is probably not a high priority.
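For anyone else calling read_umls_concepts directly, here is a minimal sketch of a workaround, not the actual logic in export_umls_json.py. It assumes concept_details is a dict mapping each CUI to a dict with an "aliases" list; dedupe_aliases is a hypothetical helper name, and the sample entry is toy data.

```python
def dedupe_aliases(concept_details):
    """Remove repeated aliases in place, keeping first-seen order.

    Assumes each value in concept_details is a dict with an "aliases" list.
    """
    for concept in concept_details.values():
        # dict.fromkeys keeps the first occurrence of each string, in order
        concept["aliases"] = list(dict.fromkeys(concept["aliases"]))
    return concept_details


# Toy entry (hypothetical CUI and alias strings, not real UMLS data)
toy_details = {
    "C_TOY_1": {"aliases": ["alias a", "alias b", "alias a", "alias a"]},
}
dedupe_aliases(toy_details)
```

Using dict.fromkeys rather than set() keeps the aliases in their original order, which matters if downstream code treats the first alias specially.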
Hi,
The scispaCy UMLS concepts dictionary (concept_details) in umls_utils.py has many cases of duplicate aliases for the same concept_id (i.e., duplicate strings in the "aliases" list). It has 5,341,734 aliases but only 4,358,081 are unique.
Attached is the most egregious example (concept_id "C0979217"). The dictionary entry for this CUI has only 17 unique aliases out of 506. For example, "OXYGEN 99 L in 100 L RESPIRATORY (INHALATION) GAS" appears 357 times.
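In case it helps with reproducing the counts, this is roughly how such numbers can be computed. It is a sketch under assumptions: concept_details is taken to be a dict mapping each CUI to a dict with an "aliases" list, "unique" is counted per concept, and alias_stats and most_repeated_alias are hypothetical helper names. The sample data is a toy stand-in, not the real dictionary.

```python
from collections import Counter


def alias_stats(concept_details):
    """Return (total aliases, aliases that are unique within their concept)."""
    total = sum(len(c["aliases"]) for c in concept_details.values())
    unique = sum(len(set(c["aliases"])) for c in concept_details.values())
    return total, unique


def most_repeated_alias(concept):
    """Return (alias, count) for the most frequent alias, or None if empty."""
    counts = Counter(concept["aliases"])
    return counts.most_common(1)[0] if counts else None


# Toy stand-in for concept_details (hypothetical CUIs and a made-up second alias)
repeated = "OXYGEN 99 L in 100 L RESPIRATORY (INHALATION) GAS"
toy_details = {
    "C_TOY_1": {"aliases": [repeated, repeated, "toy alias", repeated]},
    "C_TOY_2": {"aliases": ["x", "y"]},
}

total, unique = alias_stats(toy_details)
top_alias, top_count = most_repeated_alias(toy_details["C_TOY_1"])
print(total, unique, top_count)
```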
Thanks, Ron
concept_details.CUI=C0979217.txt