callahantiff / OMOP2OBO

OMOP2OBO: A Python Library for mapping OMOP standardized clinical terminologies to Open Biomedical Ontologies
http://tiffanycallahan.com/OMOP2OBO_Dashboard
MIT License

Improve string delimiter detection in mapping pipeline #45

Open callahantiff opened 3 years ago

callahantiff commented 3 years ago

Describe the Bug

The mapping pipeline assumes that all concept synonyms and ancestor information are input in an aggregated format, with each aggregated concept separated by a | delimiter. That assumption is brittle, and delimiter handling should be made more robust. Examples of specs for input data can be found here: resources/clinical_data/README.md

EXAMPLE:
Input Data
The CONCEPT_SYNONYM column below displays data in the expected input format

CONCEPT_ID: 37018594
CONCEPT_SOURCE_CODE: snomed:80251000119104
CONCEPT_LABEL: Complement level below reference range
CONCEPT_SOURCE_LABEL: Complement level below reference range
CONCEPT_SYNONYM: Complement level below reference range | Complement level below reference range (finding)
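
For reference, the expected format can be parsed with a plain split on |. This is a minimal sketch of that assumption, not the code in omop2obo/clinical_concept_annotator.py; `split_on_pipe` is a hypothetical helper name used only for illustration.

```python
from typing import List


def split_on_pipe(value: str) -> List[str]:
    """Split an aggregated field on '|' and trim whitespace (hypothetical helper)."""
    return [part.strip() for part in value.split("|") if part.strip()]


# CONCEPT_SYNONYM value copied from the example row above.
expected_input = (
    "Complement level below reference range | "
    "Complement level below reference range (finding)"
)
synonyms = split_on_pipe(expected_input)
```

With input in this format, each synonym comes back as its own list entry, which is what the exact match step needs.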

Example of Data that Breaks Assumptions:
The CONCEPT_SYNONYM column below displays data in an unexpected input format (i.e., two types of delimiters: | and ;)

CONCEPT_ID: 40771573
CONCEPT_SOURCE_CODE: loinc:69052-9
CONCEPT_LABEL: Flow cytometry specialist review of results
CONCEPT_SYNONYM: Flow cytometry specialist review of results | Flow cytometry specialist review | Dynamic; Impression; Impression/interpretation of study; Impressions; Interp; Interpretation; Misc; Miscellaneous; Narrative; Other; Point in time; Random; Report; To be specified in another part of the message; Unspecified
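
One possible way to tolerate mixed delimiters is to split on a configurable set of candidates rather than on | alone. The sketch below is an illustration only: `split_aggregated` and its `delimiters` parameter are hypothetical names, not part of the omop2obo API, and it assumes that ; never occurs inside a legitimate synonym, which would need to be checked against real data.

```python
import re
from typing import List, Sequence


def split_aggregated(value: str, delimiters: Sequence[str] = ("|", ";")) -> List[str]:
    """Split an aggregated field on any of the given delimiters (hypothetical helper)."""
    # Escape each delimiter so characters such as '|' are matched literally.
    pattern = "|".join(re.escape(d) for d in delimiters)
    return [part.strip() for part in re.split(pattern, value) if part.strip()]


# CONCEPT_SYNONYM value copied from the LOINC example above.
mixed_input = (
    "Flow cytometry specialist review of results | "
    "Flow cytometry specialist review | "
    "Dynamic; Impression; Impression/interpretation of study; Impressions; "
    "Interp; Interpretation; Misc; Miscellaneous; Narrative; Other; "
    "Point in time; Random; Report; "
    "To be specified in another part of the message; Unspecified"
)

pipe_only = [part.strip() for part in mixed_input.split("|")]
tolerant = split_aggregated(mixed_input)
```

Splitting on | alone yields three entries, the last of which is the entire semicolon-separated run starting with "Dynamic". Splitting on both delimiters yields 17 separate synonyms.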


Impact Level

LOW - the string similarity mapping pipeline correctly handles all delimiter types, which allows it to recover mappings missed in the exact match part of the pipeline.

Impacted Scripts

omop2obo/clinical_concept_annotator.py

Solution

callahantiff commented 3 years ago