MIT-LCP / mimic-omop

Mapping the MIMIC-III database to the OMOP schema
MIT License
128 stars 48 forks source link

Extracting numbers from free text fields #34

Open alistairewj opened 6 years ago

alistairewj commented 6 years ago

I was looking through the PRESCRIPTIONS -> DRUG_EXPOSURE ETL, and came across the use of the extract_value function here: https://github.com/MIT-LCP/mimic-omop/blob/68e521268137abe2f5f44fd03c104f48aa5b3d2f/etl/StandardizedClinicalDataTables/DRUG_EXPOSURE/etl.sql#L40

The function seems to extract numbers from text - which is great - but it makes the assumption that either commas or decimals are used as the decimal separator, e.g. it allows for 1.05 and 1,05 to both map to the same number. However this is a regional thing, and in MIMIC a comma separator is almost always used as a thousandths separator.

For MIMIC, prescriptions, we see this is bad:

select
prod_strength
, count(*) as nobs
, max(dose_val_rx) as dose_str
, extract_value(max(dose_val_rx)) as dose_numeric
from prescriptions
where dose_val_rx like '%,%'
group by prod_strength
order by count(*) desc
limit 10;

Returns:

prod_strength nobs dose_str dose_numeric
25,000 unit Premix Bag 31532 25,000 25
10,000 Units / mL - 5 mL Vial 842 25,000 25
10,000 Unit Vial 588 55,000 55
50,000 Unit Capsule 324 50,000 50
5000 Units / mL- 1mL Vial 194 30,000 30
500,000 Unit Tablet 110 500,000 500
200mg/mL-4mL Vial 104 16,000 16
10,000 Unit Capsule 97 50,000 50
20,000 Unit Vial 85 60,000 60
500,000 Unit UDCUP 75 500,000 500

I think there should be one function for European ETL and one function for non-European ETL. I am going to separate the functions and have extract_value_decimal_sep and looks_like_value_decimal_sep to be used with UK/American style separators. I'll fix this for prescriptions and leave this issue open because we should check the other ETLs for this bug. grep tells us to look at:

alistairewj commented 6 years ago

Updated function here: https://github.com/MIT-LCP/mimic-omop/commit/e1f5b984f2a0b120591aa2eb09fe7514453d5bc7

Both extracts use looks_like_value to check if it's a valid number (no matter the format). Also did a fix to pick up the number 1,000,000 here https://github.com/MIT-LCP/mimic-omop/commit/ed7896ad87dabdcdbca6a3c363dc884ab0816d40

I'll see about adding some tests for this before I close the issue..

parisni commented 6 years ago

great Not reviewed anything but be carefull: that function is also used for labs values. Then this might break things there, and we may use different function

alistairewj commented 6 years ago

Okay, let me check!

alistairewj commented 6 years ago

Hm I see no differences but the ETL seems broken for negative lab values - it drops the negative.