MIT-LCP / mimic-code

MIMIC Code Repository: Code shared by the research community for the MIMIC family of databases
https://mimic.mit.edu
MIT License

Heights and Weights in MIMIC-IV v2.2 #1781

Open pszolovits opened 2 months ago

pszolovits commented 2 months ago

In pulling data about ICU patients in MIMIC-IV v2.2, I have noticed some data errors and peculiarities involving height and weight data.

Heights

One extreme height

| subject_id | hadm_id | stay_id | caregiver_id | charttime | storetime | itemid | value | valuenum | valueuom | warning |
|---|---|---|---|---|---|---|---|---|---|---|
| 10460364 | 28338230 | 32312517 | 40361 | 2146-04-11 23:15:00 | 2146-04-12 10:34:00 | 226707 | 825070 | 825070 | Inch | 0 |

Even Paul Bunyan is not 68,000 feet tall!

Most heights are duplicated between itemids 226707 (inches) and 226730 (cm).

For 33,700 of the 33,707 height measurements recorded in chartevents, each pair of inch and cm entries has identical metadata (subject_id, hadm_id, stay_id, charttime), and all but 69 of those pairs agree to within 1 cm when we convert inches to cm (i.e., inch * 2.54). None (except the extreme value noted above) differ by more than 2 cm.
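A minimal sketch of this pairing check in Python, for anyone who wants to reproduce it. It assumes the chartevents rows for the two height itemids have already been loaded as dictionaries keyed by the column names shown above. The function name, the tolerance parameter, and the toy rows are illustrative and are not taken from MIMIC.

```python
INCH_ITEMID = 226707  # height recorded in inches
CM_ITEMID = 226730    # height recorded in cm
CM_PER_INCH = 2.54


def compare_height_pairs(rows, tol_cm=1.0):
    """Pair inch and cm heights on (subject_id, hadm_id, stay_id, charttime).

    Returns (diffs, discordant, unmatched):
      diffs      - {key: |inch * 2.54 - cm|} for every matched pair
      discordant - keys whose difference exceeds tol_cm
      unmatched  - keys present for only one of the two itemids
    """
    inches, cms = {}, {}
    for r in rows:
        key = (r["subject_id"], r["hadm_id"], r["stay_id"], r["charttime"])
        if r["itemid"] == INCH_ITEMID:
            inches[key] = r["valuenum"]
        elif r["itemid"] == CM_ITEMID:
            cms[key] = r["valuenum"]
    matched = inches.keys() & cms.keys()
    diffs = {k: abs(inches[k] * CM_PER_INCH - cms[k]) for k in matched}
    discordant = sorted(k for k, d in diffs.items() if d > tol_cm)
    unmatched = sorted(inches.keys() ^ cms.keys())
    return diffs, discordant, unmatched


# Synthetic rows (not real MIMIC data), only to exercise the function.
def _row(subject_id, stay_id, itemid, valuenum):
    return {"subject_id": subject_id, "hadm_id": subject_id * 10,
            "stay_id": stay_id, "charttime": "2150-01-01 08:00:00",
            "itemid": itemid, "valuenum": valuenum}

toy_rows = [
    _row(1, 100, INCH_ITEMID, 70.0), _row(1, 100, CM_ITEMID, 178.0),  # agree
    _row(2, 200, INCH_ITEMID, 65.0), _row(2, 200, CM_ITEMID, 170.0),  # ~4.9 cm apart
    _row(3, 300, INCH_ITEMID, 60.0), _row(3, 301, CM_ITEMID, 152.0),  # stay_id mismatch
]
diffs, discordant, unmatched = compare_height_pairs(toy_rows)
```

Rows whose stay_id differs between the inch and cm entries end up in `unmatched`, which is how the 7 mismatched pairs described below would surface.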

For the 7 pairs of height measurements where the metadata don't match, the difference is only in stay_id, which appears to be assigned differently between the inch and cm data. The storetimes also differ, though I did not check whether they ever differ in other cases. The inch and cm values are compatible even for these cases.

Since unit conversions are easy, I wonder if it might be OK to drop either the inch or cm measurements, since they are almost totally equivalent. I don't know if BIDMC records the original data in one or the other unit, but that might be the right one to keep. Perhaps the recording convention changed during the data collection period.

Weights

Many different sources of weights

It appears that weights are recorded under five different itemids, four of them as part of chartevents and one as part of inputevents. In fact, because every input event has an associated weight, this source provides the vast majority of data:

| itemid | n | label |
|---|---|---|
| 224639 | 202,491 | Daily Weight |
| 226512 | 67,826 | Admission Weight (Kg) |
| 226531 | 119,251 | Admission Weight (lbs.) |
| 226846 | 7,223 | Feeding Weight |
| any | 5,159,302 | from inputevents' patientweight, de-duplicated |

Extreme weight values

Although there are extreme variations in human weight, some of the recorded weights are highly likely to be artifacts. For example, each source of weight data contains maximum weights over 5,000 kg and minimum weights of 1 kg or less (or negative!).

Adults tend not to weigh less than 25 kg or more than 300 kg, though rare exceptions may occur. Here are the counts of such recorded exceptions:

| itemid | label | < 25 kg | > 300 kg |
|---|---|---|---|
| 224639 | Daily Weight | 667 | 99 |
| 226512 | Admission Weight (Kg) | 111 | 21 |
| 226531 | Admission Weight (lbs.) | 235 | 48 |
| 226846 | Feeding Weight | 108 | 1 |
| any | from inputevents | 15,542 | 4,507 |

It seems reasonable just to drop such extreme values of weight.
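A minimal sketch of such a filter in Python. It assumes that itemid 226531 holds values in pounds (as its label suggests) and that every other source is already in kg. The function names and the sample values are illustrative only.

```python
LB_PER_KG = 2.20462          # pounds per kilogram
LBS_ITEMIDS = {226531}       # assumption: Admission Weight (lbs.) is stored in pounds


def to_kg(itemid, valuenum):
    """Convert a recorded weight to kg, based on the itemid's assumed unit."""
    return valuenum / LB_PER_KG if itemid in LBS_ITEMIDS else valuenum


def plausible_weights(rows, lo_kg=25.0, hi_kg=300.0):
    """Keep (itemid, kg) pairs whose weight lies within [lo_kg, hi_kg]."""
    kept = []
    for itemid, valuenum in rows:
        if valuenum is None:
            continue  # nothing recorded
        kg = to_kg(itemid, valuenum)
        if lo_kg <= kg <= hi_kg:
            kept.append((itemid, round(kg, 1)))
    return kept


# Synthetic values (not real MIMIC data).
sample = [
    (226512, 80.0),    # kept
    (226531, 176.0),   # 176 lb is about 79.8 kg, kept
    (224639, 5000.0),  # dropped: above 300 kg
    (226846, -3.0),    # dropped: negative
    (226531, 40.0),    # 40 lb is about 18.1 kg, dropped
]
kept = plausible_weights(sample)
```

The 25 kg and 300 kg bounds are the ones proposed above; they are parameters so that a different cutoff can be tried without changing the code.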

Issues with weights from inputevents

A very large number of weights derived from inputevents (8,978,893) are duplicated (same person, same time), probably because multiple inputs were ordered simultaneously, each recorded with a weight. Eliminating such duplicates leaves 5,159,302 weights for 72,690 ICU stays. A significant number (2,836) are recorded as implausibly high, namely > 300 kg. There are very few patients with such weights (> 660 lbs), and I suspect that nearly all such recordings are artifacts.
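The de-duplication step can be sketched as follows. This assumes each inputevents row has been reduced to a tuple of (subject_id, stay_id, starttime, patientweight). Using starttime as the "same time" key is an assumption, and the function name and sample tuples are illustrative.

```python
def dedupe_weights(events):
    """Drop exact repeats of (subject_id, stay_id, starttime, patientweight).

    Order of first appearance is preserved. Two rows with the same person and
    time but different weights are both kept, so conflicts stay visible.
    """
    seen = set()
    unique = []
    for ev in events:
        if ev not in seen:
            seen.add(ev)
            unique.append(ev)
    return unique


# Synthetic events (not real MIMIC data).
events = [
    (1, 100, "2150-01-01 08:00:00", 80.0),
    (1, 100, "2150-01-01 08:00:00", 80.0),  # second input ordered at the same time
    (1, 100, "2150-01-01 12:00:00", 80.0),
    (2, 200, "2150-01-01 08:00:00", 65.0),
]
unique_events = dedupe_weights(events)
```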

When I look at the consistency of these weights, most are consistent within an ICU stay, but some are highly inconsistent. Of these 72,690 stays, in 69,171, the minimum and maximum weights from inputevents are identical. However, for 3,088 stays, there are weight differences > 1 kg, where some are quite large:

[image]

Presumably, the individual outliers with huge recorded weight changes during a stay are artifacts. Surprisingly, however, looking just at the more plausible end of this spectrum, we see many weight spans (max minus min weight) of up to 100 kg (or more), most of which are still almost certainly artifacts. I am not sure what to make of this. For now, I am leaving these data as recorded except for having eliminated the low- and high-weight outliers. There is so much data that the bad data may be washed out.
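A minimal sketch of the per-stay span computation (max minus min weight), assuming the de-duplicated weights are available as (stay_id, weight_kg) pairs. The function names and sample values are illustrative.

```python
def weight_span_by_stay(weights):
    """Return {stay_id: max weight - min weight} from (stay_id, kg) pairs."""
    lo, hi = {}, {}
    for stay_id, kg in weights:
        lo[stay_id] = min(kg, lo.get(stay_id, kg))
        hi[stay_id] = max(kg, hi.get(stay_id, kg))
    return {stay_id: hi[stay_id] - lo[stay_id] for stay_id in lo}


def summarize_spans(spans, threshold_kg=1.0):
    """Count stays with a constant weight and stays whose span exceeds threshold_kg."""
    constant = sum(1 for d in spans.values() if d == 0)
    large = sum(1 for d in spans.values() if d > threshold_kg)
    return constant, large


# Synthetic weights (not real MIMIC data).
weights = [
    (1, 80.0), (1, 80.0),    # constant within the stay
    (2, 70.0), (2, 70.5),    # small change
    (3, 60.0), (3, 160.0),   # implausible 100 kg span
]
spans = weight_span_by_stay(weights)
constant, large = summarize_spans(spans)
```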

Correlations among average weights per stay

If we look at the mean of weights recorded during each ICU stay, they are highly correlated except for "Feeding Weight". The daily weights are "only" at ~0.95, whereas the inputs and admission weights are nearly at 1.00.

| | 0 | 224639 | 226512 | 226531 | 226846 |
|---|---|---|---|---|---|
| 0 | 1.000 | 0.949 | 0.996 | 0.997 | 0.699 |
| 224639 | 0.949 | 1.000 | 0.954 | 0.949 | 0.669 |
| 226512 | 0.996 | 0.954 | 1.000 | 0.997 | 0.703 |
| 226531 | 0.997 | 0.949 | 0.997 | 1.000 | 0.698 |
| 226846 | 0.699 | 0.669 | 0.703 | 0.698 | 1.000 |

Or, visually:

[image]
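A minimal sketch of how such a pairwise correlation could be computed in Python. It assumes a Pearson correlation over the stays that have a mean weight from both sources, which may not be exactly how the numbers above were produced. The function names, the "inputevents" key, and the sample means are illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (nan if undefined)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # one side has no variance
    return sxy / sqrt(sxx * syy)


def correlate_sources(mean_by_source, a, b):
    """Correlate per-stay mean weights of sources a and b over shared stays."""
    shared = sorted(mean_by_source[a].keys() & mean_by_source[b].keys())
    xs = [mean_by_source[a][s] for s in shared]
    ys = [mean_by_source[b][s] for s in shared]
    return pearson(xs, ys), len(shared)


# Synthetic per-stay means in kg, {source: {stay_id: mean}} (not real MIMIC data).
means = {
    "inputevents": {1: 80.0, 2: 65.0, 3: 102.0, 4: 55.0},
    226512: {1: 80.0, 2: 65.0, 3: 102.0},
    226846: {1: 70.0, 2: 66.0, 4: 60.0},
}
r_admit, n_admit = correlate_sources(means, "inputevents", 226512)
r_feed, n_feed = correlate_sources(means, "inputevents", 226846)
```

Because each pair of sources shares a different subset of stays, the entries of the matrix are not all computed over the same stays, which is worth keeping in mind when comparing them.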

alistairewj commented 2 months ago

> Even Paul Bunyan is not 68,000 feet tall!

😂

There's also the patient with a blood pressure higher than the atmospheres at the Mariana Trench. Forgot to take their lisinopril I assume :)

> Since unit conversions are easy, I wonder if it might be OK to drop either the inch or cm measurements, since they are almost totally equivalent. I don't know if BIDMC records the original data in one or the other unit, but that might be the right one to keep. Perhaps the recording convention changed during the data collection period.

Probably. I hadn't realized it, but Metavision stores a computed value for the other unit. I think dropping the computed value makes sense. I think I've done this once and forgotten, but I meant to look for heaping in the data to determine which one was the "original" measurement (e.g. if inches heap and cm have decimals, then the original unit is inches). Still, I'm pretty sure the unit of documentation is inches.
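A rough sketch of that heaping check in Python, assuming "heaping" is approximated by the share of values that are whole numbers. This is a simple heuristic rather than an established test, and the function names and sample values are illustrative.

```python
def fraction_whole(values, tol=1e-6):
    """Share of non-missing values that are (numerically) whole numbers."""
    vals = [v for v in values if v is not None]
    if not vals:
        return float("nan")
    return sum(abs(v - round(v)) < tol for v in vals) / len(vals)


def likely_original_unit(inch_values, cm_values):
    """Guess the documented unit as the one that heaps more on whole numbers."""
    frac_in = fraction_whole(inch_values)
    frac_cm = fraction_whole(cm_values)
    if frac_in > frac_cm:
        return "inches"
    if frac_cm > frac_in:
        return "cm"
    return "undetermined"


# Synthetic heights (not real MIMIC data): whole inches, computed cm.
inch_values = [70.0, 65.0, 72.0, 68.5]
cm_values = [177.8, 165.1, 182.88, 173.99]
guess = likely_original_unit(inch_values, cm_values)
```

If the computed unit is itself rounded to whole numbers when stored, both fractions will be high and this heuristic will not separate them. In that case, checking which column is reproduced exactly by converting and rounding the other would be a more informative test.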

> Issues with weights from inputevents
>
> Presumably, the individual outliers with huge recorded weight changes during a stay are artifacts. Surprisingly, however, looking just at the more plausible end of this spectrum, we see many weight spans (max minus min weight) of up to 100 kg (or more), most of which are still almost certainly artifacts. I am not sure what to make of this. For now, I am leaving these data as recorded except for having eliminated the low- and high-weight outliers. There is so much data that the bad data may be washed out.

Interesting, I never noticed that. Certainly a swing of 10 kg is very reasonable due to fluid balance changes, but 100 kg seems... less reasonable?

> Correlations among average weights per stay
>
> If we look at the mean of weights recorded during each ICU stay, they are highly correlated except for "Feeding Weight". The daily weights are "only" at ~0.95, whereas the inputs and admission weights are nearly at 1.00.

Ah I had always wondered if daily weight was populated from admit or daily, and I guess you've found it's from admit. All this does seem to justify the choice to only include admit/daily weight in the weight durations query, which tries to assign start/stop times of weight to patients throughout their ICU stay: https://github.com/MIT-LCP/mimic-code/blob/main/mimic-iv/concepts/demographics/weight_durations.sql