YerevaNN / mimic3-benchmarks

Python suite to construct benchmark machine learning datasets from the MIMIC-III 💊 clinical database.
https://arxiv.org/abs/1703.07771

Advice for project.. #12

Closed · Bowiemb closed this issue 6 years ago

Bowiemb commented 7 years ago

@Harhro94 @Hrant-Khachatrian

I would love your advice on a conference paper that we are working on here at Oak Ridge National Laboratory.

We would love to contribute a time-series anomaly detection benchmark for ICU events. Basically, the idea is that we would predict what the next "event" would be for each patient stay, including lab values, diagnoses, and demographic information. After the model is fully trained, we would then compare this prediction to what actually happened in the time-series. If there is a larger-than-expected error, we would flag this event (preferably something specific in the event) as an anomaly. This technique is detailed here.

I imagine this should be fairly straightforward; however, I was curious to get your advice based on your experience creating LSTMs for this dataset in particular.

Thank you!

My best, Michael

P.S. We will, of course, be citing your fantastic work here.

alistairewj commented 7 years ago

After defining a suitable event (death, cardiac arrest, etc.), the biggest challenge I see in anomaly detection is defining the evaluation metric. If we alert 6 hours prior to the event, is that a true positive or a false positive? What about 18 hours? There are some proposed qualitative measures where you plot early alerting time vs. precision, but I don't think it's a solved problem. Worth considering!
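
One way to make that qualitative measure concrete is to sweep the maximum allowed lead time and recompute precision at each setting. The sketch below is only illustrative: the toy stay data, the alert/event timestamps, and the rule that an alert counts as a true positive only within the allowed window are assumptions, not anything defined in this thread.

```python
def precision_at_lead_time(alert_times, event_time, max_lead_hours):
    """Count an alert as a true positive if it fires no more than
    `max_lead_hours` before the event; otherwise it is a false positive.
    Times are hours since ICU admission (hypothetical convention)."""
    tp = sum(1 for t in alert_times if 0 <= event_time - t <= max_lead_hours)
    fp = len(alert_times) - tp
    return tp, fp

# Toy data: two stays with alert times and the time of the adverse event.
stays = [
    {"alerts": [5.0, 20.0, 41.5], "event": 48.0},
    {"alerts": [2.0, 30.0], "event": 36.0},
]

for max_lead in (6, 12, 18, 24):
    tp = fp = 0
    for stay in stays:
        t, f = precision_at_lead_time(stay["alerts"], stay["event"], max_lead)
        tp, fp = tp + t, fp + f
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    print(f"lead time <= {max_lead:2d}h: precision = {precision:.2f}")
```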

Bowiemb commented 7 years ago

The event definition would be simply the lab results, diagnoses, and demographic information.

For example, we would predict an event 'E' at time-step 'T' to be as follows: blood pressure = 38, glucose = 200, oxygen = 68, diagnosis = M05, etc.

Then, we would compute the error between the predicted 'E' and the true 'E'. If the error is greater than the average testing error (or, for less sensitivity, the maximum testing error), we would label the event an anomaly.
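
A minimal sketch of that thresholding rule, assuming per-timestep predictions and ground truth are available as numpy arrays; the function name and the mean-squared-error choice are illustrative, not part of the benchmark code:

```python
import numpy as np

def flag_anomalies(y_true, y_pred, reference_errors, rule="mean"):
    """Flag time steps whose prediction error exceeds a threshold derived
    from the errors observed on clean (non-anomalous) data.

    y_true, y_pred   : (timesteps, features) arrays for one patient stay
    reference_errors : 1-D array of per-timestep errors on the clean test set
    rule             : "mean" (more sensitive) or "max" (less sensitive)
    """
    errors = np.mean((y_true - y_pred) ** 2, axis=1)   # per-timestep MSE
    threshold = reference_errors.mean() if rule == "mean" else reference_errors.max()
    return np.where(errors > threshold)[0], errors

# Hypothetical usage:
# flagged_steps, errors = flag_anomalies(y_true, y_pred, clean_test_errors, rule="max")
```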

This would be one of a couple experimental implementations...

Hrant-Khachatrian commented 7 years ago

This is quite an interesting problem.

I think you can try to formulate it the following way. Train a model on the patients who were successfully treated in the ICU. Then see whether there are visible anomalies for the deceased patients.

I agree that the hardest part is proper evaluation. The first thing that comes to mind is the following. Split the training, validation and test sets into two parts based on the in-hospital mortality label. Train the model on train_0, and use validation_0 and validation_1 to choose anomaly thresholds. The evaluation metric can be the number of anomalies detected on test_1 minus the number of anomalies detected on test_0 (you want to maximize this difference). Then, as @alistairewj pointed out, you can add weights to the anomalies... the earlier you detect anomalies for test_1, the better.
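
A rough sketch of that evaluation, assuming the detector already produces a per-timestep error trace for each stay and the splits have been separated by the in-hospital mortality label; all names (count_anomalies, val0_errors, etc.) are placeholders:

```python
import numpy as np

def count_anomalies(errors_per_stay, threshold):
    """Number of stays with at least one time step above the threshold."""
    return sum(int((errors > threshold).any()) for errors in errors_per_stay)

def choose_threshold(val0_errors, val1_errors, candidates):
    """Pick the threshold that best separates survivors (split 0) from
    deceased patients (split 1) on the validation data."""
    return max(candidates,
               key=lambda t: count_anomalies(val1_errors, t)
                             - count_anomalies(val0_errors, t))

# Hypothetical usage:
# threshold = choose_threshold(val0_errors, val1_errors, np.linspace(0.1, 5.0, 50))
# score = count_anomalies(test1_errors, threshold) - count_anomalies(test0_errors, threshold)
```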

alistairewj commented 7 years ago

Ah, I see what you are suggesting. I think that, without more thought into exactly which variables you are planning to predict, it might not serve much of a purpose. In particular, I think being able to predict "expensive" measurements, like PaO2, is of interest, because any resultant algorithm would have practical clinical utility, but being able to predict peripheral oxygen saturation is less meaningful. What do you think?

Bowiemb commented 7 years ago

@Hrant-Khachatrian Excellent. With regard to evaluation, I was thinking of setting it up as a semi-supervised problem. For example, we would make the assumption that the MIMIC-III dataset is anomaly-free, and the LSTM would successfully learn to predict the next lab values for each hour a patient is in the ICU. Then, we would have a second, erroneous copy of MIMIC (that we would deliberately corrupt with a script). We would re-test the LSTM on the erroneous copy, and if the error between a prediction and the "true value" in the erroneous dataset is large enough (to be defined), then it would be labelled an anomaly (or at least an outlier). So with regard to evaluation, we could test the false positive / false negative rates on the erroneous dataset.
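
Since the corruption script knows exactly which time steps it modified, those false positive / false negative rates can be computed directly. A small sketch, with all names hypothetical:

```python
def fp_fn_rates(flagged_steps, corrupted_steps, n_timesteps):
    """flagged_steps   : time steps the detector labelled anomalous
    corrupted_steps : time steps the corruption script actually modified"""
    flagged, corrupted = set(flagged_steps), set(corrupted_steps)
    fp = len(flagged - corrupted)          # flagged but never corrupted
    fn = len(corrupted - flagged)          # corrupted but missed
    n_clean = n_timesteps - len(corrupted)
    fp_rate = fp / n_clean if n_clean else 0.0
    fn_rate = fn / len(corrupted) if corrupted else 0.0
    return fp_rate, fn_rate

# Hypothetical usage for one 48-hour stay:
# fpr, fnr = fp_fn_rates(flagged_steps, injected_steps, n_timesteps=48)
```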

@alistairewj That's a very interesting point. Initially, our goal for the anomaly detection would be to assure doctors that they are looking at accurate measurements. For example, a user story for a doctor might be: "As a doctor, I made a wrong diagnosis/prescription because the decision was based on undetected erroneous/anomalous data. Since manually combing through potentially thousands or millions of data points is expensive and time-consuming, I need a system that can automatically detect anomalies in the data."

alistairewj commented 7 years ago

Hmmm, I don't like the idea of artificially creating bad data that you then train an algorithm to detect. In the end you will just train a model to replicate the results of your noise algorithm.

Bowiemb commented 7 years ago

@alistairewj I'm sorry, I don't think I'm communicating this properly. The paper detailing this technique is here.

I wouldn't be training a model to detect bad data. I would be training a model on clean data to successfully predict the next step of the time series. Then, when the model is tested on data with errors, there will be a larger prediction error on the anomalous records than on the true records.

For example, let's say I train a model to count [1, 2, 3, 4, 5, ..., 100], and it can successfully predict that 2 comes after 1, 3 comes after 2, and so on. If that same model is then applied to a dataset of [1, 2, 3, 67, 4, 5, ..., 100], the model would predict that 4 comes after 3, but the actual value is 67. We would measure that error and label the '67' as an anomaly.
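
For concreteness, here is a minimal sketch of that counting example using a small Keras LSTM; the architecture, scaling, and threshold are all illustrative choices, not a prescription:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Train on the clean sequence 1..100: predict x[t+1] from x[t], values scaled to [0, 1].
seq = np.arange(1, 101, dtype="float32") / 100.0
X_train = seq[:-1].reshape(-1, 1, 1)      # (samples, timesteps=1, features=1)
y_train = seq[1:].reshape(-1, 1)

model = Sequential([LSTM(16, input_shape=(1, 1)), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=300, verbose=0)

# Reference errors on the clean sequence (the "maximum testing error" rule above).
clean_preds = model.predict(X_train, verbose=0).ravel()
threshold = np.abs(clean_preds - y_train.ravel()).max()

# Apply the trained model to a corrupted copy where 67 is inserted after 3.
corrupted = np.array([1, 2, 3, 67, 4, 5, 6, 7], dtype="float32") / 100.0
preds = model.predict(corrupted[:-1].reshape(-1, 1, 1), verbose=0).ravel()
errors = np.abs(preds - corrupted[1:])

# The transitions around the inserted 67 should stand out far above the threshold.
for value, err in zip(corrupted[1:] * 100, errors):
    flag = "  <-- anomaly" if err > threshold else ""
    print(f"value = {value:5.0f}   error = {err:.4f}{flag}")
```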

Bowiemb commented 7 years ago

Here's a very very basic proof of concept: https://github.com/ebegoli/ErrorProne/blob/master/sandbox/AD-LSTM%20Proof%20of%20Concept.ipynb

alistairewj commented 7 years ago

Yes, OK, that makes sense. I think I'm with you on the approach, but the context for the approach is different in my mind. I think you are looking to detect "bad data" and alert the clinician to it, whereas my feeling is that it makes more sense to talk about abnormal state, i.e. this measurement is drastically different from what we expected, and we should alert the clinician that the patient may be deteriorating.

Bowiemb commented 7 years ago

Correct. Admittedly, I could be missing something... wouldn't 'bad data' and 'abnormal states' both be anomalies? That being said, do you think the LSTM you've already created for decompensation could be used or tweaked to also pick out 'bad data'? The reason we've been focusing on 'bad data' is that our sponsors tell us it is a problem for them.

Eventually, it would be great if these kinds of techniques could be generalized sufficiently to alert clinicians to many kinds of 'anomalies', including bad data, abnormal states, anomalous clusters of cohorts within a cohort, and possibly even emerging trends within a population.

alistairewj commented 7 years ago

Yep, possible. I guess it depends on the data you are working with. Would be interesting to hear what kind of data your sponsors refer to.

Bowiemb commented 7 years ago

@alistairewj For the proof of concept, we would be working with the MIMIC data as structured by your scripts in this repo (assumed to be the clean data). Then, we would run a script to generate errors in a copy of the MIMIC data, which would serve as the test data.

However, eventually we are aiming for a robust model for this data.

Hrant-Khachatrian commented 7 years ago

My intuition is that it will be very hard to add "realistic" noise/errors to the data. It could be harder than anomaly detection :) Your goal is to detect real events that are not "healthy", and not random errors in the data, correct?

Hrant-Khachatrian commented 7 years ago

OK, @Bowiemb, I see you have (almost) answered my question above. So if you train on the full MIMIC data and test on a randomly corrupted copy of MIMIC, not only may your algorithm fail to detect "abnormal states", but your evaluation setup will also not let you know whether your model can find those states.

That is why I suggest training only on a relatively "healthy" subset of the data.

Bowiemb commented 7 years ago

@Hrant-Khachatrian The main focus would be data quality errors: missing values, switched values (e.g. height and weight), outliers, systematic bias (e.g. using a clock that is 10 minutes fast by accident), random bias (e.g. accidentally picking the wrong drop-down item or check-box), impossible values (e.g. a male-only or female-only ICD-9 code recorded for the wrong gender), etc.
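
As a rough illustration of the kind of corruption script being discussed, here is a sketch that injects a few of these error types into one of the per-episode timeseries CSVs produced by this repo's scripts. The column names (Hours, Height, Weight, Glucose), the file path, and the corruption fractions are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def corrupt_episode(df, frac=0.05):
    """Inject a few of the data-quality errors listed above into one episode
    timeseries. Returns the corrupted copy plus a mask of the changed cells."""
    out = df.copy()
    mask = pd.DataFrame(False, index=df.index, columns=df.columns)

    # Missing values: blank out a random fraction of Glucose measurements.
    idx = out.sample(frac=frac, random_state=1).index
    out.loc[idx, "Glucose"] = np.nan
    mask.loc[idx, "Glucose"] = True

    # Switched values: swap Height and Weight on a few rows.
    idx = out.sample(frac=frac, random_state=2).index
    out.loc[idx, ["Height", "Weight"]] = out.loc[idx, ["Weight", "Height"]].values
    mask.loc[idx, ["Height", "Weight"]] = True

    # Systematic bias: shift every timestamp by 10 minutes.
    out["Hours"] = out["Hours"] + 10 / 60.0
    mask["Hours"] = True

    # Outliers: multiply a few Glucose values by 10.
    idx = out.sample(frac=frac, random_state=3).index
    out.loc[idx, "Glucose"] = out.loc[idx, "Glucose"] * 10
    mask.loc[idx, "Glucose"] = True

    return out, mask

# Hypothetical usage on one benchmark episode file:
# df = pd.read_csv("data/root/12345/episode1_timeseries.csv")
# corrupted, changed = corrupt_episode(df)
```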

Edit

As a secondary focus, it would be nice to make this anomaly detection algorithm general enough to highlight potential "up-coding" as well. As you may know, this is more than a $2 billion problem.

Lastly, if it could highlight anomalous "un-healthy" states, that would be a bonus, but not our focus for this project.