EmmaRocheteau / TPC-LoS-prediction

This repository contains the code used for Temporal Pointwise Convolutional Networks for Length of Stay Prediction in the Intensive Care Unit (https://dl.acm.org/doi/10.1145/3450439.3451860).
MIT License

performance #1

Closed vztu closed 3 years ago

vztu commented 3 years ago

Hi Emma, great work!

I am wondering why your R2 performance is much lower than the results reported in this paper: https://github.com/mostafaalishahi/eICU_Benchmark

BTW, what is the current (reliable) SOTA performance?

EmmaRocheteau commented 3 years ago

Great question!

I don't think there is such a thing as SOTA on the metrics for a particular dataset. I'll explain what I mean: the metrics (especially MSE and R2, and to a lesser extent MAD) are very sensitive to the particular preprocessing decisions you make. This is because the length of stay (LoS) distribution has significant positive skew - there are far more short-stay patients than long-stay patients, and the tail end of the distribution is very long, with some patients staying up to 100 days!

You'll notice the MIMIC results seem a fair bit "worse" than the eICU ones (especially in the metrics I highlighted), and that's because the skew is more extreme. In the MIMIC cohort, 8.65% of patients stay longer than 10 days, whereas only 4.71% of patients do in eICU. The reason this manifests in MSE, R2 and MAD is that they penalise absolute rather than proportional error. Proportional error makes more sense when there is positive skew like this. If you imagine we have a prediction error of 5 days for a particular patient - in the context of a 2-day stay, that seems quite bad, whereas in a 30-day stay the same error seems quite good!
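To make that concrete, here is a toy illustration (made-up numbers, not code from this repository): both patients get the same 5-day error, so an absolute-error metric like MSE treats them identically, while a proportional-style metric does not.

```python
# Toy example, not code from this repository.
true_los = [2.0, 30.0]   # days: a 2-day stay and a 30-day stay
pred_los = [7.0, 35.0]   # both predictions are wrong by exactly 5 days

for y, y_hat in zip(true_los, pred_los):
    print(f"true={y:4.0f}d  squared error={(y_hat - y) ** 2:5.1f}  "
          f"absolute % error={abs(y_hat - y) / y:5.1%}")
# squared error: 25.0 for both; absolute % error: 250% vs ~17%
```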

Therefore, if you have this level of positive skew in the data, a few patients with a long LoS will show up as huge errors in MSE, and they will heavily drag down the R2 fit.
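Here is a small synthetic sketch of that effect (an invented lognormal cohort and an invented model with roughly 30% proportional error for every patient - not the real eICU/MIMIC data or any model from the paper). Even though this fake model is "equally good" for everyone in proportional terms, the small long-stay tail ends up dominating the squared-error total, whereas a proportional-style metric such as mean squared log error spreads the contribution much more evenly:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, positively skewed "LoS" in days (a stand-in for a real cohort)
true_los = rng.lognormal(mean=0.7, sigma=1.0, size=20000)
# Fake predictions with roughly 30% proportional error for every patient
pred_los = true_los * np.exp(rng.normal(scale=0.3, size=true_los.shape))

sq_err = (true_los - pred_los) ** 2                          # per-patient MSE terms
msle_terms = (np.log1p(true_los) - np.log1p(pred_los)) ** 2  # per-patient MSLE terms

tail = true_los > 10
print(f"patients staying > 10 days:            {tail.mean():.1%}")
print(f"their share of total squared error:    {sq_err[tail].sum() / sq_err.sum():.1%}")
print(f"their share of total squared log error:{msle_terms[tail].sum() / msle_terms.sum():.1%}")
```

In runs like this, roughly 5% of the patients typically account for the majority of the total squared error, which is exactly how a handful of very long stays can swamp MSE and drag down R2.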

The reason for the discrepancy with the above paper is probably that their preprocessing decisions have resulted in a cohort with much less skew. Maybe they have capped the maximum LoS they are predicting, for example. When you compare the performance of the same model - a simple LSTM - on their cohort and mine, the results look very different. This means that the answer is in the dataset, not the models. My preprocessing is fairly similar to Harutyunyan et al. 2019 (except that I extract a lot more features), and you'll notice my LSTM and CW LSTM results on MIMIC are freakishly similar to their regression set-up, even though I'm using MIMIC-IV, not MIMIC-III! The reason I chose to be as broad as possible in the preprocessing (selecting everyone who stayed more than 5 hours) was to resemble the real bed management task as closely as possible.
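As a rough illustration of how much the cohort alone moves these numbers, here is another synthetic sketch (again invented data and an invented model - the "model" just shrinks predictions towards the mean in log space, which is roughly how regressors trained on skewed targets tend to behave; it is not their pipeline or mine). Scoring the exact same predictions on a broad cohort versus a cohort with the long stays removed gives very different headline MSE and R2:

```python
import numpy as np

rng = np.random.default_rng(0)
log_true = rng.normal(loc=0.7, scale=1.0, size=20000)   # log of a synthetic LoS in days
true_los = np.exp(log_true)                             # positively skewed cohort
# Invented model: shrinks towards the mean in log space, so it under-predicts long stays
pred_los = np.exp(0.7 + 0.6 * (log_true - 0.7)
                  + rng.normal(scale=0.3, size=log_true.shape))

def report(name, mask):
    y, y_hat = true_los[mask], pred_los[mask]
    mse = np.mean((y - y_hat) ** 2)
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    print(f"{name}: n={mask.sum():5d}, MSE={mse:5.1f}, R2={r2:.2f}")

report("broad cohort (everyone)     ", np.ones(true_los.shape, dtype=bool))
report("curated cohort (LoS <= 10 d)", true_los <= 10)
```

Same predictions, very different numbers - which is why I think cross-paper comparisons of these metrics mostly reflect the cohort rather than the model.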

To summarise, I would say that the important thing is not the maximum performance I could elicit from this dataset - I could curate an easy dataset for these tasks if I wanted; I could just select all the short-stay patients and predict only on those. Much more important is which model is consistently doing relatively better than others on the same data. The "SOTA" is therefore the model, not the metrics.

EmmaRocheteau commented 3 years ago

I should add that I didn't have access to the above repository when I started the project (it wasn't yet public), so matching the data wasn't possible. MIMIC-IV is also very new, so there isn't a benchmark paper available for that yet. It would be interesting to try TPC on their cohort at some point - I think it would do very well.

vztu commented 3 years ago

Thank you very much, Emma! I didn't expect the pre-processing to matter that much, especially since there are no standards for this. Your explanation really helps a lot!