YerevaNN / mimic3-benchmarks

Python suite to construct benchmark machine learning datasets from the MIMIC-III 💊 clinical database.
https://arxiv.org/abs/1703.07771
MIT License

Question about categorical channels in LSTM input features #113

Open EduCasta opened 3 years ago

EduCasta commented 3 years ago

Dear authors, contributors, and maintainers of the repository,

First of all, thank you for making this repository available. I am currently working on a university project that uses this benchmark for the mortality prediction task with the LSTM models.

I have a question about the final 76 features that you use for your LSTM model benchmarks, specifically about the number of channels used for the categorical variables. In the file "discretizer_config.json", which is used in the discretization step, values that appear to be equivalent are treated as separate categories. For example, the possible values for the Glasgow Coma Verbal Response include "No Response-ETT", "No Response", "1 No Response", and "1.0 ET/Trach". Each of these has its own feature channel, yet they all seem to represent the same value: "No Response".

In fact, they all share the same value in the "channel_info.json" file, but this mapping does not seem to be applied during discretization. The only reference to this file that I found is in the function "mimic3models/common_utils.extract_features_from_rawdata", which is used for the logistic regression but not in the discretization step of the LSTM data preprocessing.
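To make the point concrete, here is a minimal sketch of the difference I am describing. It is not the repository's code: the four "No Response" strings are the ones quoted above, while the extra "5 Oriented" entry, the integer codes, and the function names are assumptions made only for illustration.

```python
# Illustrative values only: the four "No Response" strings come from the
# question above; "5 Oriented" and the integer codes are assumptions.
POSSIBLE_VALUES = [
    "No Response-ETT",
    "No Response",
    "1 No Response",
    "1.0 ET/Trach",
    "5 Oriented",
]
VALUE_TO_CODE = {
    "No Response-ETT": 1,
    "No Response": 1,
    "1 No Response": 1,
    "1.0 ET/Trach": 1,
    "5 Oriented": 5,
}


def one_hot_by_string(value):
    """One channel per distinct string (the behaviour described above)."""
    return [1 if value == v else 0 for v in POSSIBLE_VALUES]


def one_hot_by_code(value):
    """One channel per distinct code, so equivalent strings share a channel."""
    codes = sorted(set(VALUE_TO_CODE.values()))
    return [1 if VALUE_TO_CODE[value] == c else 0 for c in codes]


print(one_hot_by_string("No Response"))    # 5 channels
print(one_hot_by_string("1 No Response"))  # 5 channels, different position
print(one_hot_by_code("No Response"))      # 2 channels
print(one_hot_by_code("1 No Response"))    # 2 channels, same position
```

With the per-string encoding, "No Response" and "1 No Response" activate different channels; with the per-code encoding they activate the same one, and the number of channels drops accordingly.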

The same thing happens for almost all of the categorical variables. Could you clarify whether this is intended?

Thank you very much in advance for your time.

Best regards, Eduardo.