MLforHealth / MIMIC_Extract

MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III
MIT License

What does statics.max_hours mean? I can not get this column. #26

Closed xlbryantx closed 3 years ago

xlbryantx commented 4 years ago

I executed 'mimic_direct_extract.py' and got the same files the instructions describe. I read the statics information with: statics = pd.read_hdf(LEVEL2, 'patients'). The shape of 'statics' is (34472, 27), and it lacks the 'max_hours' column. So, I want to know what 'max_hours' means and how to get it.

mmcdermott commented 4 years ago

I see this column present in the current version of the code just pushed. @xlbryantx can you confirm whether or not this is still an issue for you? And, if it is, can you let me know under what parameters you're running the code?

kheuton commented 3 years ago

I also do not see this variable in the statics table (the patients table of all_hourly_data.h5 created with level 2 grouping).

I am using a recent version of the code (50b3077778f580bb110669541c054fe95ec41612), and I ran mimic_direct_extract.py with the default parameters.

Reading through mimic_direct_extract.py, it looks to me as if the data in the patients table of all_hourly_data.h5 comes directly from the statics SQL query and it then has the statics_schema applied using sanitize_df. I do not see a max_hours column in the statics schema, so it makes sense to me that this column is not appearing in the final data.

The mortality prediction baseline notebooks expect this column to be there, which is why I was looking for it.

Let me know if I should open a new issue or if re-opening this one is fine. I understand that these notebooks haven't been validated in a while and that there are other open issues relating to them.

mmcdermott commented 3 years ago

Thanks @kheuton. I'll make sure somebody takes a look. It may be that that column reference in the mortality prediction baseline notebook should be removed, but I'm more concerned that I did see this column back in Sep. in the currently pushed default outputs...

kheuton commented 3 years ago

Looking at the code more, I think I understand what is going on.

The max_hours column is added to the statics table as a side effect of the save_numerics function. If you run mimic_direct_extract.py without extracting numerics, you will save a statics table without the 'max_hours' column. This is what happened in my case: I ran mimic_direct_extract.py twice, and the first job hit a timeout when extracting notes, so the second was able to skip extracting numerics.

mmcdermott commented 3 years ago

Yep, you're right @kheuton, that's what's going on. That is really not the desired API, so I'm going to close this issue and open a new bug to shift that line into the save_pop function (or even into the static extraction SQL query itself) to avoid this issue in the future. Thanks for looking into this!
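
Until that change lands, a user-side check can guard against a statics table that was saved without the column. The following is only a minimal sketch, not code from the repository: the helper name ensure_max_hours and the toy rows are hypothetical, and it assumes 'intime' and 'outtime' are datetime columns with no missing values.

```python
import pandas as pd


def ensure_max_hours(statics: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `statics` that is guaranteed to have 'max_hours'.

    Hypothetical helper: assumes 'intime' and 'outtime' are datetime
    columns without missing values.
    """
    if "max_hours" in statics.columns:
        # Column already present (numerics were extracted); nothing to add.
        return statics.copy()
    stay = statics["outtime"] - statics["intime"]
    # Whole hours in the stay, floored, with negative durations clamped to 0.
    hours = (stay.dt.total_seconds() // 3600).clip(lower=0).astype(int)
    return statics.assign(max_hours=hours)


# Toy rows (made-up timestamps) standing in for the real statics table.
demo = pd.DataFrame({
    "intime": pd.to_datetime(["2101-01-01 00:00", "2101-01-02 12:00"]),
    "outtime": pd.to_datetime(["2101-01-03 05:30", "2101-01-02 11:00"]),
})
demo = ensure_max_hours(demo)
print(demo["max_hours"].tolist())
```

Calling the helper a second time leaves the existing column untouched, so it is safe to run regardless of which extraction path produced the file.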

rishabhrrk commented 3 years ago

So how did you extract max_hours then? Did you clean some of the tables and then run mimic_direct_extract.py again?

kheuton commented 3 years ago

For myself, I create the column on-demand from the statics table like so:

to_hours = lambda x: max(0, x.days*24 + x.seconds // 3600)
statics['max_hours'] = (statics['outtime'] - statics['intime']).apply(to_hours)

This is simply the logic from the save_numerics function.
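
To make the arithmetic concrete, here is a small worked sketch of the same expression using only the standard library. The sample durations are made up; the point is that the result is floored to whole hours and never goes below zero. It relies on datetime.timedelta (which pandas.Timedelta subclasses) normalizing negative durations to a negative day count plus non-negative seconds.

```python
from datetime import timedelta


def to_hours(delta: timedelta) -> int:
    """Whole hours contained in `delta`, floored and clamped at zero."""
    return max(0, delta.days * 24 + delta.seconds // 3600)


# Made-up durations standing in for (outtime - intime).
examples = {
    "2 days 5 h 30 min": timedelta(days=2, hours=5, minutes=30),
    "45 min": timedelta(minutes=45),
    "outtime 1 h before intime": timedelta(hours=-1),
}

results = {label: to_hours(delta) for label, delta in examples.items()}
print(results)
# {'2 days 5 h 30 min': 53, '45 min': 0, 'outtime 1 h before intime': 0}
```

The last case shows why the max(0, ...) is there: a stay whose outtime precedes its intime would otherwise yield -1.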

rishabhrrk commented 3 years ago

For me, using a new output folder and re-running mimic_direct_extract.py worked. However, I am having trouble with the pre-processing steps in the Jupyter notebooks. Were you able to run Baselines for Mortality and LOS prediction - GRU-D.ipynb?

kheuton commented 3 years ago

I was able to get the notebook working with a few small modifications. I didn't extensively document everything I changed, but it looks like I did at least these:

  1. Added the missing max_hours column to statics before creating the Ys
  2. In the GRU-D cell, I had to wrap the creation of the X_mean variable in np.nan_to_num to deal with some nans I was seeing. I forget exactly why I had to do this
  3. Also in the GRU-D cell, I had to cast batch_size as an int
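
Items 2 and 3 can be illustrated in isolation. This is only a sketch on toy data, not the notebook's code: the array X, the float batch size, and the sample count are hypothetical, and a feature that is never observed is just one plausible source of the nans (the thread does not confirm the actual cause).

```python
import warnings

import numpy as np

# Toy matrix (rows = stays, columns = features). The second feature is
# never observed, which is one way a nan can end up in a per-feature mean.
X = np.array([[1.0, np.nan],
              [3.0, np.nan]])

with warnings.catch_warnings():
    # nanmean warns about the all-nan column; silence it for the demo.
    warnings.simplefilter("ignore", category=RuntimeWarning)
    raw_mean = np.nanmean(X, axis=0)  # second entry is nan

# Item 2: replace any remaining nan in the mean with 0.0.
X_mean = np.nan_to_num(raw_mean)

# Item 3: a batch size that arrives as a float (hypothetical value) must be
# cast before it is used where an integer is required, such as range().
batch_size = 1e2
batch_size = int(batch_size)
batch_starts = list(range(0, 250, batch_size))

print(X_mean.tolist(), batch_starts)
```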