Closed xlbryantx closed 3 years ago
I see this column present in the current version of the code just pushed. @xlbryantx can you confirm whether or not this is still an issue for you? And, if it is, can you let me know under what parameters you're running the code?
I also do not see this variable in the statics table (the patients table of all_hourly_data.h5 created with level 2 grouping).
I am using a recent version of the code (50b3077778f580bb110669541c054fe95ec41612), and I ran mimic_direct_extract.py with the default parameters.
Reading through mimic_direct_extract.py, it looks to me as if the data in the patients table of all_hourly_data.h5 comes directly from the statics SQL query and it then has the statics_schema applied using sanitize_df. I do not see a max_hours column in the statics schema, so it makes sense to me that this column is not appearing in the final data.
The mortality prediction baseline notebooks expect this column to be there, which is why I was looking for it.
Let me know if I should open a new issue or if re-opening this one is fine. I understand that these notebooks haven't been validated in a while and that there are other open issues relating to them.
Thanks @kheuton. I'll make sure somebody takes a look. It may be that that column reference in the mortality prediction baseline notebook should be removed, but I'm more concerned that I did see this column back in Sep. in the currently pushed default outputs...
Looking at the code more, I think I understand what is going on.
The max_hours column is added to the statics table as a side effect of the save_numerics function. If you run mimic_direct_extract.py without extracting numerics, you will save a statics table without the 'max_hours' column. This is what happened in my case: I ran mimic_direct_extract twice, and the first job hit a timeout when extracting notes, so the second was able to skip extracting numerics.
Yep, you're right @kheuton, that's what's going on. That is really not the desired API, so I'm going to close this issue and open a new bug to shift that line into the save_pop function (or even into the static extraction SQL query itself) to avoid this issue in the future. Thanks for looking into this!
So how did you extract max_hours then? Did you clean some of the tables and then run mimic_direct_extract.py again?
For myself, I create the column on-demand from the statics table like so:
to_hours = lambda x: max(0, x.days*24 + x.seconds // 3600)
statics['max_hours'] = (statics['outtime'] - statics['intime']).apply(to_hours)
This is simply the logic from the save_numerics function
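For anyone who wants to try this without the extracted data, here is a minimal self-contained sketch of the same calculation. The toy statics frame and its timestamps are made up for illustration, and the guard on the column name is my own addition; only the hour arithmetic comes from the snippet above.

```python
import pandas as pd


def to_hours(delta: pd.Timedelta) -> int:
    # Whole hours covered by the stay, floored at zero
    # (same arithmetic as the lambda above).
    return max(0, delta.days * 24 + delta.seconds // 3600)


# Toy stand-in for the real statics table; these values are invented.
statics = pd.DataFrame({
    "intime": pd.to_datetime(["2101-01-01 00:00:00", "2101-01-05 12:30:00"]),
    "outtime": pd.to_datetime(["2101-01-02 06:45:00", "2101-01-05 12:00:00"]),
})

# Only add the column if the extraction did not already provide it.
if "max_hours" not in statics.columns:
    statics["max_hours"] = (statics["outtime"] - statics["intime"]).apply(to_hours)

print(statics["max_hours"].tolist())  # [30, 0]
```

The second row has an outtime before its intime, which shows why the max(0, ...) floor is there: a negative stay becomes 0 hours instead of -1.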
For me, using a new output folder and re-running mimic_direct_extract.py worked. However, I am having trouble with the Jupyter notebooks' pre-processing tasks. Were you able to run Baselines for Mortality and LOS prediction - GRU-D.ipynb?
I was able to get the notebook working with a few small modifications. I didn't extensively document everything I changed, but it looks like I did at least these:
- Added the max_hours column to statics before creating the Ys
- Wrapped the X_mean variable in np.nan_to_num to deal with some nans I was seeing. I forget exactly why I had to do this
- Passed batch_size as an int
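A rough sketch of what the last two changes amount to, on made-up values. How the notebook actually computes X_mean and where it gets batch_size are not shown in this thread, so the array and the 32.0 below are placeholders only.

```python
import numpy as np

# Placeholder per-feature means; the NaN stands for a feature whose mean
# could not be computed. These numbers are invented, not from the notebook.
X_mean = np.array([2.0, np.nan, 0.5])

# np.nan_to_num replaces NaN with 0.0 by default, so later arithmetic
# on the means does not propagate NaNs.
X_mean_clean = np.nan_to_num(X_mean)

# A batch size that arrives as a float (placeholder value) is cast to int
# before it is passed on.
batch_size = int(32.0)

print(X_mean_clean.tolist(), batch_size)  # [2.0, 0.0, 0.5] 32
```

Replacing a NaN mean with 0 is a blunt fix, so it is worth checking which features end up with NaN means before relying on it.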
I executed 'mimic_direct_extract.py' and got the same file that the instructions describe. I read the statics information with statics = pd.read_hdf(LEVEL2, 'patients'), and the shape of 'statics' is (34472, 27), which lacks the 'max_hours' column. So, what does 'max_hours' mean?