EmmaRocheteau / TPC-LoS-prediction

This repository contains the code used for Temporal Pointwise Convolutional Networks for Length of Stay Prediction in the Intensive Care Unit (https://dl.acm.org/doi/10.1145/3450439.3451860).
MIT License

Question regarding masked datafields in timeseries.csv processed file #8

Closed KinaraPandya closed 2 years ago

KinaraPandya commented 2 years ago

Hello Emma,

I ran the preprocessing scripts on the original eICU dataset and noticed that data fields in the timeseries.csv file have a "_mask" suffix, for example "temperature_mask" and "total protein_mask". Can you please help me understand the reason for creating masked data fields in the processed timeseries.csv file?

Best, Kinara Pandya

EmmaRocheteau commented 2 years ago

Hello! I'm very sorry for how late this reply is. Hopefully it made sense to you in the end, but I'll answer for the benefit of others who may find this issue: the mask variables are explained in section 4.1 of the paper.

"To help the model cope with this missing data, we forward-filled over the gaps. This is more realistic than interpolation as the clinician would only have the most recent value. We then added ‘decay indicators’ to specify where the data is stale. The decay was calculated as 0.75𝑗 , where 𝑗 is the time since the last recording. This is similar in spirit to the masking used by Che et al."

So essentially they are indicator variables, starting from 1 if the corresponding variable has just been updated (so temperature_mask will be 1 if temperature has been updated), and then they "decay" towards 0 as the data becomes more stale. Hopefully that makes sense!
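For anyone who wants to see the idea concretely, here is a minimal sketch of forward-filling with a decay indicator. It is not the repository's preprocessing code: the function name `forward_fill_with_decay` is hypothetical, it assumes one entry per time step (so j counts steps since the last recording), and it assumes the mask is 0 before the first recording, which the quoted passage does not specify.

```python
import math


def forward_fill_with_decay(values, decay_rate=0.75):
    """Forward-fill missing entries and return a decay indicator per step.

    Hypothetical helper, not taken from the repository.
    `values` holds one entry per time step; None or NaN marks a missing value.
    """
    filled, mask = [], []
    last_value = None
    steps_since_update = None  # None until the first recording is seen

    for value in values:
        is_missing = value is None or (isinstance(value, float) and math.isnan(value))
        if not is_missing:
            last_value = value
            steps_since_update = 0
        elif steps_since_update is not None:
            steps_since_update += 1

        filled.append(last_value)
        # Assumption: the mask is 0 while nothing has been recorded yet.
        if steps_since_update is None:
            mask.append(0.0)
        else:
            mask.append(decay_rate ** steps_since_update)

    return filled, mask


temperature = [None, 37.2, None, None, 38.0, None]
temperature_filled, temperature_mask = forward_fill_with_decay(temperature)
print(temperature_filled)
print(temperature_mask)
```

In this toy series the mask is 1 at the steps where temperature is recorded and shrinks by a factor of 0.75 for every further step the forward-filled value is carried.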