btheodorou99 / HALO_Inpatient

No file such as "labelProbs.pkl" #16

Open sweta-lab opened 6 days ago

sweta-lab commented 6 days ago

I am trying to run the continuous-variables part of the repository, but there is no file named "labelProbs.pkl", and none of the other files generate it. Am I missing something? The line that loads it is:

labelProbs = pickle.load(open('./discretized_data/labelProbs.pkl', 'rb'))

btheodorou99 commented 5 days ago

My apologies for that. Those are supposed to be generated before training; I have just added the relevant code.

sweta-lab commented 5 days ago

Thank you very much! Weirdly, I was still able to generate the results without the labelProbs.pkl file. On another note, could you please help me understand the structure of the generated results file?

  1. I have multiple dictionaries containing the keys 'visits' and 'labels'. How can I tell which visits belong to which subject?

  2. The 'visits' key contains a list of ascending numbers such as [702, 715, 1224, 1595, 1748, 2997, 3061, 3070, 3714, 3881, 3950, 4091] and so on. I am guessing these are timestamps, but I am unsure whether they are hours or days.

  3. Also, the next few lists contain diagnosis codes (e.g., 10200) and lab test names and values, but I am unsure how to map the tests to the values since the lengths of consecutive lists don't match.

  4. How can I map the 'labels' to the 'visits'? Their lengths don't match. And what do the 'labels' signify?

Thank you in advance!

btheodorou99 commented 5 days ago

Yeah, I think it is not actually used, because for the experiments we just create an unconditioned copy of the training dataset (so loading the probabilities was an artifact of other exploration that I forgot to remove).

As for your other questions, each dictionary consisting of a labels vector and a visits list is a single subject. The labels are static information (so not a sequence meant to mirror the visits) such as patient phenotypes, demographic info, etc. The visits (depending on whether you have already discretized or not) are then a list of separate visit time steps, each of which is a tuple. The tuple should have 4 elements:

  1. a list of variable indices (which correspond to diagnosis codes),

  2. a mask of the labs that were conducted,

  3. the value indices of those labs (because the model operates over discretized labs, which are converted into binary variables), and

  4. an index for the time span since the previous visit (which again is binarized).

After running the discretized conversion, the structure will change slightly, with all the discretized variables being sampled from. Hopefully this helps, but if you have any further questions, feel free to copy a dictionary example into your follow-up comment so that I can see exactly which stage it is at and then point directly to the different structures.
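To make the per-subject layout concrete, here is a minimal Python sketch, assuming the generated file unpickles to a list of such dictionaries. The function name `summarize_subjects`, the toy values, and the use of list position as a subject identifier are illustrative assumptions and not part of the repository.

```python
# Hypothetical toy data: two subjects, each a dict with 'labels' and 'visits'.
# The numbers are made up; only the overall shape follows the description above.
toy_records = [
    {
        "labels": [0.0, 1.0, 0.0],
        "visits": [
            ([3, 17], [], [], [5]),   # codes, lab mask, lab value indices, time-gap index
            ([], [1, 0, 1], [4, 9], [2]),
        ],
    },
    {
        "labels": [1.0, 0.0, 0.0],
        "visits": [
            ([8], [], [], [6]),
        ],
    },
]


def summarize_subjects(records):
    """Return one summary dict per subject (one subject per record)."""
    summaries = []
    for subject_id, record in enumerate(records):
        visits = record["visits"]
        for visit in visits:
            # Each time step is expected to be a 4-element tuple.
            if len(visit) != 4:
                raise ValueError(f"subject {subject_id}: expected 4 elements, got {len(visit)}")
        summaries.append(
            {
                "subject": subject_id,
                "num_visits": len(visits),
                # Labels are static per subject, so their length is unrelated to num_visits.
                "num_labels": len(record["labels"]),
            }
        )
    return summaries


summaries = summarize_subjects(toy_records)
```

Because the labels are static, `num_labels` stays the same for every subject while `num_visits` varies, which is why the two lengths do not match.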

sweta-lab commented 5 days ago

Thank you for the explanation! Here is an example of the generated discretized dictionary:

{'visits': [
  ([263, 702, 852, 1087, 1529, 1595, 2039, 2124, 2141, 2475, 3003, 3097, 3125, 3245, 3878, 4712, 4931, 4955, 5226, 5328, 5704, 5861, 6453, 6980, 7503, 7561, 7975, 8015, 8356, 8384, 8491, 8976, 8999, 9437, 9741, 9828, 10177, 10200], [], [], [24.35]),
  ([], ['Diastolic blood pressure', 'Glucose', 'Heart Rate', 'Mean blood pressure', 'Oxygen saturation', 'Respiratory rate', 'Systolic blood pressure', 'Temperature'], [76, 118, 60, 82, 94, 19, 117, 36.5], [0.3]),
  ([10200], ['Capillary refill rate', 'Glascow coma scale motor response', 'Glascow coma scale total', 'Heart Rate', 'Mean blood pressure', 'Oxygen saturation', 'Respiratory rate'], ['1.0', 'Abnormal Flexion', '5', 61, 120, 97, 21], [2.4]),
  ([], ['Heart Rate', 'Mean blood pressure', 'Oxygen saturation', 'Respiratory rate'], [73, 105, 99, 21], [2.0]),
  ([10200], ['Capillary refill rate', 'Glascow coma scale motor response', 'Glascow coma scale total', 'Glucose', 'Heart Rate', 'Mean blood pressure', 'Oxygen saturation', 'Respiratory rate', 'Temperature'], ['1.0', 'Abnormal Flexion', '5', 138, 64, 82, 92, 24, 36.9], [1.5]),
  ([], ['Heart Rate', 'Mean blood pressure', 'Oxygen saturation', 'Respiratory rate'], [83, 95, 92, 23], [1.7]),
  ([10200], ['Capillary refill rate', 'Glascow coma scale motor response', 'Glascow coma scale total', 'Heart Rate', 'Mean blood pressure', 'Oxygen saturation', 'Respiratory rate', 'Temperature'], ['1.0', 'Abnormal Flexion', '5', 70, 87, 92, 22, 36.6], [2.0]),
  ([], ['Heart Rate', 'Mean blood pressure', 'Oxygen saturation', 'Respiratory rate'], [91, 107, 93, 23], [1.9]),
  ([], ['Heart Rate', 'Oxygen saturation', 'Respiratory rate'], [79, 93, 25], [0.9])
 ],
 'labels': array([0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)}

btheodorou99 commented 5 days ago

Perfect, yeah, that has been discretized, but the meanings are largely the same. The visits entry is a list of tuples where each tuple is a time step. Those numbers [263, 702, 852, ... 10200] are medical codes (the indexToCode mapping can recover the true codes). Next, the two empty lists indicate that there are no labs at that step (because the overall structure is generally an initial step with diagnoses, followed by a lab time series for the rest of the stay). 24.35 is the age at admission. The next tuple then has no codes but a diastolic blood pressure of 76, glucose of 118, etc., and those labs occur 0.3 hours after the first visit. It continues like this for the rest of the visits.
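The reading above can be sketched in Python as follows. This is only an illustration, assuming the converted layout shown in the example: the first tuple carries the codes and the age at admission, and each later tuple carries lab names, lab values in the same order, and a gap in hours since the previous step. The function name `flatten_record` and the abbreviated `example` record are hypothetical, and summing the gaps into elapsed hours is an assumption based on the "time span since the previous visit" description.

```python
# Abbreviated copy of the record from this thread (first code list shortened).
example = {
    "visits": [
        ([263, 702, 10200], [], [], [24.35]),
        ([], ["Diastolic blood pressure", "Glucose", "Heart Rate"], [76, 118, 60], [0.3]),
        ([10200], ["Capillary refill rate", "Heart Rate"], ["1.0", 61], [2.4]),
        ([], ["Heart Rate", "Oxygen saturation"], [73, 99], [2.0]),
    ],
}


def flatten_record(record):
    """Split one converted record into static info and per-step lab rows."""
    visits = record["visits"]
    first_codes, _, _, first_last = visits[0]
    # In the first step, the last element holds the age at admission.
    static = {"code_indices": list(first_codes), "age_at_admission": first_last[0]}

    rows = []
    elapsed = 0.0
    for codes, lab_names, lab_values, gap in visits[1:]:
        # Assumption: each gap is in hours since the previous step, so we accumulate.
        elapsed += gap[0]
        rows.append(
            {
                "hours_since_admission": round(elapsed, 2),
                "code_indices": list(codes),
                # Lab names and values are parallel lists within the same tuple.
                "labs": dict(zip(lab_names, lab_values)),
            }
        )
    return static, rows


static, rows = flatten_record(example)
```

Pairing names with values inside a single tuple, rather than across consecutive lists, is what resolves the apparent length mismatch raised earlier in the thread.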

sweta-lab commented 5 days ago

Perfect -- thank you for the clear explanation!