btheodorou99 / HALO_Inpatient

No file such as "labelProbs.pkl" #16

Closed sweta-lab closed 3 months ago

sweta-lab commented 4 months ago

I am trying to run the continuous variables part of the repository, but there is no file named "labelProbs.pkl", and none of the other files generates it. Am I missing something? The file is loaded here:

labelProbs = pickle.load(open('./discretized_data/labelProbs.pkl', 'rb'))

btheodorou99 commented 4 months ago

My apologies for that. Those probabilities are supposed to be generated before training; I have just added the relevant code.

sweta-lab commented 4 months ago

Thank you very much! Weirdly, I was still able to generate the results without the labelProbs.pkl file. On another note, could you please help me understand the structure of the generated results file?

  1. I have multiple dictionaries containing the keys 'visits' and 'labels'. How can I differentiate the visit on a subject-level basis?

  2. The 'visits' key contains a list of ascending numbers such as [702, 715, 1224, 1595, 1748, 2997, 3061, 3070, 3714, 3881, 3950, 4091] and so on. I am guessing these are timestamps, but I am unsure whether they are in hours or days.

  3. Also, the next few lists contain diagnosis codes (e.g., 10200) and lab test names and values, but I am unsure how to map the tests to the values, since the lengths of consecutive lists don't match.

  4. How can I map the 'labels' to the 'visits'? Their lengths don't match. And what do the 'labels' signify?

Thank you in advance!

btheodorou99 commented 4 months ago

Yeah, I think it is not actually used, because for the experiments we just create an unconditioned copy of the training dataset (loading the probabilities was an artifact of other exploration that I forgot to remove).

As for your other questions: each dictionary, consisting of a labels vector and a visits list, is a single subject. The labels are static information such as patient phenotypes, demographic info, etc.; they are not a sequence meant to mirror the visits. The visits (depending on whether you've already discretized or not) are a list of separate visit time steps, each of which is a tuple. The tuple should have 4 elements: a list of variable indices (which correspond to diagnosis codes), a mask of the labs that were conducted, the value indices of those labs (because the model operates over discretized labs, which are converted into binary variables), and finally an index for the time span since the previous visit (which again is binarized). After running discretized_convert.py the structure changes slightly, with values being sampled for all the discretized variables.

Hopefully this helps, but if you have any further questions, feel free to copy an example dictionary into your follow-up comment so that I can see exactly which stage it's at and point directly to the different structures.
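For illustration, here is a minimal sketch of the four-element visit tuple described above. It is not code from the repository: the `Visit` field names, the `summarize_subject` helper, and the toy record are all hypothetical, and the exact contents of the mask and value lists are assumptions based only on the description in this comment.

```python
from typing import Any, Dict, List, NamedTuple


class Visit(NamedTuple):
    """One visit time step; field names are hypothetical, not the repo's."""
    code_indices: List[int]       # indices into the code vocabulary
    lab_mask: List[int]           # assumed: 1 where a lab was measured, else 0
    lab_value_indices: List[int]  # assumed: one discretized bin per measured lab
    time_gap_index: List[int]     # discretized time since the previous visit


def summarize_subject(subject: Dict[str, Any]) -> Dict[str, int]:
    """Count visits, codes, measured labs, and static labels for one subject."""
    visits = [Visit(*v) for v in subject["visits"]]
    return {
        "n_visits": len(visits),
        "n_codes": sum(len(v.code_indices) for v in visits),
        "n_labs": sum(sum(v.lab_mask) for v in visits),
        "n_static_labels": len(subject["labels"]),
    }


# Toy record invented for illustration only (not real model output).
toy_subject = {
    "visits": [
        ([3, 17], [0, 0, 0], [], [2]),   # top-level visit: codes, no labs
        ([], [1, 0, 1], [4, 9], [0]),    # lab visit: two of three labs measured
    ],
    "labels": [0.0, 1.0, 0.0],           # static, per-subject labels
}
summary = summarize_subject(toy_subject)
print(summary)
```

The point of the sketch is only that one dictionary is one subject, that `labels` is a per-subject vector, and that `visits` is the sequence.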

sweta-lab commented 4 months ago

Thank you for the explanation! Here is an example of the generated discretized dictionary:

{'visits': [([263, 702, 852, 1087, 1529, 1595, 2039, 2124, 2141, 2475, 3003,
3097, 3125, 3245, 3878, 4712, 4931, 4955, 5226, 5328, 5704, 5861, 6453, 6980, 
7503, 7561, 7975, 8015, 8356, 8384, 8491, 8976, 8999, 9437, 9741, 9828, 10177, 
10200], [], [], [24.35]), ([], ['Diastolic blood pressure', 'Glucose', 'Heart Rate', 
'Mean blood pressure', 'Oxygen saturation', 'Respiratory rate', 'Systolic blood pressure', 
'Temperature'], [76, 118, 60, 82, 94, 19, 117, 36.5], [0.3]), ([10200], ['Capillary refill rate', 
'Glascow coma scale motor response', 'Glascow coma scale total', 'Heart Rate', 
'Mean blood pressure', 'Oxygen saturation', 'Respiratory rate'], ['1.0', 'Abnormal Flexion', 
'5', 61, 120, 97, 21], [2.4]), ([], ['Heart Rate', 'Mean blood pressure', 'Oxygen saturation', 
'Respiratory rate'], [73, 105, 99, 21], [2.0]), ([10200], ['Capillary refill rate', 
'Glascow coma scale motor response', 'Glascow coma scale total', 'Glucose', 
'Heart Rate', 'Mean blood pressure', 'Oxygen saturation', 'Respiratory rate', 'Temperature'], 
['1.0', 'Abnormal Flexion', '5', 138, 64, 82, 92, 24, 36.9], [1.5]), ([], ['Heart Rate', 'Mean blood pressure',
 'Oxygen saturation', 'Respiratory rate'], [83, 95, 92, 23], [1.7]), ([10200], ['Capillary refill rate',
 'Glascow coma scale motor response', 'Glascow coma scale total', 'Heart Rate', 'Mean blood pressure', 
'Oxygen saturation', 'Respiratory rate', 'Temperature'], ['1.0', 'Abnormal Flexion', '5', 70, 87, 92, 22, 36.6], [2.0]),
 ([], ['Heart Rate', 'Mean blood pressure', 'Oxygen saturation', 'Respiratory rate'], 
[91, 107, 93, 23], [1.9]), ([], ['Heart Rate', 'Oxygen saturation', 'Respiratory rate'], [79, 93, 25], 
[0.9])], 'labels': array([0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)}

btheodorou99 commented 4 months ago

Perfect, yeah, that has been discretized, but the meanings are largely the same. The visits entry is a list of tuples where each tuple is a time step. Those numbers [263, 702, 852, ... 10200] are medical codes (the indexToCode mapping can extract the true numerical codes). Next, the two empty lists indicate a lack of labs (the overall structure is generally an initial step with diagnoses, followed by a lab time series for the rest of the stay). 24.35 is the age at admission. The next tuple then has no codes but a diastolic blood pressure of 76, glucose of 118, etc., and those labs occur 0.3 hours after the first visit. It continues like this for the rest of the visits.
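As a rough sketch of how the converted structure could be walked, the snippet below pairs each lab name with its value and tracks elapsed time. The `describe_visits` helper is hypothetical, and the record is abbreviated from the example dictionary above. Treating the first visit's last element as age and summing the later gaps into hours since the first visit are assumptions drawn from the explanations in this thread.

```python
def describe_visits(subject):
    """Return (age, rows) with one row per visit of a converted subject."""
    age = None
    hours = 0.0
    rows = []
    for i, (codes, lab_names, lab_values, gap) in enumerate(subject["visits"]):
        if len(lab_names) != len(lab_values):
            raise ValueError(f"visit {i}: lab names and values differ in length")
        if i == 0:
            age = gap[0]      # assumed: first entry holds age at admission
        else:
            hours += gap[0]   # assumed: hours since the previous visit
        rows.append({
            "visit": i,
            "hours_since_first": round(hours, 2),
            "codes": list(codes),
            "labs": dict(zip(lab_names, lab_values)),
        })
    return age, rows


# Abbreviated from the example dictionary posted above.
example = {
    "visits": [
        ([263, 702, 10200], [], [], [24.35]),
        ([], ["Diastolic blood pressure", "Glucose", "Heart Rate"],
         [76, 118, 60], [0.3]),
        ([10200], ["Heart Rate", "Mean blood pressure"], [61, 120], [2.4]),
    ],
}
age, rows = describe_visits(example)
for row in rows:
    print(row)
```

Within a single tuple the lab names and lab values line up position by position, so `zip` is enough to pair them; lengths only differ between different elements of the tuple or between visits.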

sweta-lab commented 4 months ago

Perfect -- thank you for the clear explanation!

sweta-lab commented 4 months ago

Sorry about the back-to-back questions.

  1. When I convert the medication labels using the indexToLabel mapping, some codes remain that do not have a corresponding label. Would you happen to know what these represent? Here is an example; I'd like to know what the codes '5070', '4019', etc. represent.

['5070', '4019', '82133', '82300', '82021', 'E8190', '9973', '7815', '7905', '7936', '7965', '7865', '3891', '8622', '9904', '966', '3324', '8314', '8659', '7935', '7906', '7867', '9672', 'Acetaminophen', 'Potassium Chloride (Powder)', 'Potassium Chloride', '5% Dextrose', 'AcetaZOLamide Sodium', 'Milk of Magnesia', 'D5W', 'Ursodiol', 'Magnesium Sulfate', '0.9% Sodium Chloride', 'FoLIC Acid', 'Morphine Sulfate', 'Aspirin', 'Albuterol 0.083% Neb Soln', 'Bisacodyl', 'Fentanyl Citrate', 'Propofol', 'Docusate Sodium (Liquid)', 'Gentamicin', '5% Dextrose', 'Furosemide', 'Atenolol', 'Docusate Sodium', 'Meperidine', 'Pantoprazole Sodium', 'Acetaminophen', 'Ipratropium Bromide Neb', 'Cefazolin', '0.9% Sodium Chloride', 'Metoprolol Tartrate', 'Lansoprazole Oral Suspension', 'Metoclopramide', 'Phenylephrine', 'Ferrous Sulfate', 'Bisacodyl', 'Levofloxacin', 'Acetaminophen', 'Lactated Ringers', 'Fentanyl Citrate', 'Albuterol', 'Furosemide', 'Atenolol', 'Calcium Gluconate', 'Ferrous Sulfate', 'Lorazepam', 'Metoprolol Tartrate', 'Potassium Chloride', 'Levofloxacin', 'Enoxaparin Sodium']

  2. It seems all or most medications are administered in the first visit only. Is this correct?

btheodorou99 commented 4 months ago

Hello, no worries, please ask as many questions as you'd like. Assuming you mean the static, top-level labels, not every code corresponds to a label phenotype (the labels are just a handful of conditions, which do not cover all codes or patients). If you mean indexToCode, then every index should map to something, and the numbers you're referring to are likely diagnosis codes (which are numerical rather than word descriptions). You can look up ICD9 diagnosis codes online, or there should be a MIMIC table with the description information. The mapping could also contain modality information (diagnoses, medications, etc.) and the descriptions, which would be helpful, but the code for these experiments didn't include them.

Finally, you're right that most of these are in the first visit only. This is just how we structured the data in our experiments. A patient going to the hospital has a bunch of top-level diagnosis and medication information and then a series of labs during the course of their stay, so to capture the lab time series we have a single top-level diagnoses/medications visit followed by a series of lab visits. If a patient has a second hospital stay, the pattern will repeat and there will be another medication visit further along. Note that this is not a requirement, just how we chose to represent the patient: you could aggregate the lab information into the single top-level visit, or you could have separate models for top-level visits and for lab time series conditioned on the top-level information.
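Since the mapping in these experiments carries no modality information, one rough way to separate code-like entries from medication names in a decoded list is a pattern check. The sketch below is a heuristic only: the `split_codes_and_names` helper is hypothetical, and the pattern simply assumes that ICD-9 style entries are stored without a decimal point as 3 to 5 digits, or as an E or V prefix followed by digits. It cannot tell diagnosis codes from procedure codes.

```python
import re

# Heuristic pattern for ICD-9 style codes stored without a decimal point.
ICD9_LIKE = re.compile(r"\d{3,5}|E\d{3,4}|V\d{2,4}")


def split_codes_and_names(decoded):
    """Split decoded entries into code-like strings and free-text names."""
    codes, names = [], []
    for item in decoded:
        if ICD9_LIKE.fullmatch(item):
            codes.append(item)
        else:
            names.append(item)
    return codes, names


# A few entries taken from the list posted above.
decoded = ["5070", "4019", "E8190", "966", "Acetaminophen", "5% Dextrose", "D5W"]
codes, names = split_codes_and_names(decoded)
print(codes)
print(names)
```

The code-like entries can then be looked up in an ICD9 reference or in the MIMIC description tables, as suggested above.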

sweta-lab commented 4 months ago

This is really helpful! Thank you again. Additionally, I wrote some code to convert the synthetic data (after running it through the discretized_convert.py file) to a complete .csv file containing one row per visit for each subject, separate procedure, diagnosis and medication names and every other metric present in the training data. Would you be interested in having this code as part of your repository? If yes, I can send a pull request. :)

btheodorou99 commented 4 months ago

Sure that would be great, feel free to send a pull request!

sweta-lab commented 4 months ago

Done!