Open dancoster opened 1 year ago
Generated using the shared df_not_imputed.csv file. Units are missing.
Generated using this file. Some units are missing.
Need to filter out some columns from the HIRID longitudinal data based on the comparison table below.
Note: Used the following dictionary to map hirid and mimic features.
hirid_mapping = {
'Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma' : None,
'Albumin [Mass/volume] in Serum or Plasma' : 'Albumin',
'Amylase [Enzymatic activity/volume] in Serum or Plasma': None,
'Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma' : None,
'Bicarbonate [Moles/volume] in Arterial blood':'Bicarbonate',
'Bilirubin.direct [Mass/volume] in Serum or Plasma': None,
'Bilirubin.total [Moles/volume] in Serum or Plasma' : None,
'Calcium [Moles/volume] in Blood': 'Calcium',
'Calcium.ionized [Moles/volume] in Blood': 'Calcium',
'Carboxyhemoglobin/Hemoglobin.total in Arterial blood': 'Hemoglobin',
'Chloride [Moles/volume] in Blood': 'Chloride',
'Core body temperature': 'Temperature',
'Creatinine [Moles/volume] in Blood': 'Creatinine',
'Diastolic arterial pressure': 'Diastolic blood pressure',
'Glucose [Moles/volume] in Serum or Plasma': 'Glucose',
'Heart rate': 'Heart Rate',
'Hemoglobin [Mass/volume] in blood': 'Hemoglobin',
'INR in Blood by Coagulation assay': 'Prothrombin time INR',
'Lactate [Mass/volume] in blood': 'Lactic acid',
'Lymphocytes [#/volume] in Blood': 'Lymphocytes',
'Magnesium [Moles/volume] in Blood': 'Magnesium',
'Methemoglobin/Hemoglobin.total in Arterial blood': 'Hemoglobin',
'Neutrophils/100 leukocytes in Blood': 'Neutrophils',
'Peripheral oxygen saturation': 'Oxygen saturation',
'Platelets [#/volume] in Blood': 'Platelets',
'Potassium [Moles/volume] in Blood': 'Potassium',
'Pulmonary artery diastolic pressure': 'Diastolic blood pressure',
'Pulmonary artery systolic pressure': 'Systolic blood pressure',
'Respiratory rate': 'Respiratory rate',
'Sodium [Moles/volume] in Blood': 'Sodium',
'Systolic arterial pressure': 'Systolic blood pressure'
}
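As a minimal sketch of the filtering step, the dictionary can be used to keep only the HiRID columns that have a MIMIC counterpart. This assumes the HiRID data is a wide DataFrame with one column per HiRID variable name; the helper `filter_mapped_columns`, the `patientid` ID column, and the toy values are illustrative assumptions, not part of the project code. Columns are only filtered here, not renamed, because several HiRID variables map to the same MIMIC name and the thread does not say how they should be combined.

```python
import pandas as pd

# Small excerpt of the mapping above; None means "no MIMIC counterpart".
hirid_mapping_excerpt = {
    "Albumin [Mass/volume] in Serum or Plasma": "Albumin",
    "Amylase [Enzymatic activity/volume] in Serum or Plasma": None,
    "Heart rate": "Heart Rate",
}


def filter_mapped_columns(df, mapping, id_cols=("patientid",)):
    """Keep the ID columns plus every feature column with a non-None mapping."""
    keep = [c for c in df.columns if c in id_cols or mapping.get(c) is not None]
    return df[keep]


# Toy wide-format data (hypothetical values).
toy = pd.DataFrame({
    "patientid": [1, 2],
    "Albumin [Mass/volume] in Serum or Plasma": [35.0, None],
    "Amylase [Enzymatic activity/volume] in Serum or Plasma": [80.0, 95.0],
    "Heart rate": [72.0, 88.0],
})
filtered = filter_mapped_columns(toy, hirid_mapping_excerpt)
print(list(filtered.columns))
```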
Thanks Pavan, could you please add a column for 'Missing Rate (%) after ffill'? (I am trying to replicate the second table.) I think we can have only one table.
Please also add a column for 'Subjects Missing Rate (%)', i.e., the percentage of subjects who had no measurement of this feature at all.
Please also add "N subjects" (the total number of subjects who had this measurement).
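The requested columns could be computed along these lines. This is a sketch under assumptions: the data is a long DataFrame with one row per observation, a `subject_id` column, and one column per feature; the function name `missing_rate_table` and the toy values are hypothetical.

```python
import pandas as pd


def missing_rate_table(df, feature_cols, subject_col="subject_id"):
    """Per-feature observation-level and subject-level missingness."""
    n_rows = len(df)
    n_subjects_total = df[subject_col].nunique()
    rows = []
    for col in feature_cols:
        n_obs = int(df[col].notna().sum())
        # Subjects with at least one non-missing value of this feature.
        n_subj = int(df.loc[df[col].notna(), subject_col].nunique())
        rows.append({
            "Feature": col,
            "N observations": n_obs,
            "Observations Missing Rate (%)": round(100 * (1 - n_obs / n_rows), 2),
            "N subjects": n_subj,
            "Subjects Missing Rate (%)": round(100 * (1 - n_subj / n_subjects_total), 2),
        })
    return pd.DataFrame(rows).set_index("Feature")


# Toy data: subject 2 never has a Glucose measurement.
toy = pd.DataFrame({
    "subject_id": [1, 1, 2, 2, 3],
    "Glucose": [5.1, None, None, None, 6.0],
    "Sodium": [140.0, 141.0, 139.0, None, 138.0],
})
table = missing_rate_table(toy, ["Glucose", "Sodium"])
print(table)
```

Running the same function on both extracts would also keep the MIMIC and HiRID tables in an identical format.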
Calculate the following for each of the datasets separately from the table: (a) Mortality rate (subject level) (b) Percentage of observations where mortality in the next 0-48h is positive (c) Length of stay (d) Age (e) Gender proportions
df[temp_new_lab_name] = df.groupby('subject_id')[temp_new_lab_name].ffill()
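A sketch of how the forward-fill line above could feed the 'Missing Rate (%) after ffill' column. It assumes rows are already sorted by time within each subject, so that `ffill` carries the previous observation forward; the function name and toy data are hypothetical.

```python
import pandas as pd


def missing_rate_after_ffill(df, feature_cols, subject_col="subject_id"):
    """Observation missing rate (%) per feature after within-subject ffill."""
    # Forward-fill inside each subject only, so values never leak across subjects.
    filled = df.groupby(subject_col)[feature_cols].ffill()
    return (filled.isna().mean() * 100).round(2)


toy = pd.DataFrame({
    "subject_id": [1, 1, 2, 2, 3, 3],
    "Glucose": [5.1, None, None, None, None, 6.0],
})
rates = missing_rate_after_ffill(toy, ["Glucose"])
print(rates)
```

In the toy data only subject 1's gap is filled: subject 2 has no value to carry forward, and subject 3's missing value precedes its first measurement.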
MIMIC_missing_rate_after_ffill.csv HIRID_missing_rate_after_ffill.csv
Gender Stats (Subject level) : MIMIC_gender.csv Age and LOS stats : MIMIC_age_los.csv Mortality Stats (Subject level) : MIMIC_mortality_stats.csv Mortality Observations : MIMIC_mortality_stats_obs.csv
Age and LOS stats : HIRID_age_los.csv Gender stats : HIRID_gender.csv Mortality Observations count : HIRID_mortality_stats_obs.csv Mortality Stats (Subject level) : HIRID_mortality_stats.csv
Mortality: subject level (required output is a percentage). Label_48: observation level (required output is a percentage). Age and LOS should be calculated at the subject level (each subject has one value of age and one value of LOS). The best approach would be to create a function that prints all of these outputs. For gender, please also report Males %.
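One possible shape for such a function is sketched below. The column names `hadm_id`, `charttime`, `DEATHTIME`, `GENDER`, `age`, `LOS`, and `Mortality` follow the extract used in this thread; the function names, the `"M"` gender code, and the reading of Label_48 as "death occurs within 48 hours after the observation time" are assumptions.

```python
import pandas as pd


def add_label_48(df, chart_col="charttime", death_col="DEATHTIME"):
    """Flag observations that are followed by death within the next 0-48 hours."""
    hours_to_death = (df[death_col] - df[chart_col]) / pd.Timedelta(hours=1)
    out = df.copy()
    # Missing death times give NaN hours, which between() treats as False.
    out["LABEL_48"] = hours_to_death.between(0, 48).astype(int)
    return out


def print_cohort_stats(df, id_col="hadm_id"):
    """Print the requested statistics and also return them as a dict."""
    per_subject = df.drop_duplicates(id_col)  # one row per subject/admission
    stats = {
        "Mortality rate, subject level (%)": 100 * per_subject["Mortality"].mean(),
        "LABEL_48 positive, observation level (%)": 100 * df["LABEL_48"].mean(),
        "Males (%)": 100 * (per_subject["GENDER"] == "M").mean(),
        "Mean age": per_subject["age"].mean(),
        "Mean LOS (days)": (per_subject["LOS"] / pd.Timedelta(days=1)).mean(),
    }
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
    return stats


# Toy extract with two admissions (hypothetical values).
toy = pd.DataFrame({
    "hadm_id": [1, 1, 2, 2],
    "charttime": pd.to_datetime(["2020-01-01 00:00", "2020-01-03 00:00",
                                 "2020-01-01 00:00", "2020-01-02 00:00"]),
    "DEATHTIME": pd.to_datetime(["2020-01-04 00:00", "2020-01-04 00:00", None, None]),
    "GENDER": ["M", "M", "F", "F"],
    "age": [60, 60, 70, 70],
    "LOS": pd.to_timedelta(["3 days", "3 days", "2 days", "2 days"]),
})
toy["Mortality"] = toy["DEATHTIME"].notna().astype(int)
toy = add_label_48(toy)
stats = print_cohort_stats(toy)
```

Computing the subject-level figures from a de-duplicated frame keeps subjects with many observations from being over-weighted.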
Used the class below to generate the statistics.
import os


class OtherStatsGenerator:
    """Generates cohort-level statistics (mortality, LABEL_48, age/LOS, gender)."""

    def __init__(self, extract, input_type):
        self.extract = extract
        self.input_type = input_type  # dataset tag used in the output file names

    @staticmethod
    def _percentages(counts):
        # Convert per-group "hadm_id" counts into percentages of the total.
        return counts["hadm_id"] * 100 / counts["hadm_id"].sum()

    def gender_percentage(self, mimic_extract):
        # One row per admission, then the share of each gender.
        gender = mimic_extract[["GENDER", "hadm_id"]].drop_duplicates().groupby(["GENDER"]).count()
        return self._percentages(gender)

    def age_los_subject_level(self, mimic_extract):
        # One row per (hadm_id, age, LOS) combination.
        age_los = mimic_extract.round(2).groupby(["hadm_id", "age", "LOS"]).nth(0).reset_index()[["hadm_id", "age", "LOS"]]
        return age_los

    def label48_subject_level(self, mimic_extract):
        # Admissions are de-duplicated before counting.
        lab48_df = mimic_extract[["Mortality", "LABEL_48", "hadm_id"]].drop_duplicates().groupby(["Mortality", "LABEL_48"]).count()
        return self._percentages(lab48_df)

    def label48_observation_level(self, mimic_extract):
        # Every observation (row) is counted.
        lab48_df = mimic_extract[["Mortality", "LABEL_48", "hadm_id"]].groupby(["Mortality", "LABEL_48"]).count()
        return self._percentages(lab48_df)

    def mortality_subject_level(self, mimic_extract):
        mortality_df = mimic_extract[["Mortality", "hadm_id"]].drop_duplicates().groupby(["Mortality"]).count()
        return self._percentages(mortality_df)

    def generate_all_stats(self):
        extract = self.extract
        return self.mortality_subject_level(extract), self.label48_observation_level(extract), self.age_los_subject_level(extract), self.gender_percentage(extract)

    def save_stats(self, stats):
        map_d = {
            0: f"{self.input_type} Mortality Subject Level.csv",
            1: f"{self.input_type} Label48 Observation Level.csv",
            2: f"{self.input_type} Age and LOS subject level.csv",
            3: f"{self.input_type} Gender Subject Level.csv",
        }
        # Use the stats passed in, not the global MIMIC results.
        for i, stat in enumerate(stats):
            stat.to_csv(os.path.join("/Users/pavan/Desktop/TAU/DrugLab/results/stats", map_d[i]))
The class above can be run using the code below.
import pandas as pd

# Parse the timestamp columns.
for col in ["charttime", "DEATHTIME", "DISCHTIME", "ADMITTIME"]:
    mimic_extract[col] = pd.to_datetime(mimic_extract[col])
# Length of stay as a Timedelta.
mimic_extract["LOS"] = mimic_extract["DISCHTIME"] - mimic_extract["ADMITTIME"]
# Mortality is 1 when a death time is recorded, else 0.
mimic_extract["Mortality"] = mimic_extract["DEATHTIME"].notna().astype(int)
# LABEL_48 is 1 for admissions that ended in death with a LOS under 48 hours.
mimic_extract["LABEL_48"] = ((mimic_extract["Mortality"] == 1) & (mimic_extract["LOS"] < pd.Timedelta(hours=48))).astype(int)
mimic_other_stats_gen = OtherStatsGenerator(mimic_extract, input_type="mimic")
mimic_other_stats_res = mimic_other_stats_gen.generate_all_stats()
mimic_other_stats_gen.save_stats(mimic_other_stats_res)
hirid_parser = hirid.HiRiDParser(data=raw_path, res=res_path, gender=gender, age_b=age_b, age_a=age_a, load="MANUAL_MAPPING_HIRID")
hirid_extract = pd.merge(hirid_extract.rename(columns={"HADM_ID":"hadm_id", "AGE":"age"}), hirid_parser.g_table[["patientid", "admissiontime"]].rename(columns={"patientid":"hadm_id", "admissiontime":"ADMITTIME"}), on="hadm_id")
hirid_extract["Mortality"] = hirid_extract.discharge_status.apply(lambda r: 0 if r=="alive" else 1)
hirid_extract["LOS"] = pd.to_datetime(hirid_extract["EST_DISCHTIME"]) - pd.to_datetime(hirid_extract["ADMITTIME"])
hirid_other_stats_gen = OtherStatsGenerator(hirid_extract, input_type="hirid")
hirid_other_stats_res = hirid_other_stats_gen.generate_all_stats()
hirid_other_stats_gen.save_stats(hirid_other_stats_res)
hirid Gender Subject Level.csv hirid Age and LOS subject level.csv hirid Label48 Observation Level.csv hirid Mortality Subject Level.csv mimic Gender Subject Level.csv mimic Age and LOS subject level.csv mimic Label48 Observation Level.csv mimic Mortality Subject Level.csv
Please merge all of these parameters into this table:
Looks better, a few comments: (a) Please add race proportions. (b) I think that there is a bug in the mortality rate (%) and the mortality rate in the next 48h in Hirid/MIMIC:
(c) Attached are a few cosmetic comments, marked in red and yellow.
MIMIC_missing_rate_after_ffill_comments.xls
(d) Please write code that generates the rows for Age, Gender, mortality rate (%), mortality rate in the next 48h, and LOS. There are different fonts, and it seems that they might have been copy-pasted.
Race information is only available in MIMIC and is missing in HIRID, so I have excluded it.
Good.
(a) The commas in the "N" column for MIMIC and HiRID are partially wrong in 6- and 7-digit numbers: "2,00,242" -> "200,242". (b) Please sort the labs and vitals in MIMIC alphabetically (it's OK to keep Age, LOS, etc. at the bottom).
Just to confirm: (c) Do you use only WHITE between 50-80 here? (d) Was this table generated before the unit conversion of the different lab tests?
(a) Remove all commas. (c) WHITE, 40-80. (d) Please generate this table after unit conversion.
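For the comma issue, a small sketch of normalizing the counts before writing the table. The helper name `parse_count` is hypothetical; the input string is the example quoted above.

```python
def parse_count(text):
    """Turn a count such as '2,00,242' into an int, ignoring any commas."""
    return int(str(text).replace(",", ""))


n = parse_count("2,00,242")
print(n)         # 200242, plain integer with no commas
print(f"{n:,}")  # 200,242, standard thousands grouping if commas are wanted
```

Keeping the column numeric until the table is exported avoids locale-dependent grouping in the first place.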
[1] Write a function that generates exactly the same table format for both MIMIC and HiRiD:
This is the table generated for longitudinal data. There are several problems with it:
This is a better one. I based it on the full mimic_extract and calculated the missing observation rate based on patient events.
Two unrelated comments. Concerning the fill-forward rates: what I tried to do was simply fill any missing observation with the previous one, but to what effect? To see changes in the mean for the table? If so, I can easily generate this as well.
Could you create a separate ticket for the external calculations? There is a lot of room for interpretation on how to do some of these things.
Generate this table on the raw data from MIMIC. Add these characteristics: (a) Mortality rate (subject level) (b) Percentage of observations where mortality in the next 0-48h is positive (c) Length of stay (d) Age (e) Gender proportions
[1] Write a function that generates exactly the same table format for both MIMIC and HiRiD:
Please note that N is the number of observations and the missing rate is that of the observations (= number of missing rows). Please change the column names to "N observations" and "Observations Missing Rate". Please also add "N subjects" and a missing rate of subjects without any measurement of this variable ("Subjects Missing Rate").
[2] Write a function that generates this table format for MIMIC and HiRid, with missing rates after forward-filling imputation:
Please add a column for the "Percentage of subjects with no missing values" (after the forward fill).
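The "Percentage of subjects with no missing values" column could be derived as sketched here. It assumes a `subject_id` column and rows sorted by time within each subject; the function name and toy data are hypothetical. Note that under this definition a subject whose first observation is missing still counts as having missing values, because forward fill has nothing to carry into that row.

```python
import pandas as pd


def pct_subjects_complete_after_ffill(df, feature_cols, subject_col="subject_id"):
    """Share of subjects (%) with no NaN left in a feature after ffill."""
    filled = df.groupby(subject_col)[feature_cols].ffill()
    # True for a (subject, feature) pair if any row is still NaN after ffill.
    has_missing = filled.isna().groupby(df[subject_col]).any()
    return ((~has_missing).mean() * 100).round(2)


toy = pd.DataFrame({
    "subject_id": [1, 1, 2, 2, 3, 3],
    "Glucose": [5.1, None, None, None, None, 6.0],
    "Sodium": [140.0, None, 139.0, 138.0, 137.0, None],
})
complete = pct_subjects_complete_after_ffill(toy, ["Glucose", "Sodium"])
print(complete)
```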
Attached is the original table: ICLR_Workshop_2023TSRLH__arXiv (1).pdf