Tables - Githubissues

dancoster commented 1 year ago

[1] Write a function that generate exactly the same format of table for both MIMIC and HiRiD:

Please notice that N is the number of observations and the missing rate is of the observations (=number of missing rows). Please change the cols names to "N observations" and "Observations Missing Rate"). Please add also "N subjects" and a missing rate of subjects without any measure of this variable (="Subjects Missing Rate").

[2] Write a function that generate this format of table for MIMIC and HiRid of Missing rates after forward-filling imputation:

Please add column of the "Percentage of subjects with no missing values" (after the fill forward)

Attached is the original table: ICLR_Workshop_2023TSRLH__arXiv (1).pdf

PavanReddy28 commented 1 year ago

MIMIC Missing Rate

Generated using df_not_imputed.csv file file shared. Units are missing.

MIMIC Missing Rate.csv

HIRID Missing Rate

Generated using this file. Some units are missing.

HIRID Missing Rate.csv

PavanReddy28 commented 1 year ago

Comparing MIMIC and HIRID Longitudinal data features

Left columns (_x) are associated with mean and std of HIRID features
Right columns (_y) are associated with mean and std of MIMIC features.

Need to filter out some columns from HIRID Longitudinal data based on the below comparision table.

Merged MIMIC and HIRID.csv

Note: Used the following dictionary to map hirid and mimic features.

hirid_mapping = {
    'Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma' : None,
    'Albumin [Mass/volume] in Serum or Plasma' : 'Albumin',
    'Amylase [Enzymatic activity/volume] in Serum or Plasma': None,
    'Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma' : None,
    'Bicarbonate [Moles/volume] in Arterial blood':'Bicarbonate',
    'Bilirubin.direct [Mass/volume] in Serum or Plasma': None,
    'Bilirubin.total [Moles/volume] in Serum or Plasma' : None,
    'Calcium [Moles/volume] in Blood': 'Calcium',
    'Calcium.ionized [Moles/volume] in Blood': 'Calcium',
    'Carboxyhemoglobin/Hemoglobin.total in Arterial blood': 'Hemoglobin',
    'Chloride [Moles/volume] in Blood': 'Chloride', 
    'Core body temperature': 'Temperature',
    'Creatinine [Moles/volume] in Blood': 'Creatinine', 
    'Diastolic arterial pressure': 'Diastolic blood pressure',
    'Glucose [Moles/volume] in Serum or Plasma': 'Glucose', 
    'Heart rate': 'Heart Rate',
    'Hemoglobin [Mass/volume] in blood': 'Hemoglobin',
    'INR in Blood by Coagulation assay': 'Prothrombin time INR', 
    'Lactate [Mass/volume] in blood': 'Lactic acid',
    'Lymphocytes [#/volume] in Blood': 'Lymphocytes', 
    'Magnesium [Moles/volume] in Blood': 'Magnesium',
    'Methemoglobin/Hemoglobin.total in Arterial blood': 'Hemoglobin',
    'Neutrophils/100 leukocytes in Blood': 'Neutrophils', 
    'Peripheral oxygen saturation': 'Oxygen saturation',
    'Platelets [#/volume] in Blood': 'Platelets', 
    'Potassium [Moles/volume] in Blood': 'Potassium',
    'Pulmonary artery diastolic pressure': 'Diastolic blood pressure',
    'Pulmonary artery systolic pressure': 'Systolic blood pressure', 
    'Respiratory rate': 'Respiratory rate',
    'Sodium [Moles/volume] in Blood': 'Sodium', 
    'Systolic arterial pressure': 'Systolic blood pressure'
}

dancoster commented 1 year ago

Thanks Pavan, could you please add a column of 'Missing Rate(%) after ffill)'. (Trying to replicate the second table). I think we can have only one table.

dancoster commented 1 year ago

Please also add a col of 'Subjects Missing Rate(%)' - i.e.,the precentage of subjects who had even no one measurement of this feature.

dancoster commented 1 year ago

Please add also "N subjects" (the total number of subjects who had this measurement).

dancoster commented 1 year ago

Calculate the for each of the datasets seperately from the table: (a) mortality rate (subjects level) (b) Percentage of Obs. were mortality in the next 0-48h is positive. (c) Length of Stay. (d) Age (e) Gender Prop/

PavanReddy28 commented 1 year ago

Missing Rate Tables

Before and After ffill's

MIMIC_missing_rate_after_ffill.csv

HIRID_missing_rate_after_ffill.csv

dancoster commented 1 year ago

df[temp_new_lab_name] = df.groupby('subject_id')[temp_new_lab_name].ffill()

PavanReddy28 commented 1 year ago

Updated Missing Rate table

MIMIC_missing_rate_after_ffill.csv HIRID_missing_rate_after_ffill.csv

MIMIC Statistics

Gender Stats (Subject level) : MIMIC_gender.csv Age and LOS stats : MIMIC_age_los.csv Mortality Stats (Subject level) : MIMIC_mortality_stats.csv Mortality Observations : MIMIC_mortality_stats_obs.csv

PavanReddy28 commented 1 year ago

HIRID Stats

Age and LOS stats : HIRID_age_los.csv Gender stats : HIRID_gender.csv Mortality Observations count : HIRID_mortality_stats_obs.csv Mortality Stats (Subject level) : HIRID_mortality_stats.csv

dancoster commented 1 year ago

Mortality - subject level (required output is percentage) Label_48 - observation level (required output is percentage) Age and LOS should be calculate on subject level (each subject have 1 value of age and 1 value of LOS). The best will be to create a function that will print all these outputs. For gender please report also Males %.

PavanReddy28 commented 1 year ago

Updated Statistics

Used the below class for generating the statitics.

class OtherStatsGenerator:
    def __init__(self, extract, input_type):
        self.extract = extract
        self.input_type = input_type

    def gender_percentage(self, mimic_extract):
        gender = mimic_extract[["GENDER", "hadm_id"]].drop_duplicates().groupby(["GENDER"]).count()
        total = gender.sum()[0]
        return gender.apply(lambda row : row["hadm_id"] * 100 / total, axis=1)

    def age_los_subject_level(self, mimic_extract):
        age_los = mimic_extract.round(2).groupby(["hadm_id", "age", "LOS"]).nth(0).reset_index()[["hadm_id", "age", "LOS"]]
        return age_los

    def label48_subject_level(self, mimic_extract):
        lab48_df = mimic_extract[["Mortality", "LABEL_48", "hadm_id"]].drop_duplicates().groupby(["Mortality", "LABEL_48"]).count()
        total_adm = lab48_df.sum()
        return lab48_df.apply(lambda row : row["hadm_id"]*100/total_adm, axis=1)

    def label48_observation_level(self, mimic_extract):
        lab48_df = mimic_extract[["Mortality", "LABEL_48", "hadm_id"]].groupby(["Mortality", "LABEL_48"]).count()
        total_adm = lab48_df.sum()
        return lab48_df.apply(lambda row : row["hadm_id"]*100/total_adm, axis=1)

    def mortality_subject_level(self, mimic_extract):
        mortality_df = mimic_extract[["Mortality", "hadm_id"]].drop_duplicates().groupby(["Mortality"]).count()
        total_adm = mortality_df.sum()
        return mortality_df.apply(lambda row : row["hadm_id"]*100/total_adm, axis=1)

    def generate_all_stats(self):
        extract = self.extract
        return self.mortality_subject_level(extract), self.label48_observation_level(extract), self.age_los_subject_level(extract), self.gender_percentage(extract)

    def save_stats(self, stats):
        map_d = {
            0:f"{self.input_type} Mortality Subject Level.csv",
            1:f"{self.input_type} Label48 Observation Level.csv",
            2:f"{self.input_type} Age and LOS subject level.csv",
            3:f"{self.input_type} Gender Subject Level.csv"
        }
        for i in range(len(mimic_other_stats_res)):
            mimic_other_stats_res[i].to_csv(os.path.join("/Users/pavan/Desktop/TAU/DrugLab/results/stats", map_d[i]))

The above code can be run using below code

mimic_extract["charttime"] = pd.to_datetime(mimic_extract["charttime"])
mimic_extract["DEATHTIME"] = pd.to_datetime(mimic_extract["DEATHTIME"])
mimic_extract["DISCHTIME"] = pd.to_datetime(mimic_extract["DISCHTIME"])
mimic_extract["ADMITTIME"] = pd.to_datetime(mimic_extract["ADMITTIME"])
mimic_extract["LOS"] = mimic_extract["DISCHTIME"] - mimic_extract["ADMITTIME"]
mimic_extract["Mortality"] = mimic_extract.DEATHTIME.isna().apply(lambda row: 0 if row else 1)
mimic_extract["LABEL_48"] = mimic_extract.apply(lambda row: 1 if row["Mortality"]==1 and row["LOS"]<pd.Timedelta(hours=48) else 0, axis=1)
mimic_other_stats_gen = OtherStatsGenerator(mimic_extract, input_type="mimic")
mimic_other_stats_res = mimic_other_stats_gen.generate_all_stats()
mimic_other_stats_gen.save_stats(mimic_other_stats_res)

hirid_parser = hirid.HiRiDParser(data=raw_path, res=res_path, gender=gender, age_b=age_b, age_a=age_a, load="MANUAL_MAPPING_HIRID")
hirid_extract = pd.merge(hirid_extract.rename(columns={"HADM_ID":"hadm_id", "AGE":"age"}), hirid_parser.g_table[["patientid", "admissiontime"]].rename(columns={"patientid":"hadm_id", "admissiontime":"ADMITTIME"}), on="hadm_id")
hirid_extract["Mortality"] = hirid_extract.discharge_status.apply(lambda r: 0 if r=="alive" else 1)
hirid_extract["LOS"] = pd.to_datetime(hirid_extract["EST_DISCHTIME"]) - pd.to_datetime(hirid_extract["ADMITTIME"])
hirid_other_stats_gen = OtherStatsGenerator(hirid_extract, input_type="hirid")
hirid_other_stats_res = hirid_other_stats_gen.generate_all_stats()
hirid_other_stats_gen.save_stats(hirid_other_stats_res)

hirid Gender Subject Level.csv hirid Age and LOS subject level.csv hirid Label48 Observation Level.csv hirid Mortality Subject Level.csv mimic Gender Subject Level.csv mimic Age and LOS subject level.csv mimic Label48 Observation Level.csv mimic Mortality Subject Level.csv

dancoster commented 1 year ago

Please merge all to this parameters to this table:

dancoster commented 1 year ago

Add to table only the colomon of Missing rate after ffill. https://github.com/dancoster/DrugLab/issues/43#issuecomment-1518502863
round to 2 digits after dot in 'Subjects Missing Rate(%)'
For count measurments (N and N subjects) add coma (i.e. 1000 -> 1,000).
Add the 'demographic' variables to the table in this manner:
- Please add Race proportions.

PavanReddy28 commented 1 year ago

HIRID_missing_rate_after_ffill.xls MIMIC_missing_rate_after_ffill.xls

dancoster commented 1 year ago

Looks better, few comments: (a) Please add Race proportions. (b) I think that there is a bug in the mortality rate (%) and mortality rate in the next 48h in Hirid/MIMIC:

(c) Attached are few 'cosmetical' comments marked in Red and Yellow.

MIMIC_missing_rate_after_ffill_comments.xls

(d) please write code that generate the rows of Age, Gender, mortality rate (%) and mortality rate in the next 48h and LOS - there are different fonts and it seems that it might be copy-pasted.

PavanReddy28 commented 1 year ago

Update

Race information is only available in MIMIC and is missing in HIRID, so I have excluded it.
Updated files are attached below HIRID_missing_rate_after_ffill.csv MIMIC_missing_rate_after_ffill.csv

dancoster commented 1 year ago

Good.

(a) the commas in the MIMIC and HiRID in "N" col are partially wrong in 7 and 6 digit numbers: "2,00,242" -> "200,242" (b) Please sort the labs+vitals in MIMIC alphabetically (its ok to keep Age,LOS etc n the bottom).

Just to confirm: (c) do you use here only WHITE between 50-80? (d) This table is generated before the unit conversion of the different lab tests?

dancoster commented 1 year ago

(a) remove all comas. (c) WHITE, 40-80. (d) Please, generate this table after unit conversion.

dancoster commented 4 months ago

[1] Write a function that generate exactly the same format of table for both MIMIC and HiRiD:

Add column of missing values after Fill Forward.
Rate of patient with missing rate.
Calculate the for each of the datasets separately from the table (missing rate patient level) (a) mortality rate (subjects level) (b) Percentage of Obs. were mortality in the next 0-48h is positive. (c) Length of Stay. (d) Age (e) Gender Prop [2] Harmonize units of each dataset across datasets.

dancoster commented 4 months ago

generate table based on longitudinal data frame (make sure you exclude based on df_inhuman).
Double check STD after that
two digits after dot.
add comma for thoushands (48321 -> 48,321)

rfeinsod commented 4 months ago

mimic_table_no_units.csv

This is the table generated for longitudinal data. There are several problems with it:

The std is still messed up.
The longitudinal data frame has no units for tests.
Many of the tests have no inhuman values.

rfeinsod commented 4 months ago

basic_mimic_table.csv

This is a better one, I based on the full mimic_extract, and calculated the missing observation rate as based on patient events.

rfeinsod commented 4 months ago

Two unrelated comments Concering the fill_forward rates -what I tried to do was to simply fill any missing observation with the previous one, but to what affect? to see changes in the mean for the table? If so I can easily generate this as well

Could you create a separate ticket for the external calculations? There is a lot of room for interpretation on how do some of the things.

rfeinsod commented 4 months ago

Mimic_table_with_only_relevant_labs.csv

dancoster commented 3 months ago

Generate this table on the raw data from MIMIC. Add this charaterstics: (a) mortality rate (subjects level) (b) Percentage of Obs. were mortality in the next 0-48h is positive. (c) Length of Stay. (d) Age (e) Gender Prop

rfeinsod commented 2 months ago

mimic_general_data.csv

dancoster / DrugLab

Tables #43

MIMIC Missing Rate

HIRID Missing Rate

Comparing MIMIC and HIRID Longitudinal data features

Missing Rate Tables

Updated Missing Rate table

MIMIC Statistics

HIRID Stats

Updated Statistics

Update