dancoster / DrugLab

Repository for the drug<>lab pair
1 stars 0 forks source link

Tables #43

Open dancoster opened 1 year ago

dancoster commented 1 year ago

[1] Write a function that generate exactly the same format of table for both MIMIC and HiRiD:

Image

Please notice that N is the number of observations and the missing rate is of the observations (=number of missing rows). Please change the cols names to "N observations" and "Observations Missing Rate"). Please add also "N subjects" and a missing rate of subjects without any measure of this variable (="Subjects Missing Rate").

[2] Write a function that generate this format of table for MIMIC and HiRid of Missing rates after forward-filling imputation:

Image

Please add column of the "Percentage of subjects with no missing values" (after the fill forward)

Attached is the original table: ICLR_Workshop_2023TSRLH__arXiv (1).pdf

PavanReddy28 commented 1 year ago

MIMIC Missing Rate

Generated using df_not_imputed.csv file file shared. Units are missing.

MIMIC Missing Rate.csv

HIRID Missing Rate

Generated using this file. Some units are missing.

HIRID Missing Rate.csv

PavanReddy28 commented 1 year ago

Comparing MIMIC and HIRID Longitudinal data features

Need to filter out some columns from HIRID Longitudinal data based on the below comparision table.

Image

Merged MIMIC and HIRID.csv

Note: Used the following dictionary to map hirid and mimic features.

hirid_mapping = {
    'Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma' : None,
    'Albumin [Mass/volume] in Serum or Plasma' : 'Albumin',
    'Amylase [Enzymatic activity/volume] in Serum or Plasma': None,
    'Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma' : None,
    'Bicarbonate [Moles/volume] in Arterial blood':'Bicarbonate',
    'Bilirubin.direct [Mass/volume] in Serum or Plasma': None,
    'Bilirubin.total [Moles/volume] in Serum or Plasma' : None,
    'Calcium [Moles/volume] in Blood': 'Calcium',
    'Calcium.ionized [Moles/volume] in Blood': 'Calcium',
    'Carboxyhemoglobin/Hemoglobin.total in Arterial blood': 'Hemoglobin',
    'Chloride [Moles/volume] in Blood': 'Chloride', 
    'Core body temperature': 'Temperature',
    'Creatinine [Moles/volume] in Blood': 'Creatinine', 
    'Diastolic arterial pressure': 'Diastolic blood pressure',
    'Glucose [Moles/volume] in Serum or Plasma': 'Glucose', 
    'Heart rate': 'Heart Rate',
    'Hemoglobin [Mass/volume] in blood': 'Hemoglobin',
    'INR in Blood by Coagulation assay': 'Prothrombin time INR', 
    'Lactate [Mass/volume] in blood': 'Lactic acid',
    'Lymphocytes [#/volume] in Blood': 'Lymphocytes', 
    'Magnesium [Moles/volume] in Blood': 'Magnesium',
    'Methemoglobin/Hemoglobin.total in Arterial blood': 'Hemoglobin',
    'Neutrophils/100 leukocytes in Blood': 'Neutrophils', 
    'Peripheral oxygen saturation': 'Oxygen saturation',
    'Platelets [#/volume] in Blood': 'Platelets', 
    'Potassium [Moles/volume] in Blood': 'Potassium',
    'Pulmonary artery diastolic pressure': 'Diastolic blood pressure',
    'Pulmonary artery systolic pressure': 'Systolic blood pressure', 
    'Respiratory rate': 'Respiratory rate',
    'Sodium [Moles/volume] in Blood': 'Sodium', 
    'Systolic arterial pressure': 'Systolic blood pressure'
}
dancoster commented 1 year ago

Thanks Pavan, could you please add a column of 'Missing Rate(%) after ffill)'. (Trying to replicate the second table). I think we can have only one table.

dancoster commented 1 year ago

Please also add a col of 'Subjects Missing Rate(%)' - i.e.,the precentage of subjects who had even no one measurement of this feature.

dancoster commented 1 year ago

Please add also "N subjects" (the total number of subjects who had this measurement).

dancoster commented 1 year ago

Calculate the for each of the datasets seperately from the table: (a) mortality rate (subjects level) (b) Percentage of Obs. were mortality in the next 0-48h is positive. (c) Length of Stay. (d) Age (e) Gender Prop/

PavanReddy28 commented 1 year ago

Missing Rate Tables

Before and After ffill's

MIMIC_missing_rate_after_ffill.csv

HIRID_missing_rate_after_ffill.csv

dancoster commented 1 year ago

df[temp_new_lab_name] = df.groupby('subject_id')[temp_new_lab_name].ffill()

PavanReddy28 commented 1 year ago

Updated Missing Rate table

MIMIC_missing_rate_after_ffill.csv HIRID_missing_rate_after_ffill.csv

MIMIC Statistics

Image

Gender Stats (Subject level) : MIMIC_gender.csv Age and LOS stats : MIMIC_age_los.csv Mortality Stats (Subject level) : MIMIC_mortality_stats.csv Mortality Observations : MIMIC_mortality_stats_obs.csv

PavanReddy28 commented 1 year ago

HIRID Stats

Image

Age and LOS stats : HIRID_age_los.csv Gender stats : HIRID_gender.csv Mortality Observations count : HIRID_mortality_stats_obs.csv Mortality Stats (Subject level) : HIRID_mortality_stats.csv

dancoster commented 1 year ago

Mortality - subject level (required output is percentage) Label_48 - observation level (required output is percentage) Age and LOS should be calculate on subject level (each subject have 1 value of age and 1 value of LOS). The best will be to create a function that will print all these outputs. For gender please report also Males %.

PavanReddy28 commented 1 year ago

Updated Statistics

Used the below class for generating the statitics.

class OtherStatsGenerator:
    def __init__(self, extract, input_type):
        self.extract = extract
        self.input_type = input_type

    def gender_percentage(self, mimic_extract):
        gender = mimic_extract[["GENDER", "hadm_id"]].drop_duplicates().groupby(["GENDER"]).count()
        total = gender.sum()[0]
        return gender.apply(lambda row : row["hadm_id"] * 100 / total, axis=1)

    def age_los_subject_level(self, mimic_extract):
        age_los = mimic_extract.round(2).groupby(["hadm_id", "age", "LOS"]).nth(0).reset_index()[["hadm_id", "age", "LOS"]]
        return age_los

    def label48_subject_level(self, mimic_extract):
        lab48_df = mimic_extract[["Mortality", "LABEL_48", "hadm_id"]].drop_duplicates().groupby(["Mortality", "LABEL_48"]).count()
        total_adm = lab48_df.sum()
        return lab48_df.apply(lambda row : row["hadm_id"]*100/total_adm, axis=1)

    def label48_observation_level(self, mimic_extract):
        lab48_df = mimic_extract[["Mortality", "LABEL_48", "hadm_id"]].groupby(["Mortality", "LABEL_48"]).count()
        total_adm = lab48_df.sum()
        return lab48_df.apply(lambda row : row["hadm_id"]*100/total_adm, axis=1)

    def mortality_subject_level(self, mimic_extract):
        mortality_df = mimic_extract[["Mortality", "hadm_id"]].drop_duplicates().groupby(["Mortality"]).count()
        total_adm = mortality_df.sum()
        return mortality_df.apply(lambda row : row["hadm_id"]*100/total_adm, axis=1)

    def generate_all_stats(self):
        extract = self.extract
        return self.mortality_subject_level(extract), self.label48_observation_level(extract), self.age_los_subject_level(extract), self.gender_percentage(extract)

    def save_stats(self, stats):
        map_d = {
            0:f"{self.input_type} Mortality Subject Level.csv",
            1:f"{self.input_type} Label48 Observation Level.csv",
            2:f"{self.input_type} Age and LOS subject level.csv",
            3:f"{self.input_type} Gender Subject Level.csv"
        }
        for i in range(len(mimic_other_stats_res)):
            mimic_other_stats_res[i].to_csv(os.path.join("/Users/pavan/Desktop/TAU/DrugLab/results/stats", map_d[i]))

The above code can be run using below code

mimic_extract["charttime"] = pd.to_datetime(mimic_extract["charttime"])
mimic_extract["DEATHTIME"] = pd.to_datetime(mimic_extract["DEATHTIME"])
mimic_extract["DISCHTIME"] = pd.to_datetime(mimic_extract["DISCHTIME"])
mimic_extract["ADMITTIME"] = pd.to_datetime(mimic_extract["ADMITTIME"])
mimic_extract["LOS"] = mimic_extract["DISCHTIME"] - mimic_extract["ADMITTIME"]
mimic_extract["Mortality"] = mimic_extract.DEATHTIME.isna().apply(lambda row: 0 if row else 1)
mimic_extract["LABEL_48"] = mimic_extract.apply(lambda row: 1 if row["Mortality"]==1 and row["LOS"]<pd.Timedelta(hours=48) else 0, axis=1)
mimic_other_stats_gen = OtherStatsGenerator(mimic_extract, input_type="mimic")
mimic_other_stats_res = mimic_other_stats_gen.generate_all_stats()
mimic_other_stats_gen.save_stats(mimic_other_stats_res)

hirid_parser = hirid.HiRiDParser(data=raw_path, res=res_path, gender=gender, age_b=age_b, age_a=age_a, load="MANUAL_MAPPING_HIRID")
hirid_extract = pd.merge(hirid_extract.rename(columns={"HADM_ID":"hadm_id", "AGE":"age"}), hirid_parser.g_table[["patientid", "admissiontime"]].rename(columns={"patientid":"hadm_id", "admissiontime":"ADMITTIME"}), on="hadm_id")
hirid_extract["Mortality"] = hirid_extract.discharge_status.apply(lambda r: 0 if r=="alive" else 1)
hirid_extract["LOS"] = pd.to_datetime(hirid_extract["EST_DISCHTIME"]) - pd.to_datetime(hirid_extract["ADMITTIME"])
hirid_other_stats_gen = OtherStatsGenerator(hirid_extract, input_type="hirid")
hirid_other_stats_res = hirid_other_stats_gen.generate_all_stats()
hirid_other_stats_gen.save_stats(hirid_other_stats_res)

hirid Gender Subject Level.csv hirid Age and LOS subject level.csv hirid Label48 Observation Level.csv hirid Mortality Subject Level.csv mimic Gender Subject Level.csv mimic Age and LOS subject level.csv mimic Label48 Observation Level.csv mimic Mortality Subject Level.csv

dancoster commented 1 year ago

Please merge all to this parameters to this table:

Image

dancoster commented 1 year ago
  1. Add to table only the colomon of Missing rate after ffill. https://github.com/dancoster/DrugLab/issues/43#issuecomment-1518502863
  2. round to 2 digits after dot in 'Subjects Missing Rate(%)'
  3. For count measurments (N and N subjects) add coma (i.e. 1000 -> 1,000).
  4. Add the 'demographic' variables to the table in this manner:
    • Please add Race proportions.

Image

PavanReddy28 commented 1 year ago

HIRID_missing_rate_after_ffill.xls MIMIC_missing_rate_after_ffill.xls

dancoster commented 1 year ago

Looks better, few comments: (a) Please add Race proportions. (b) I think that there is a bug in the mortality rate (%) and mortality rate in the next 48h in Hirid/MIMIC:

Image

(c) Attached are few 'cosmetical' comments marked in Red and Yellow.

MIMIC_missing_rate_after_ffill_comments.xls

(d) please write code that generate the rows of Age, Gender, mortality rate (%) and mortality rate in the next 48h and LOS - there are different fonts and it seems that it might be copy-pasted.

PavanReddy28 commented 1 year ago

Update

  1. Race information is only available in MIMIC and is missing in HIRID, so I have excluded it.
  2. Updated files are attached below HIRID_missing_rate_after_ffill.csv MIMIC_missing_rate_after_ffill.csv
dancoster commented 1 year ago

Good.

(a) the commas in the MIMIC and HiRID in "N" col are partially wrong in 7 and 6 digit numbers: "2,00,242" -> "200,242" (b) Please sort the labs+vitals in MIMIC alphabetically (its ok to keep Age,LOS etc n the bottom).

Just to confirm: (c) do you use here only WHITE between 50-80? (d) This table is generated before the unit conversion of the different lab tests?

dancoster commented 1 year ago

(a) remove all comas. (c) WHITE, 40-80. (d) Please, generate this table after unit conversion.

dancoster commented 2 weeks ago

[1] Write a function that generate exactly the same format of table for both MIMIC and HiRiD:

Image

dancoster commented 1 week ago
rfeinsod commented 1 week ago

mimic_table_no_units.csv

This is the table generated for longitudinal data. There are several problems with it:

  1. The std is still messed up.
  2. The longitudinal data frame has no units for tests.
  3. Many of the tests have no inhuman values.
rfeinsod commented 1 week ago

basic_mimic_table.csv

This is a better one, I based on the full mimic_extract, and calculated the missing observation rate as based on patient events.

rfeinsod commented 1 week ago

Two unrelated comments Concering the fill_forward rates -what I tried to do was to simply fill any missing observation with the previous one, but to what affect? to see changes in the mean for the table? If so I can easily generate this as well

Could you create a separate ticket for the external calculations? There is a lot of room for interpretation on how do some of the things.

rfeinsod commented 1 week ago

Mimic_table_with_only_relevant_labs.csv