MIT-LCP / mimic-code

MIMIC Code Repository: Code shared by the research community for the MIMIC family of databases
https://mimic.mit.edu
MIT License
2.59k stars 1.52k forks source link

Extract data about patients before first ICU admission after hospital first admission #547

Closed sudarshan85 closed 5 years ago

sudarshan85 commented 5 years ago

Prerequisites

Description

Hi,

I am working on a project which requires me to gather ALL the data about patients starting from their first hospital admission till their first ICU admission. The objective of the project is to analyze patient data to estimate the probability that the patient might require ICU admission using the data prior to ICU admission.

My approach is to start with the patients table which has the unique patient identifier subject_id and do a cascade of inner joins with all other tables one at a time on the subject_id column while gathering the information in those other tables. I started with admissions, icustays, cptevents, noteevents and other tables.

However, I did not get information from the chartevents tables, as I wasn’t sure how to integrate the information there with what I had.

These are the queries I have till now:

create view first as select pat.subject_id, pat.gender, pat.dob, 
pat.dod, pat.dod_hosp, pat.dod_ssn, adm.hadm_id, adm.admittime,
adm.dischtime, adm.deathtime, adm.admission_type,
adm.admission_location, adm.discharge_location, adm.insurance,
adm.religion, adm.marital_status, adm.ethnicity, adm.edregtime,
adm.edouttime, adm.diagnosis from patients pat inner join admissions adm
on pat.subject_id=adm.subject_id where adm.has_chartevents_data=1;
create view second as select f.subject_id, f.gender, f.dob, f.dod, f.dod_hosp,
f.dod_ssn, f.hadm_id, f.admittime, f.dischtime, f.deathtime,
f.admission_type, f.admission_location, f.discharge_location,
f.insurance, f.religion, f.marital_status, f.ethnicity, f.edregtime, f.edouttime,
f.diagnosis, icu.icustay_id, icu.dbsource, icu.first_careunit, icu.first_wardid,
icu.intime from first f inner join icustays icu on f.subject_id=icu.subject_id;
create view third as select s.subject_id, s.gender, s.dob, s.dod, s.dod_hosp,
s.dod_ssn, s.hadm_id, s.admittime, s.dischtime, s.deathtime,
s.admission_type, s.admission_location, s.discharge_location, s.insurance,
s.religion, s.marital_status, s.ethnicity, s.edregtime, s.edouttime, s.diagnosis,
s.icustay_id, s.dbsource, s.first_careunit, s.first_wardid, s.intime,
cpt.costcenter, cpt.chartdate, cpt.cpt_cd, cpt.cpt_number, cpt.ticket_id_seq,
cpt.sectionheader, cpt.subsectionheader, cpt.description from
second s inner join cptevents cpt on s.subject_id=cpt.subject_id
where cpt.cpt_suffix is null;

I haven’t gotten any further. I wanted to check if this is the easiest and correct way to get the data I need. This seems tedious and I’m not even sure whether these queries will get exactly what I need. Any help is greatly appreciated.

Thanks.

sudarshan85 commented 5 years ago

Based off the SQL script for getting ICU details from here, I was able to make some progress. My code is shown below.

drop materialized view if exists co cascade;
create materialized view co as

-- patient level factors
select pat.subject_id, pat.gender, pat.dob, pat.dod

-- hospital level factors
, adm.hadm_id, adm.admittime, adm.dischtime, adm.deathtime, adm.admission_type
, adm.insurance, adm.language, adm.religion, adm.marital_status, adm.ethnicity
, adm.diagnosis

-- icu level factors
, icu.icustay_id, icu.intime, icu.outtime, icu.los

-- noteevents
, ne.chartdate, ne.charttime, ne.storetime, ne.category, ne.description, ne.text
-- in hours 
, round((cast(extract(epoch from ne.storetime - ne.charttime)/(60*60) as numeric)), 2) as
store_period

-- in hours 
, round((cast(extract(epoch from ne.charttime - adm.admittime)/(60*60) as numeric)), 2) as
note_period
, case
  when cast(extract(epoch from ne.charttime - adm.admittime)/(60*60) as numeric) <
    cast(extract(epoch from icu.intime - adm.admittime)/(60*60) as numeric) then true
  else false end as note_flag 

-- in years
, round((cast(extract(epoch from adm.admittime - pat.dob)/(60*60*24*365.242) as numeric)), 2) as
admission_age
-- in days
, round((cast(extract(epoch from adm.dischtime - adm.admittime)/(60*60*24) as numeric)), 2) as
los_hospital

-- wait time between hospital admission and icu intime in hours
, round((cast(extract(epoch from icu.intime - adm.admittime)/(60*60) as numeric)), 2) as wait_period

-- mark the first hospital stay
, case
  when dense_rank() over (partition by pat.subject_id order by adm.admittime) = 1 then true
-- include in first_hosp_stay if its been atleast a month since prvious admission. Using lag() as
-- shown here: http://bit.ly/2KpJaeg
  when round((cast(extract(epoch from adm.admittime - lag(adm.admittime, 1) over (partition by
    pat.subject_id order by adm.admittime) )/(60*60*30) as numeric)), 2) > 30.0 then true
  else false end as adm_flag

-- mark the first icu stay for current hospital admission
, case
  when dense_rank() over (partition by adm.hadm_id order by icu.intime) = 1 then true
  else false end as icu_flag

from patients pat
inner join admissions adm
  on adm.subject_id = pat.subject_id
inner join icustays icu
  on icu.hadm_id = adm.hadm_id
inner join noteevents ne
  on ne.hadm_id = adm.hadm_id
where adm.has_chartevents_data = 1
and ne.iserror is null
-- and ne.charttime between adm.admittime and icu.intime
order by pat.subject_id, adm.admittime, icu.intime;

I noticed a couple of things:

  1. For some records adm.admittime > ne.charttime. How is it possible for there to be a clinical note for a patient before they were admitted?
  2. I want to restrict the notes to the ones that were written when a patient was admitted but before they were sent to their first ICU visit. I used two approaches for this:
    case
    when cast(extract(epoch from ne.charttime - adm.admittime)/(60*60) as numeric) <
    cast(extract(epoch from icu.intime - adm.admittime)/(60*60) as numeric) then true
    else false end as note_flag 

    and

    and ne.charttime between adm.admittime and icu.intime

    When I count the number of records with note_flag=true using the first approach, it gives a different result compared to the number of records in the view using the second approach. Shouldn't both be the same?

Thanks.

alistairewj commented 5 years ago

Unfortunately I do not believe MIMIC-III is the correct database for your question. In the creation of MIMIC-III, we have limited our data collection to patients who were admitted to the ICU. Consequently, every patient hospitalization in the database will ultimately "result" in an ICU stay, but the reality is only a small proportion of hospital stays involve visits to the ICU. Any cohort you derive from MIMIC-III will be biased and your models will not be meaningful. I would highly recommend choosing a different project. Off the top of my head, you could try readmission prediction, where the question is changed to "how many people come back to the ICU?", but it's worth noting that this is a very well studied area so you would need to review the literature before beginning.

To answer your specific questions;

  1. Some notes are collected outside the ICU stay; radiology reports, echocardiography studies, ECG reports, and discharge summaries are all available outside the ICU stay.
  2. Not 100% sure, but it's probably because some notes are documented a little bit before the hospital admission, which would be filtered out by requiring ne.charttime >= adm.admittime in the second query.

In general, the best approach I've found for collecting datasets is to make views focused on a single domain (e.g. the notes) which extract a single concept per patient (e.g. their first note in their hospitalization). Doing it this way makes it easy to verify that you haven't accidentally duplicated rows (all the row counts for each view should be similar) and makes the logic easier. You can see in the first-day subfolder of concepts that's the approach I took.

sudarshan85 commented 5 years ago

Thank you @alistairewj for your detailed reply.

Unfortunately I do not believe MIMIC-III is the correct database for your question. In the creation of MIMIC-III, we have limited our data collection to patients who were admitted to the ICU. Consequently, every patient hospitalization in the database will ultimately "result" in an ICU stay, but the reality is only a small proportion of hospital stays involve visits to the ICU. Any cohort you derive from MIMIC-III will be biased and your models will not be meaningful.

You are 100% correct. I realized it as I was gathering the data and found out a large percentage of patients end up in the ICU anyway. I have modified my problem as such: "Given x hours of data about a patient, predict whether the patient will visit the ICU in the next y hours". For example, my training window is 5 hours after admission and prediction window is 2 hours before ICU admit.

Off the top of my head, you could try readmission prediction, where the question is changed to "how many people come back to the ICU?", but it's worth noting that this is a very well studied area so you would need to review the literature before beginning.

This is a great idea and I will consider it. I have seen papers with readmission prediction for hospital admissions, but not so much for ICU.

In general, the best approach I've found for collecting datasets is to make views focused on a single domain (e.g. the notes) which extract a single concept per patient (e.g. their first note in their hospitalization). Doing it this way makes it easy to verify that you haven't accidentally duplicated rows (all the row counts for each view should be similar) and makes the logic easier.

This is great advise and honestly, I think I am going in the opposite directly. I'm just starting out with MIMIC and this is really helpful.

Some notes are collected outside the ICU stay; radiology reports, echocardiography studies, ECG reports, and discharge summaries are all available outside the ICU stay.

In my post, I was referring to hospital admission admittime and not ICU stay time. What I found out was, some notes had charttime earlier than admittime from the admissions table.

Not 100% sure, but it's probably because some notes are documented a little bit before the hospital admission, which would be filtered out by requiring ne.charttime >= adm.admittime in the second query.

I ended up sticking with between command. My script for extracting the data is here. Like you mentioned, I am concerned that there might be duplicate entries. Is there an easy way to check these after the fact?

Thanks.