EmmaRocheteau / TPC-LoS-prediction

This repository contains the code used for Temporal Pointwise Convolutional Networks for Length of Stay Prediction in the Intensive Care Unit (https://dl.acm.org/doi/10.1145/3450439.3451860).
https://dl.acm.org/doi/10.1145/3450439.3451860
MIT License
76 stars 25 forks source link

Issues preprocessing MIMIC-IV using BigQuery #15

Open isaacmg opened 1 year ago

isaacmg commented 1 year ago

Hi I'm having trouble preprocessing the MIMIC-IV with Big Query. I'm using the their query translation tool however I'm getting errors. I'm trying to translate the following query to BQ

create table ld_commonlabs as
  -- extracting the itemids for all the labevents that occur within the time bounds for our cohort
  with labsstay as (
    select l.itemid, la.stay_id
    from labevents as l
    inner join ld_labels as la
      on la.hadm_id = l.hadm_id
    where l.valuenum is not null  -- stick to the numerical data
      -- epoch extracts the number of seconds since 1970-01-01 00:00:00-00, we want to extract measurements between
      -- admission and the end of the patients' stay
      and (date_part('epoch', l.charttime) - date_part('epoch', la.intime))/(60*60*24) between -1 and la.los),
  -- getting the average number of times each itemid appears in an icustay (filtering only those that are more than 2)
  avg_obs_per_stay as (
    select itemid, avg(count) as avg_obs
    from (select itemid, count(*) from labsstay group by itemid, stay_id) as obs_per_stay
    group by itemid
    having avg(count) > 3)  -- we want the features to have at least 3 values entered for the average patient
  select d.label, count(distinct labsstay.stay_id) as count, a.avg_obs
    from labsstay
    inner join d_labitems as d
      on d.itemid = labsstay.itemid
    inner join avg_obs_per_stay as a
      on a.itemid = labsstay.itemid
    group by d.label, a.avg_obs
    -- only keep data that is present at some point for at least 25% of the patients, this gives us 45 lab features
    having count(distinct labsstay.stay_id) > (select count(distinct stay_id) from ld_labels)*0.25
    order by count desc;

My resulting big query sql is :

CREATE TABLE mimic_iv.ld_commonlabs
  AS
    WITH labsstay AS (
      SELECT
          --  extracting the itemids for all the labevents that occur within the time bounds for our cohort
          l.itemid,
          la.stay_id
        FROM
          physionet-data.mimiciv_hosp.labevents AS l
          INNER JOIN mimic_iv.ld_labels AS la ON la.hadm_id = l.hadm_id
        WHERE l.valuenum IS NOT NULL
         AND (UNIX_SECONDS(CAST(CAST(l.charttime as DATE) AS TIMESTAMP)) - CAST(UNIX_SECONDS(CAST(CAST(la.intime as DATE) AS TIMESTAMP)) as FLOAT64)) / (60 * 60 * 24) BETWEEN -1 AND la.los
    ), avg_obs_per_stay AS (
      SELECT
          --  stick to the numerical data
          --  epoch extracts the number of seconds since 1970-01-01 00:00:00-00, we want to extract measurements between
          --  admission and the end of the patients' stay
          --  getting the average number of times each itemid appears in an icustay (filtering only those that are more than 2)
          obs_per_stay.itemid,
          avg(CAST(obs_per_stay.count as BIGNUMERIC)) AS avg_obs
        FROM
          (
            SELECT
                labsstay.itemid,
                count(*) AS count
              FROM
                labsstay
              GROUP BY 1, labsstay.stay_id
          ) AS obs_per_stay
        GROUP BY 1
        HAVING avg(CAST(obs_per_stay.count as BIGNUMERIC)) > 3
    )
    SELECT
        --  we want the features to have at least 3 values entered for the average patient
        d.label,
        count(DISTINCT labsstay.stay_id) AS count,
        a.avg_obs
      FROM
        labsstay
        INNER JOIN physionet-data.mimiciv_hosp.d_labitems AS d ON d.itemid = labsstay.itemid
        INNER JOIN avg_obs_per_stay AS a ON a.itemid = labsstay.itemid
      GROUP BY 1, 3
      HAVING count(DISTINCT labsstay.stay_id) > (
        SELECT
            --  only keep data that is present at some point for at least 25% of the patients, this gives us 45 lab features
            count(DISTINCT labsstay.stay_id) AS count
          FROM
            mimic_iv.ld_labels
      ) * NUMERIC '0.25'

However, this is producing the following error

An expression references labsstay.stay_id which is neither grouped nor aggregated at [46:28]

I'm not very good with SQL and I had other issues setting up the PostgresSQL database locally. Maybe you could help explain what this query is doing and how to better translate it to the Google Big Query style as I would like to generate the CSV files

Thanks.

SpeedRanger commented 5 months ago

Hey @isaacmg . Were you able to resolve this?