MIT-LCP / gossis

Extracting consistent concepts from multiple databases

Total Sample Sizes / Exclusions for CONSORT #17

Open jraffa opened 6 years ago

jraffa commented 6 years ago

For the paper:

jraffa commented 5 years ago

ANZICS

OK, need some help with this.

From: https://github.com/MIT-LCP/gossis/blob/master/anzics/load-data.ipynb

It looks like you have this calculated in the notebook. Can you run and send the output?

jraffa commented 5 years ago

eICU

I was trying to run through the notebook myself. I guess I need access:

DatabaseError: Execution failed on sql 'set search_path to public,eicu_crd_v2, eicu_crd;select * from gossis_cohort': permission denied for relation gossis_cohort

alistairewj commented 5 years ago

Yeah postgres by default denies you access to my tables. Yay security! I can share when I'm next at a computer.


alistairewj commented 5 years ago

Fixed!


jraffa commented 5 years ago

Thanks. If I am reading the code right, the ANZICS inclusions/exclusions are done on the csv file. Can you run it and send me the output, like this one for eICU:

Cohort - initial size: 200859 ICU stays
    181 (0.09%) - exclusion_non_adult
   1785 (0.89%) - exclusion_missingoutcome
  68311 (34.01%) - exclusion_short_or_secondary_stay
    256 (0.13%) - exclusion_missing_data
Final cohort size: 131051.0 ICU stays (65.25%).

jraffa commented 5 years ago

[Image: screenshot from 2018-11-06 14-59-51]

Here's a mock-up CONSORT diagram. Can you have a look and make sure I am not missing anything (other than the placeholder numbers, obviously)? It appears the exclusions are mutually exclusive, but I think it's fine.

alistairewj commented 5 years ago

They're sequentially applied so yeah the exclusions end up being mutually exclusive.

ANZICS

Removing 863627 admissions outside of [2014,2015].
Initial cohort: 300022 ICU stays.
    6093 (2.03%) - patients under 16 years old.
    0 (0.00%) - missing outcome.
    14215 (4.74%) - in-hospital readmissions.
    6283 (2.09%) - patients missing apache prediction.
    18568 (6.19%) - missing data (specifically: missing heart rate in first 24 hours variable).
Current cohort: 266136 ICU stays.

Removing 16592 ICU stays from subsequent hospitalizations. Final cohort: 249544 ICU stays.

jraffa commented 5 years ago

Hey, sorry for bugging about this tedious stuff.

266k + first set of exclusions = 311295.

I'm assuming they aren't mutually exclusive, otherwise it should be 300k. eICU had the same, which isn't a problem, just want to confirm.

Could also be missing something.
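The arithmetic behind this question can be checked with a small sketch. The counts are copied from the ANZICS output quoted above; the variable names are illustrative only. If the listed exclusions were mutually exclusive, the current cohort plus the exclusion counts would equal the initial cohort.

```python
# Counts copied from the ANZICS output quoted earlier in this thread.
initial = 300022
current = 266136
exclusions = {
    "patients under 16 years old": 6093,
    "missing outcome": 0,
    "in-hospital readmissions": 14215,
    "patients missing apache prediction": 6283,
    "missing data": 18568,
}

flagged = sum(exclusions.values())  # flags raised, counting a stay once per criterion it meets
removed = initial - current         # stays actually removed
surplus = flagged - removed         # extra flags from stays meeting more than one criterion

print(current + flagged)  # 311295, not 300022
print(removed, surplus)   # 33886 11273
```

A non-zero surplus means at least some stays were flagged by more than one criterion, so the per-criterion counts cannot be summed directly in a CONSORT diagram.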

alistairewj commented 5 years ago

Oh sorry I should have checked; the exclusions I list there aren't actually mutually exclusive. https://github.com/MIT-LCP/gossis/blob/master/anzics/load-data.ipynb

jraffa commented 5 years ago

One last thing (hopefully). It looks like the missing outcomes exclusion comes from a variable defined in the original ANZICS APD dataset called hosp_outcm, which results in 0 patients being excluded for this.

I'm not sure what that variable is defining, but, looking at what's in the final dataset:


, , data_source = anzics

         hospital_death
icu_death      0      1   <NA>
     0    228430   7085    301
     1        21  13144      3
     <NA>    495     54     11

, , data_source = eicu

         hospital_death
icu_death      0      1   <NA>
     0    119100   4634      0
     1       110   7207      0
     <NA>      0      0      0

In any case, I excluded the 301 + 3 + 11 = 315 stays without a hospital_death value, and the numbers are accounted for, but I just wanted to make sure I wasn't missing something about hosp_outcm.

alistairewj commented 5 years ago

For ANZICS, the code is:

# remove missing outcomes
idxRem = (df['hosp_outcm']<0) | (df['hosp_outcm']>8)
print('\t{} ({:2.2f}%) - missing outcome.'.format(np.sum(idxRem), np.sum(idxRem)*100.0/df.shape[0]))
idxKeep = (~idxRem) & idxKeep

Which would not remove explicit NULLs, so good catch. hosp_outcm is:

To define the hospital_death variable, I don't use hosp_outcm, but died_hosp, which is present in the raw data as 0/1. Similar story for icu_death, icu_outcm, and died_icu, except no exclusions are applied using icu_outcm.

For eICU, the exclusion is applied using both hospitalDischargeStatus and unitDischargeStatus.

Probably I should do all exclusions so that they end up consistent across all datasets. Do you want me to add logic for the NULL check in ANZICS, and add logic for inconsistency across the board?
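The NULL behavior described in this comment can be reproduced with a minimal sketch. The toy values below are made up, and the `isnull()` variant is only one hypothetical way to add the check, not the change that was actually made to the notebook.

```python
import numpy as np
import pandas as pd

# Toy data; the values are illustrative, not taken from ANZICS.
df = pd.DataFrame({"hosp_outcm": [1.0, 9.0, -1.0, np.nan, 3.0]})

# Range check as in the snippet above: NaN compares False on both sides,
# so the NaN row is not flagged for removal.
idx_range_only = (df["hosp_outcm"] < 0) | (df["hosp_outcm"] > 8)

# Hypothetical variant that also flags explicit NULLs.
idx_with_null = idx_range_only | df["hosp_outcm"].isnull()

print(idx_range_only.tolist())  # [False, True, True, False, False]
print(idx_with_null.tolist())   # [False, True, True, True, False]
```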

jraffa commented 5 years ago

Let's make that a TODO, but not worry about it for this paper.

It confuses things about when the criteria are applied, but I'm not keen to go back and redo the imputation computation, and I think it's not a big enough problem (I still apply the criteria, just at a different point than when the others are applied).

jraffa commented 5 years ago

Ordering of Exclusions as discussed (least controversial in my mind to more controversial):

  1. Exclude subsequent ICU stays which occur in the same hospitalization
  2. Exclude subsequent hospitalizations
  3. Exclude short stays
  4. Exclude non-adults
  5. Exclude missing all outcomes
  6. Exclude missing desired outcome (hospital_death)
  7. Exclude sparse predictor (no HR)

Let me know if you disagree about any of them. Distinguishing 5 from 6 may not be possible or easy, in which case doing just 6 is fine.
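For illustration, here is a rough sketch of applying a subset of the ordered criteria above sequentially, so that each stay is counted only under the first criterion it meets. The field names, toy records, and the `apply_sequential_exclusions` helper are hypothetical; the age cut-off of 16 follows the ANZICS output quoted earlier in the thread, and short stays and "missing all outcomes" are left out for brevity.

```python
def apply_sequential_exclusions(stays, criteria):
    """Apply criteria in order; count each stay under the first criterion it meets."""
    remaining = list(stays)
    counts = []
    for label, is_excluded in criteria:
        kept = [s for s in remaining if not is_excluded(s)]
        counts.append((label, len(remaining) - len(kept)))
        remaining = kept
    return remaining, counts

# Toy stays with hypothetical field names.
stays = [
    {"id": 1, "icu_order": 1, "hosp_order": 1, "age": 54, "hospital_death": 0, "has_hr": True},
    {"id": 2, "icu_order": 2, "hosp_order": 1, "age": 54, "hospital_death": 0, "has_hr": True},
    {"id": 3, "icu_order": 1, "hosp_order": 2, "age": 12, "hospital_death": 0, "has_hr": True},
    {"id": 4, "icu_order": 1, "hosp_order": 1, "age": 15, "hospital_death": None, "has_hr": False},
    {"id": 5, "icu_order": 1, "hosp_order": 1, "age": 70, "hospital_death": None, "has_hr": True},
    {"id": 6, "icu_order": 1, "hosp_order": 1, "age": 33, "hospital_death": 1, "has_hr": False},
]

criteria = [
    ("subsequent ICU stay in same hospitalization", lambda s: s["icu_order"] > 1),
    ("subsequent hospitalization", lambda s: s["hosp_order"] > 1),
    ("non-adult", lambda s: s["age"] < 16),
    ("missing hospital_death", lambda s: s["hospital_death"] is None),
    ("no heart rate", lambda s: not s["has_hr"]),
]

final, counts = apply_sequential_exclusions(stays, criteria)
for label, n in counts:
    print(f"{n:5d} - {label}")
print("Final cohort:", [s["id"] for s in final])  # Final cohort: [1]
```

Because each criterion only sees the stays that survived the earlier ones, the per-criterion counts plus the final cohort add up to the initial cohort, which is what a CONSORT diagram needs. The counts do depend on the order chosen.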

alistairewj commented 5 years ago

Personally I'd swap 1/2, and 3/4.

What are the outcomes?

jraffa commented 5 years ago

Don't feel that strongly one way or the other about the swap.

Outcomes: icu|hosp x los|death

alistairewj commented 5 years ago

Let's chat tomorrow because I'm doing fewer exclusions in the initial data extract than I thought so I want to confirm with you which ones you did on your end!

alistairewj commented 5 years ago

Init: Removing 863627 admissions outside of [2014,2015].

Now comparing what I did before:

Initial cohort: 300022 ICU stays.
    14215 (4.74%) - in-hospital readmissions.
    6093 (2.03%) - patients under 16 years old.
    0 (0.00%) - missing outcome.
    6283 (2.09%) - patients missing apache prediction.
    18568 (6.19%) - missing data.
Current cohort: 279333 ICU stays.
Removing 29676 ICU stays from subsequent hospitalizations.
Final cohort: 249657 ICU stays.

... and after changing it to sequential:

Initial cohort: 300022 ICU stays.
    14215 (4.74%) - in-hospital readmissions.
    18187 (6.06%) - ICU stays from subsequent hospitalizations.
    5358 (1.79%) - patients under 16 years old.
    1570 (0.52%) - patients missing apache prediction. (** implicitly does "no short stays", but also does more)
    0 (0.00%) - missing outcome.
    11387 (3.80%) - missing data.
Current cohort: 249305 ICU stays.

There is a small change here because I defined the ICU stay order, and I was defining it after exclusions, so there are some "first" ICU stays which are then filtered out because they are missing outcomes, APACHE pred, etc.
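One possible reading of this effect, as a toy sketch: selecting the "first" stay per patient after the exclusions can promote a later stay to first, whereas selecting it before the exclusions picks the true first stay, which may then be dropped. The tuples and the `first_stays` helper are hypothetical and are not taken from the notebook.

```python
# Toy data: (patient, admit_time, has_outcome). Values are made up.
stays = [
    ("A", 1, False),  # A's true first stay, but its outcome is missing
    ("A", 2, True),
    ("B", 1, True),
    ("B", 2, True),
]

def first_stays(rows):
    """Return the earliest stay per patient among the given rows."""
    earliest = {}
    for row in sorted(rows, key=lambda r: (r[0], r[1])):
        earliest.setdefault(row[0], row)
    return list(earliest.values())

# Order defined after exclusions: A's second stay is promoted to "first".
after = first_stays([r for r in stays if r[2]])

# Order defined before exclusions: A's true first stay is selected, then dropped.
before = [r for r in first_stays(stays) if r[2]]

print(len(after), len(before))  # 2 1
```

Under this reading, defining the order before the exclusions yields a slightly smaller cohort, which is the direction of the change between the two outputs above (249657 versus 249305).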