Open jraffa opened 6 years ago
OK, need some help with this.
From: https://github.com/MIT-LCP/gossis/blob/master/anzics/load-data.ipynb
It looks like you have this calculated in the notebook. Can you run and send the output?
I was trying to run through the notebook myself. I guess I need access:
DatabaseError: Execution failed on sql 'set search_path to public,eicu_crd_v2, eicu_crd;select * from gossis_cohort': permission denied for relation gossis_cohort
Yeah postgres by default denies you access to my tables. Yay security! I can share when I'm next at a computer.
Fixed!
Thanks. If I am reading the code right, the ANZICS inclusion/exclusions are done on the csv file. Can you run and give me the output like for eICU:
Cohort - initial size: 200859 ICU stays
181 (0.09%) - exclusion_non_adult
1785 (0.89%) - exclusion_missingoutcome
68311 (34.01%) - exclusion_short_or_secondary_stay
256 (0.13%) - exclusion_missing_data
Final cohort size: 131051.0 ICU stays (65.25%).
Here's a mock-up consort diagram. Can you have a look and make sure I am not missing anything (other than the numbers with placeholders obv.) It appears the exclusions are mutually exclusive, but I think it's fine.
They're sequentially applied so yeah the exclusions end up being mutually exclusive.
Removing 863627 admissions outside of [2014,2015].
Initial cohort: 300022 ICU stays.
6093 (2.03%) - patients under 16 years old.
0 (0.00%) - missing outcome.
14215 (4.74%) - in-hospital readmissions.
6283 (2.09%) - patients missing apache prediction.
18568 (6.19%) - missing data (specifically: missing heart rate in first 24 hours variable).
Current cohort: 266136 ICU stays.
Removing 16592 ICU stays from subsequent hospitalizations. Final cohort: 249544 ICU stays.
Hey, sorry for bugging about this tedious stuff.
266k + first set of exclusions = 311295.
I'm assuming they aren't mutually exclusive, otherwise it should be 300k. eICU had the same, which isn't a problem, just want to confirm.
Could also be missing something.
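A quick way to test whether a set of reported exclusions overlaps is to compare the sum of the exclusion counts with the number of stays actually removed. The sketch below uses only the ANZICS counts quoted above; the variable names are illustrative.

```python
# Counts copied from the ANZICS output above; names are illustrative only.
initial = 300022
remaining = 266136
exclusions = {
    "under 16 years old": 6093,
    "missing outcome": 0,
    "in-hospital readmissions": 14215,
    "missing apache prediction": 6283,
    "missing heart rate in first 24 hours": 18568,
}

flagged = sum(exclusions.values())  # a stay is counted once per criterion it meets
removed = initial - remaining       # distinct stays actually dropped
overlap = flagged - removed         # extra counts from stays meeting more than one criterion

print(f"flagged={flagged}, removed={removed}, overlap={overlap}")
```

A non-zero `overlap` means some stays were counted under more than one criterion, which is why 266136 plus the listed exclusions exceeds the initial 300022.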
Oh sorry I should have checked; the exclusions I list there aren't actually mutually exclusive. https://github.com/MIT-LCP/gossis/blob/master/anzics/load-data.ipynb
One last thing (hopefully).
It looks like the missing outcomes exclusion comes from a variable defined in the original ANZICS APD dataset called hosp_outcm, which results in 0 patients being excluded for this.
I'm not sure what that variable is defining, but, looking at what's in the final dataset:
, , data_source = anzics

          hospital_death
icu_death       0      1   <NA>
     0     228430   7085    301
     1         21  13144      3
     <NA>     495     54     11

, , data_source = eicu

          hospital_death
icu_death       0      1   <NA>
     0     119100   4634      0
     1        110   7207      0
     <NA>       0      0      0
In any case I excluded the 301 + 3 + 11 = 315 without a hospital_death, and the numbers are accounted for, but just wanted to make sure I wasn't missing something about hosp_outcm.
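For anyone checking this from Python rather than R, a cross-tabulation that keeps missing values visible can be sketched as below. The toy data frame and the `label` helper are made up for illustration; only the column names icu_death and hospital_death come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; NaN marks a missing outcome.
df = pd.DataFrame({
    "icu_death":      [0, 0, 1, 1, np.nan, 0],
    "hospital_death": [0, 1, 1, np.nan, 0, np.nan],
})

def label(col):
    # Turn 0/1/NaN into string labels so missing values get their own category.
    return col.apply(lambda v: "<NA>" if pd.isna(v) else str(int(v)))

table = pd.crosstab(label(df["icu_death"]), label(df["hospital_death"]))
print(table)

# Rows that lack a hospital outcome, whatever the ICU outcome is.
n_missing_hosp = int(df["hospital_death"].isna().sum())
print(n_missing_hosp)
```

Labelling the missing values explicitly avoids depending on how a given pandas version treats NaN in `crosstab`.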
For ANZICS, the code is:
# remove missing outcomes
idxRem = (df['hosp_outcm']<0) | (df['hosp_outcm']>8)
print('\t{} ({:2.2f}%) - missing outcome.'.format(np.sum(idxRem), np.sum(idxRem)*100.0/df.shape[0]))
idxKeep = (~idxRem) & idxKeep
Which would not remove explicit NULLs, so good catch. hosp_outcm is:
To define the hospital_death variable, I don't use hosp_outcm, but died_hosp, which is present in the raw data as 0/1. Similar story for icu_death, icu_outcm, and died_icu, except no exclusions are applied using icu_outcm.
For eICU, the exclusion is applied using both hospitalDischargeStatus and unitDischargeStatus.
Probably I should do all exclusions so that they end up consistent across all datasets. Do you want me to add logic for the NULL check in ANZICS, and add logic for inconsistency across the board?
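A NULL-aware version of the ANZICS check could look roughly like the sketch below. It assumes, as the quoted code implies, that valid hosp_outcm codes lie in 0 to 8. The toy data and the name `idx_rem_fixed` are invented for illustration, and this is not the code currently in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; valid codes assumed to be 0..8.
df = pd.DataFrame({"hosp_outcm": [1.0, 3.0, np.nan, -1.0, 9.0, 8.0]})

# Original check: comparisons with NaN evaluate to False, so the NULL row is kept.
idx_rem = (df["hosp_outcm"] < 0) | (df["hosp_outcm"] > 8)

# NULL-aware variant: also flag rows where the outcome is missing.
idx_rem_fixed = idx_rem | df["hosp_outcm"].isnull()

print("\t{} ({:2.2f}%) - missing outcome.".format(
    np.sum(idx_rem_fixed), np.sum(idx_rem_fixed) * 100.0 / df.shape[0]))
```

The same `isnull()` term could in principle be added to the eICU status checks so that every dataset treats missing outcomes the same way.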
Let's make that a TODO, but not worry about it for this paper.
It confuses things about when the criteria are applied, but I'm not keen to go back and redo the imputation computation, and I think it's not a big enough problem (I still apply the criteria, just at a different point than when the others are applied).
Ordering of Exclusions as discussed (least controversial in my mind to more controversial):
Let me know if you disagree about any of them. Separating 5 and 6 may not be possible or easy, in which case I'm fine with doing just 6.
Personally I'd swap 1/2, and 3/4.
What are the outcomes?
Don't feel that strongly one way or the other about the swap.
Outcomes: icu|hosp x los|death
Let's chat tomorrow because I'm doing fewer exclusions in the initial data extract than I thought so I want to confirm with you which ones you did on your end!
Init: Removing 863627 admissions outside of [2014,2015].
Now comparing what I did before:
Initial cohort: 300022 ICU stays.
14215 (4.74%) - in-hospital readmissions.
6093 (2.03%) - patients under 16 years old.
0 (0.00%) - missing outcome.
6283 (2.09%) - patients missing apache prediction.
18568 (6.19%) - missing data.
Current cohort: 279333 ICU stays.
Removing 29676 ICU stays from subsequent hospitalizations.
Final cohort: 249657 ICU stays.
... and after changing it to sequential:
Initial cohort: 300022 ICU stays.
14215 (4.74%) - in-hospital readmissions.
18187 (6.06%) - ICU stays from subsequent hospitalizations.
5358 (1.79%) - patients under 16 years old.
1570 (0.52%) - patients missing apache prediction. (** implicitly does "no short stays", but also does more)
0 (0.00%) - missing outcome.
11387 (3.80%) - missing data.
Current cohort: 249305 ICU stays.
There is a small change here because I defined the ICU stay order, and I was defining it after exclusions, so there are some "first" ICU stays which are then filtered out because they are missing outcomes, APACHE pred, etc.
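The sequential counting can be sketched as follows: each criterion is only counted among stays that survived the earlier ones, so the counts add up exactly to the number removed. With the numbers above, 300022 - 14215 - 18187 - 5358 - 1570 - 0 - 11387 = 249305. The toy data and flag column names below are invented for illustration and are not the notebook's.

```python
import pandas as pd

# Toy cohort, invented for illustration; each column flags one exclusion criterion.
df = pd.DataFrame({
    "readmission":  [True, False, False, False, True, False],
    "under_16":     [True, True, False, False, False, False],
    "missing_data": [False, True, True, False, False, False],
})

idx_keep = pd.Series(True, index=df.index)
counts = {}
for name in df.columns:
    # Count a stay only if no earlier criterion has already removed it.
    idx_rem = df[name] & idx_keep
    counts[name] = int(idx_rem.sum())
    idx_keep &= ~idx_rem
    print("\t{} ({:2.2f}%) - {}.".format(
        counts[name], counts[name] * 100.0 / len(df), name))

print("Current cohort: {} ICU stays.".format(int(idx_keep.sum())))
```

Because each stay is attributed to the first criterion it fails, the reported counts depend on the order in which the criteria are applied.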
For the paper:
data_source, regardless if it was included in the final gossis-data-2018-03-20.csv.gz dataset.
gossis-data-2018-03-20.csv.gz (e.g., pre-2014 admissions, <18, etc).