QC

atrisovic commented 2 years ago

Up-to-date

[ ] Keep people >= 65 y.o.
[ ] Name or drop "Unnamed" columns.
[ ] Consider dropping the columns "VERSION.x" and "VERSION.y"
[ ] Consider dropping the columns "SURVEYYR.y" and "SURVEYYR.x"
[ ] Remove all folks from 2016 who are not in 2015 if they didn't die
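A minimal sketch of the first four checklist items in plain Python, assuming each record is a dict and that the age column is called `age` (the thread also mentions `age_2015`/`age_2016`, so the real name may differ). The checklist says "name or drop" for the "Unnamed" columns; this sketch simply drops them, and it also drops rows with a missing age, which is an assumption. The sample rows are made up for illustration.

```python
# Columns the checklist suggests dropping (names taken from the checklist).
DROP_COLUMNS = {"VERSION.x", "VERSION.y", "SURVEYYR.x", "SURVEYYR.y"}


def qc_rows(rows, age_column="age", min_age=65):
    """Keep people aged >= min_age and drop unwanted columns.

    `age_column` is an assumed name; adjust it to the merged file.
    """
    cleaned = []
    for row in rows:
        age = row.get(age_column)
        if age is None or age < min_age:
            continue  # assumption: rows with a missing age are dropped too
        cleaned.append({
            key: value
            for key, value in row.items()
            if key not in DROP_COLUMNS and not key.startswith("Unnamed")
        })
    return cleaned


# Made-up example rows, for illustration only.
sample = [
    {"QID": "A1", "age": 70, "Unnamed: 0": 0, "VERSION.x": 1, "SURVEYYR.y": 2015},
    {"QID": "B2", "age": 64, "Unnamed: 0": 1, "VERSION.x": 1, "SURVEYYR.y": 2015},
]
result = qc_rows(sample)
print(result)
```

The last checklist item (people present in 2016 but not in 2015) is left out because the thread does not pin down the exact rule.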

Obsolete

If "zip_2015" or "zip_2016" are invalid, copy the other one. Keep both zip-s.
Check if "age_2015" is consistent with "age_2016" (age_2016-age_2015<=2) and drop "age_2016". Rename column to "age"
Check if "race_2015" == "race_2016"; if not, say "other". Rename to "race"
Make single columns for dodflag and dod (ie, "dodflag_2015" and "dodflag_2016" to "dodflag").
Select where "hmo_mo" is 0 and drop columns "hmo_mo_2016" and "hmo_mo_2015".
Check if "sex_2015" == "sex_2016"; if not, keep one of the options (choose male/female over other) and drop the other. Rename to "sex"
Check if "SURVEYYR.y" and "SURVEYYR.x" is always 2015; drop both columns.

daniellebraun commented 2 years ago

hi, these are really strong assumptions. for example, zip codes may change over the years (as people move) and we would like to keep both; for each year we would like a zip code. i also don't think we should disregard race completely if there are inconsistencies. also, hmo_mo is restricted to 0 only for the second dataset on cardio outcomes; for the first we aren't making this restriction.

atrisovic commented 2 years ago

Hey,

I updated the zip change
We don't disregard it, we just say if a person is (for example) 'white' one year and 'black' the other, we call them mixed race ('other'). I think that may be better than picking one over the other.
HMO_MOs come from MBSF 15-16. Are you saying we make an HMO_MO cut only on MBSF 15 and not both 15-16? In that case, should we use QIDs as a first column only from MBSF 15 (and then append other columns onto it)?
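The race rule described above could be sketched as follows. This is only an illustration, assuming the columns are named `race_2015` and `race_2016` as in the checklist; the helper names and sample values are hypothetical. The second helper counts how often the two years disagree, which helps judge how much the rule matters in practice.

```python
def reconcile_race(race_2015, race_2016):
    """Return the shared race value, or 'other' when the two years disagree.

    Missing values are not treated specially in this sketch.
    """
    return race_2015 if race_2015 == race_2016 else "other"


def inconsistency_rate(rows):
    """Fraction of rows whose race differs between 2015 and 2016."""
    if not rows:
        return 0.0
    mismatches = sum(1 for r in rows if r["race_2015"] != r["race_2016"])
    return mismatches / len(rows)


# Made-up example rows, for illustration only.
people = [
    {"race_2015": "white", "race_2016": "white"},
    {"race_2015": "white", "race_2016": "black"},
]
races = [reconcile_race(p["race_2015"], p["race_2016"]) for p in people]
rate = inconsistency_rate(people)
print(races, rate)
```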

daniellebraun commented 2 years ago

for race, im not sure if other is better, how often do these inconsistencies happen? also is this race based on MCBS or MBSF or MedPAR?
we need two datasets for evan, one for the mortality outcome and one for cvd. for mortality we dont need to cut on HMO, for cvd we do. for cvd we would cut on both 2015 and 2016, so they would have to be continuously in FFS. is @laurenflynn also working on this?

atrisovic commented 2 years ago

Race comes from MBSF, but it's clean data, so probably there won't be many (any?) inconsistencies. (MCBS is a tiny subset here, so it's not efficient to take demographics from there.)
Can we get away with creating a single dataset? Here we already have: dod_date, dod_flag, cvd_15_flag, cvd_16_flag, cvd_15_adate, cvd_16_adate. Maybe we can just leave in HMO_MO 15-16 to be applied at analysis stage?

Yes, she merged everything (👏) and this issue is a checklist for QC.

daniellebraun commented 2 years ago

i dont think we can get away with one dataset since it will cut our mortality data in half if we restrict to hmo=0 and for mortality outcome there is no need for such restriction... since we are so tight on sample size i dont think we can afford this.

atrisovic commented 2 years ago

No no, we keep all hmos and have a single dataset with both cvds and dods (and the hmos).

(The two datasets are essentially one selection away (hmo==0), so most of the data would be duplicated in that case.)

daniellebraun commented 2 years ago

yeah just save one big dataset, but then subset the data for him as requested. the mortality file doesnt need to restrict to hmo. the cvd file needs to restrict to 0 hmo but will need mortality info as well, since mortality is a censoring event if it happens before cvd
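A rough sketch of deriving both files from one saved dataset, assuming records are dicts with `hmo_mo_2015` and `hmo_mo_2016` columns (names from the checklist) alongside the `dod_flag` and `cvd_16_flag` columns mentioned earlier. The function names and sample rows are hypothetical.

```python
def mortality_subset(rows):
    """Mortality file: no HMO restriction, so every row is kept."""
    return list(rows)


def cvd_subset(rows):
    """CVD file: keep only people with hmo_mo == 0 in both 2015 and 2016.

    Rows are returned whole, so the death columns stay available
    for censoring in the CVD analysis.
    """
    return [
        r for r in rows
        if r["hmo_mo_2015"] == 0 and r["hmo_mo_2016"] == 0
    ]


# Made-up example rows, for illustration only.
full = [
    {"QID": "A1", "hmo_mo_2015": 0, "hmo_mo_2016": 0, "dod_flag": 0, "cvd_16_flag": 1},
    {"QID": "B2", "hmo_mo_2015": 0, "hmo_mo_2016": 3, "dod_flag": 1, "cvd_16_flag": 0},
    {"QID": "C3", "hmo_mo_2015": 12, "hmo_mo_2016": 12, "dod_flag": 0, "cvd_16_flag": 0},
]
mortality = mortality_subset(full)
cvd = cvd_subset(full)
print(len(mortality), [r["QID"] for r in cvd])
```

Keeping the filter as a function applied at analysis time means the full dataset is stored only once.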

laurenflynn commented 2 years ago

I currently have the data set restricting to hmo_mo==0 but I can remove this filter so that it can be filtered later during the analysis. Then I will work on the QC checkpoints Ana has listed.

NSAPH-Projects / mcbs-mbsf-exploratory

QC #2
