This is much more daunting. See the notes on the previous ticket on what looking at MIMIC suggests.
Also, I did a count(*)... group by... grouping on almost all the columns (demographics, comorbidities, treatments) to see how many records are unique in those dimensions, and a shockingly large proportion of the records are. Some mathematical reasoning shows that we shouldn't have been surprised by this. There are 17 comorbidity categories in the checkboxes and 34 finer-grained comorbidity categories in the spreadsheet columns; 2^17 = 131,072 and 2^34 = 17,179,869,184. Even if we ignore the comorbidities and only consider some of the demographics: with 4 locations, 91 ages, 2 sexes, 5 ethnicities, 5 SES bins, and 4 history-of-smoking types, there are 4 × 91 × 2 × 5 × 5 × 4 = 72,800 possible combinations of just those values, which is larger than the number of results we have, so even if records were spread evenly, most combinations could hold at most one record.
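For reference, a minimal sketch of that uniqueness check in pandas (the column names and file path here are hypothetical stand-ins for the real extract's):

```python
import pandas as pd

# Hypothetical column names; substitute the extract's real demographic columns.
QUASI_IDS = ["location", "age", "sex", "ethnicity", "ses_bin", "smoking_history"]

df = pd.read_csv("extract.csv")  # placeholder path

# Size of each combination group; a record is unique iff its group has size 1.
group_sizes = df.groupby(QUASI_IDS, dropna=False).size()
n_unique = int((group_sizes == 1).sum())

print(f"{len(group_sizes):,} distinct combinations across {len(df):,} records")
print(f"{n_unique:,} records ({n_unique / len(df):.1%}) are unique in these dimensions")
```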
People may identify, or believe they have identified, specific individuals in this dataset from these combinations of features. Very bad, given the sensitivity of the issue. ("This must be Bob! Dammit, he said his test was negative. I thought he looked flushed at the dinner party. Grr...")
Maybe the way forward, if we want to be as conservative about this as MIMIC, is to do what they did: hand the dataset over to PhysioNet and have PhysioNet vet the potential audience. Needs discussion.
Anyway, possible specific to-do's:
Knock down reported viral loads to only 2 significant digits (sketch after the to-do list). Not sure how relevant this is: the original Ct value is only reported to maybe 3? significant digits, and Ct2vl is deterministic (right?), so in practice we should only see discrete viral-load values anyway. But, hey, it's easy (and technically correct).
Needs discussion: Did we decide to ditch the pseudonymous IDs (patient_id and specimen_id)? I believe we decided that, without date information, these are not useful enough to be worth the risk.
Jitter age. Code is in place in shovel to jitter age on new records, but something must still be run over the already-extracted data to jitter ages there (sketch after the to-do list).
Needs discussion: And other continuous variables? shovel#14 says to jitter BP with an s.d. of 3, but I don't see the usefulness. Exact BP is neither identifying ("oh yeah, my friend Bob, with the BP of precisely 113/70 at all times"--not) nor is the exact value any more compromising than an approximate one.
Jitter booleans, as described in shovel#14. This task is made more complex by the fact that we should jitter these in a consistent way, i.e., the same record should get the same flips every time; one way to do that is sketched after the to-do list.
(Do we even get into this?) Landing page, legal boilerplate, checkboxes, and all that.
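For the viral-load to-do above, a minimal significant-digits helper (a sketch; where exactly the pipeline would apply it is left open):

```python
import math

def round_sig(x: float, sig: int = 2) -> float:
    """Round x to `sig` significant digits, e.g. round_sig(123456.7) -> 120000.0."""
    if x == 0:
        return 0.0
    # Shift the rounding position so only `sig` leading digits survive.
    return round(x, sig - 1 - math.floor(math.log10(abs(x))))
```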
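For the age to-do, a sketch of a one-off backfill jitter over already-extracted data. I don't know what noise model shovel actually applies to new records, so the s.d. and clipping range below are placeholders that would need to match it:

```python
import numpy as np

rng = np.random.default_rng()

def jitter_ages(ages: np.ndarray, sd: float = 2.0, lo: int = 0, hi: int = 90) -> np.ndarray:
    """Add Gaussian noise to each age, round to whole years, clip to range.
    sd/lo/hi are placeholder values, not shovel's actual parameters."""
    noisy = ages + rng.normal(0.0, sd, size=ages.shape)
    return np.clip(np.rint(noisy), lo, hi).astype(int)
```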
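For the boolean to-do, one way to get consistency is to make each flip decision deterministic per record and column, by hashing a private key together with some stable record key, so re-running the extraction reproduces the same flips. A sketch only (the flip probability, the key, and the choice of record key are all assumptions; shovel#14 has the actual spec):

```python
import hashlib

SECRET = b"replace-with-a-private-key"   # kept private so flips can't be undone
FLIP_PROB = 0.05                         # illustrative rate; needs discussion

def jitter_bool(value: bool, record_key: str, column: str) -> bool:
    """Flip `value` with probability FLIP_PROB, deterministically per
    (record_key, column): the same inputs always yield the same output."""
    digest = hashlib.sha256(
        SECRET + record_key.encode() + b"|" + column.encode()
    ).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return (not value) if u < FLIP_PROB else value
```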
Also, we should read the El book on the R: drive.
Also, we should look at the spreadsheet named counts.xlsx on my H: drive (the results of the count(*)... group by...).
See https://github.com/chhotii-alex/shovel/issues/14 and https://github.com/chhotii-alex/antigen-sensitivity/issues/25