chhotii-alex / antigen-sensitivity

MIT License

patient privacy issues involved in offering download #26

Open chhotii-alex opened 1 year ago

chhotii-alex commented 1 year ago

See https://github.com/chhotii-alex/shovel/issues/14 and https://github.com/chhotii-alex/antigen-sensitivity/issues/25

This is much more daunting. See the notes on the previous ticket on what looking at MIMIC suggests.

Also, I did a count(*)... group by... grouping on almost all the columns (demographics, comorbidities, treatments) to see how many records are unique in those dimensions, and a shockingly large proportion of the records are. Some mathematical reasoning shows that we shouldn't have been surprised by this. There are 17 comorbidity categories in the checkboxes and 34 finer-grained comorbidity categories in the spreadsheet columns. Since each category is either present or absent, that gives 2^17 = 131,072 and 2^34 = 17,179,869,184 possible combinations, respectively. Even if we ignore the comorbidities and only consider some of the demographics: if we have 4 locations, 91 ages, 2 sexes, 5 ethnicities, 5 SES bins, and 4 history-of-smoking types, there are 4 × 91 × 2 × 5 × 5 × 4 = 72,800 possible combinations of just those values, which is larger than the number of results we have.
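For anyone who wants to double-check the arithmetic, here is a minimal Python sketch. The category counts are the ones quoted in this comment; the dictionary keys and variable names are made up for illustration and are not the actual column names.

```python
from math import prod

# Each comorbidity category is treated as a yes/no flag,
# so n categories give 2**n possible patterns.
checkbox_patterns = 2 ** 17
spreadsheet_patterns = 2 ** 34

# Number of distinct values per demographic, as quoted in the comment.
# The key names are illustrative only.
demographics = {
    "location": 4,
    "age": 91,
    "sex": 2,
    "ethnicity": 5,
    "ses_bin": 5,
    "smoking_history": 4,
}
demographic_combinations = prod(demographics.values())

print(f"{checkbox_patterns:,}")         # 131,072
print(f"{spreadsheet_patterns:,}")      # 17,179,869,184
print(f"{demographic_combinations:,}")  # 72,800
```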

It can be imagined that people may identify—or think that they have identified—specific individuals in this dataset based on these combinations of features. Very bad, given the sensitivity of the issue. ("This must be Bob! Damnit, he said his test was negative. I thought he looked flushed at the dinner party. Grr...")
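A rough sketch of the same kind of count(*)... group by... check in Python, in case we want to rerun it on different subsets of columns. The function name, column names, and toy rows are all hypothetical and not taken from our data; this is only one way to count records that are unique on a chosen set of features.

```python
from collections import Counter

def group_size_summary(records, quasi_identifiers):
    """Summarize how many records share each combination of the given columns."""
    # Build one key per record from the selected columns.
    keys = [tuple(rec[col] for col in quasi_identifiers) for rec in records]
    sizes = Counter(keys)
    # A record is "unique" if no other record has the same combination.
    unique = sum(1 for key in keys if sizes[key] == 1)
    return {
        "records": len(keys),
        "distinct_combinations": len(sizes),
        "unique_records": unique,
        "smallest_group": min(sizes.values()) if sizes else 0,
    }

# Made-up toy rows for illustration only.
toy_rows = [
    {"location": "A", "age": 34, "sex": "F", "smoking": "never"},
    {"location": "A", "age": 34, "sex": "F", "smoking": "never"},
    {"location": "B", "age": 71, "sex": "M", "smoking": "former"},
    {"location": "C", "age": 52, "sex": "F", "smoking": "current"},
]

summary = group_size_summary(toy_rows, ["location", "age", "sex", "smoking"])
print(summary)
```

A smallest group of size 1 means at least one record is the only one with its combination of values, which is the situation described above.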

Maybe the way forward, if we want to be conservative about this like MIMIC, is to (like them) give the dataset over to PhysioNet, and have PhysioNet do the vetting of the potential audience. Needs Discussion.

Anyway, possible specific to-do's:

chhotii-alex commented 1 year ago

Also, reading the El book on the R: drive.

Also, we should look at the spreadsheet named counts.xlsx in my H: drive (the results of the count(*)... group by...).