chhotii-alex closed 1 year ago
If we are going to offer a download of just the viral loads in each group (no other annotations per row, but the user will know which query parameters apply to the set of viral loads), the number of rows in that download reveals the EXACT size of the group. This nullifies what we've done to obscure the exact group size (jittering, rounding to n significant digits).
The problem with revealing exact group sizes is that it may reveal the existence of an individual with an exact set of features. We do not report anything about the results of queries that return fewer than 4 rows for this exact reason. However, let's say that the user deduces that the number of results with features x, y, and z is n (where n > 4), and that the number of results with features x, y, z, and ~a is (n-1). The user then knows that there exists exactly one row with features x, y, z, and a. What if that set of features matches our hypothetical Bob?
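To make the differencing concern concrete, here is a minimal sketch with made-up rows. The `reported_count` helper, the feature names, and the toy data are hypothetical illustrations, not the project's code; only the "fewer than 4 rows" suppression rule comes from the discussion above.

```python
MIN_GROUP_SIZE = 4  # queries returning fewer rows than this are suppressed

def reported_count(rows, **features):
    """Return the exact match count, or None if the group is too small to report."""
    matches = [r for r in rows if all(r[k] == v for k, v in features.items())]
    return len(matches) if len(matches) >= MIN_GROUP_SIZE else None

# Toy data (hypothetical): six rows with ~a, plus one row with a ("Bob").
rows = [{"x": True, "y": True, "z": True, "a": False} for _ in range(6)]
rows.append({"x": True, "y": True, "z": True, "a": True})

n = reported_count(rows, x=True, y=True, z=True)                  # 7
n_not_a = reported_count(rows, x=True, y=True, z=True, a=False)   # 6
direct = reported_count(rows, x=True, y=True, z=True, a=True)     # None (suppressed)

# The direct query is suppressed, but subtraction recovers the group size anyway.
inferred = n - n_not_a  # 1
```

The suppressed query returns nothing, yet two individually permitted exact counts are enough to infer that one matching row exists.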
Suppose we drop some number of rows, between 5 and 15 (random, but consistent from run to run), from each downloaded dataset. The user sees n rows with features x, y, and z, and m rows with features x, y, z, and ~a. The true number of rows with (x,y,z) is between n+5 and n+15, and the true number of rows with (x,y,z,~a) is between m+5 and m+15. Thus the number of rows with features (x,y,z,a) must be between (n+5)-(m+15) = (n-m)-10 and (n+15)-(m+5) = (n-m)+10, which is a pretty wide range. Does this give any information on the size of (x,y,z,a)?
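A rough sketch of the idea, under stated assumptions: `drop_count` derives a repeatable drop count from a hash of the query parameters plus a secret (one possible way to get "random but consistent from run to run"; the function name, the secret, and the query-key format are all hypothetical), and `difference_bounds` reproduces the interval arithmetic above.

```python
import hashlib

def drop_count(query_key, secret="hypothetical-secret", low=5, high=15):
    """Deterministic pseudo-random drop count in [low, high] for a given query."""
    digest = hashlib.sha256((secret + "|" + query_key).encode()).digest()
    return low + int.from_bytes(digest[:8], "big") % (high - low + 1)

def difference_bounds(n_observed, m_observed, min_drop=5, max_drop=15):
    """Bounds on the true size of (x,y,z,a), given observed download sizes
    n for (x,y,z) and m for (x,y,z,~a)."""
    low = (n_observed + min_drop) - (m_observed + max_drop)   # (n-m)-10 by default
    high = (n_observed + max_drop) - (m_observed + min_drop)  # (n-m)+10 by default
    return max(low, 0), high  # a group size cannot be negative

bounds = difference_bounds(120, 100)  # (10, 30)
```

With these defaults the user is left with an interval 20 wide around n-m. This sketch only covers a single split; it does not answer whether combining several splits would narrow the interval.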
Thinking about whether doing multiple splits would constrain the jitter enough that the user could puzzle out their exact values.
Discussed that we should NULL out a small percentage of each boolean (on upload) so that it's impossible to constrain the group sizes by doing splits. We don't need to both do this and jitter the booleans.
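One way the nulling could look, as a sketch only: the `null_out_booleans` function, the 1% default fraction, and the fixed seed are assumptions for illustration, not the actual upload code.

```python
import random

def null_out_booleans(rows, boolean_columns, fraction=0.01, seed=0):
    """Return copies of rows with roughly `fraction` of each boolean column set to None."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    result = [dict(r) for r in rows]  # copy, so the input rows are not modified
    for column in boolean_columns:
        for row in result:
            if rng.random() < fraction:
                row[column] = None
    return result

# Toy rows (hypothetical): after nulling, count(a) + count(~a) may be less than the total.
rows = [{"a": i % 2 == 0} for i in range(1000)]
nulled = null_out_booleans(rows, ["a"])
with_a = sum(1 for r in nulled if r["a"] is True)
without_a = sum(1 for r in nulled if r["a"] is False)
unknown = sum(1 for r in nulled if r["a"] is None)
```

Because some rows now match neither a nor ~a, the two halves of a split no longer have to sum to the parent group, which is what makes the subtraction argument unreliable.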
Working on nulling out some of the booleans.
Remaining variables that do perfect splits: sex, age, vax status, race, outcome. Worry about these?
Null up to 1% (rather than up to 12)?
Smallest groups?
Boolean?