diffix / explorer

Tool to automatically explore and generate stats on data anonymized using Diffix
MIT License

Different generation of sample values in sample table and per column? #343

Closed sebastian closed 3 years ago

sebastian commented 3 years ago

When looking at the results of the Clinic dataset I notice that the sample table is missing values for many columns that actually do have sample values in the per-column data. Why is this?

dandanlen commented 3 years ago

Not quite sure what data in particular you have in mind but I have been looking at column smoker in table visits.

The column data has a single entry, { "count": 3, "value": 2 }. I think that when the correlations come into play, they cause the value to be suppressed. Increasing the number of columns per grouping reduces the average count in each bucket, so buckets that were already on the verge of being suppressed in a one-dimensional query are likely to be suppressed in the two-dimensional query. I guess this doesn't even depend much on correlations: even a change in the noise factor could cause the count to dip below the low-count threshold.
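To illustrate the effect, here is a minimal sketch in Python. It assumes a fixed low-count threshold of 3 (the real filter is noisy, so this is a simplification) and a hypothetical second column `clinic` that is not taken from the actual dataset.

```python
from collections import Counter

# Hypothetical fixed threshold; the real low-count filter uses a noisy threshold.
LOW_COUNT_THRESHOLD = 3


def visible_buckets(rows, columns, threshold=LOW_COUNT_THRESHOLD):
    """Group rows by the given columns and drop buckets below the threshold."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return {key: n for key, n in counts.items() if n >= threshold}


# Three rows share smoker == 2 but differ in the (made-up) clinic column.
rows = [
    {"smoker": 2, "clinic": "A"},
    {"smoker": 2, "clinic": "B"},
    {"smoker": 2, "clinic": "C"},
]

one_dim = visible_buckets(rows, ["smoker"])            # bucket of size 3 survives
two_dim = visible_buckets(rows, ["smoker", "clinic"])  # three buckets of size 1, all dropped
```

The single-column grouping keeps the value 2 with count 3, while the two-column grouping splits the same rows into buckets that are each too small to be reported.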

Here are a couple of ways to mitigate this:

  1. Decide beforehand on some kind of threshold for the correlation factor below which we don't consider column correlations at all.
  2. Include the possibility of generating a * value when generating correlated samples. For any * values, we generate a value from the single-column distributions instead.

I think 2. is the better solution since it's not clear what the right threshold level would be in various situations. It can also be generalised to more than 2 columns:

  1. Generate a correlated sample value based on N columns
  2. If it returns *, split the sample generation into M and K columns where M + K == N based on the correlation factors of every possible combination of M and K.
  3. If either of these also returns *, repeat step 2 for that group.
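The steps above could be sketched roughly as follows. This is only an illustration, not the explorer's implementation: `draw` and `split_score` are hypothetical callables supplied by the caller. `draw(columns)` is assumed to return a tuple of correlated values for those columns, or `*` when the bucket is suppressed, and `split_score(m, k)` is assumed to rank a candidate split using the correlation factors.

```python
from itertools import combinations

SUPPRESSED = "*"


def bipartitions(columns):
    """Yield every split of columns into two non-empty groups (M, K)."""
    first, rest = columns[0], columns[1:]
    # Keeping `first` in M avoids yielding each split twice (as (M, K) and (K, M)).
    for size in range(len(rest) + 1):
        for chosen in combinations(rest, size):
            m = (first,) + chosen
            k = tuple(c for c in rest if c not in chosen)
            if k:
                yield m, k


def sample_with_fallback(columns, draw, split_score):
    """Draw a correlated sample, splitting into smaller groups on suppression."""
    columns = tuple(columns)
    values = draw(columns)
    if values != SUPPRESSED:
        return dict(zip(columns, values))
    if len(columns) == 1:
        # Nothing left to split: even the single-column value is suppressed.
        return {columns[0]: SUPPRESSED}
    m, k = max(bipartitions(columns), key=lambda mk: split_score(*mk))
    result = sample_with_fallback(m, draw, split_score)
    result.update(sample_with_fallback(k, draw, split_score))
    return result


# Toy usage: the 3-column bucket is suppressed, but (a, b) and (c,) are not.
available = {("a", "b"): (1, 2), ("c",): (3,)}
example = sample_with_fallback(
    ("a", "b", "c"),
    draw=lambda cols: available.get(cols, SUPPRESSED),
    split_score=lambda m, k: 1.0 if m == ("a", "b") else 0.0,
)
```

With N = 2 this reduces to the two-column case: if the pair is suppressed, each column falls back to its single-column distribution.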

There is also room for improvement when calculating the correlation factor. Currently it's based on a very coarse calculation that only considers how many buckets are non-suppressed. A more rigorous approach would involve calculating a matrix of prior expectation values based on the single-column distributions, E(A)*E(B), comparing this to the matrix of posterior probabilities generated from the multi-column queries, E(AB), and taking the square root of the sum of the squared differences: SQRT( SUM_ALL( (E(AB) - E(A)*E(B))^2 ) ), or something like this. Again, this can potentially be generalised to multiple dimensions.
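As a rough sketch of that calculation for two columns, assuming the inputs are plain bucket counts and that a suppressed two-column bucket is simply treated as a count of zero (both are assumptions for illustration, and the function name is made up):

```python
from math import sqrt


def independence_distance(joint_counts, a_counts, b_counts):
    """Return sqrt(sum((P(AB) - P(A) * P(B)) ** 2)) over all value pairs.

    joint_counts maps (a, b) pairs to counts from the two-column query;
    a_counts and b_counts map single values to counts from the
    single-column queries. Missing pairs are treated as a count of zero.
    """
    total_joint = sum(joint_counts.values())
    total_a = sum(a_counts.values())
    total_b = sum(b_counts.values())
    if not (total_joint and total_a and total_b):
        return 0.0  # No usable data, so report no measurable correlation.

    squared = 0.0
    for a, count_a in a_counts.items():
        for b, count_b in b_counts.items():
            expected = (count_a / total_a) * (count_b / total_b)
            observed = joint_counts.get((a, b), 0) / total_joint
            squared += (observed - expected) ** 2
    return sqrt(squared)


a_counts = {"x": 50, "y": 50}
b_counts = {"p": 50, "q": 50}

# Independent columns: every pair is equally likely.
independent = independence_distance(
    {("x", "p"): 25, ("x", "q"): 25, ("y", "p"): 25, ("y", "q"): 25},
    a_counts,
    b_counts,
)

# Fully dependent columns: x always goes with p, y always with q.
dependent = independence_distance({("x", "p"): 50, ("y", "q"): 50}, a_counts, b_counts)
```

A value near zero means the two-column counts look like the product of the single-column distributions; larger values indicate a stronger dependence. Noise and suppression in the anonymized counts would blur this in practice.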

sebastian commented 3 years ago

I think solution 2 is good. Generally it looks better to have some values than no values. That is, if there aren't any values for 3-column correlations, then using 2-column correlations to get some value is better than nothing. And if nothing is available for 2-column correlations, then the single-column base case is again better than nothing, and so on.

sebastian commented 3 years ago

We decided this might be a lot of work. Let's leave this as a task that can be taken on if time permits, but that we will otherwise ignore for the time being.