MaxGhenis opened 5 years ago
Here's what the TPC synthetic PUF paper says (emphasis mine):
Page 20:

> Our sequential regression synthesis methodology also protects against the most serious form of attribute disclosure (Burman et al. 2018). That is, while sampling from the estimated multivariate distribution may produce rare records with unusual combinations of attributes, they are extremely unlikely to be close to actual records. If by chance, synthetic records are close to actual records with rare attributes, the existence of the synthetic records provides virtually no information about whether such a combination actually exists.
That is, similarity to real records is acceptable because an intruder can't tell whether a given synthetic record reflects a real one: the general dissimilarity between real and synthetic records creates ambiguity even about the similar cases.
Page 30:

> The process of generating the synthetic data is designed to be disclosure proof, as discussed. However, we will apply various tests to guarantee that we haven’t inadvertently created disclosure risks. Our tests will be especially sensitive to the risk of attribute disclosure. For example, if only one tax unit has a particular combination of tax forms and schedules, the inclusion of such a return in the synthetic database could constitute evidence that the unique unit had filed a return. We will identify such cases if they exist and address them. For example, we might reduce the number of forms and schedules in the synthetic database until attribute disclosure is impossible.
This proposal seems similar to what I sketched out in https://github.com/donboyd5/synpuf/issues/5#issue-381836036, though the (real) PUF has much more uniqueness than the full universe of records, given that it's a sample.
Over email we've discussed how to ensure privacy between the base and synthetic PUF. @donboyd5 described an example of what to watch out for,* which I think boils down to this: the synthetic PUF should have few or zero records that are uniquely traceable to a single real PUF record.
This is related to the privacy concepts of k-anonymity and l-diversity, and the R package sdcMicro may be of some use (it's unclear whether a Python equivalent exists). However, I think our use case extends beyond these, since we're comparing a second file to the original. One formalization of the problem is as follows:

(1) would be very computationally expensive, since it would require looping through all 2^k combinations of variables. So I think whatever checks we come up with will have to be weaker than this ideal, but we can use it as a benchmark. For example, we could start with just pairs or triples of variables. We could also round numeric values to the nearest $100 or $1,000 to reduce the number of record-identifiers (though rounding could also cause more of the remaining record-identifiers to match the synthetic file).
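As a concrete starting point, here's a rough sketch of the pairs/triples version of that check: for each small subset of variables, flag value combinations that identify exactly one record in the real PUF and also appear in the synthetic file. The column names, rounding granularity, and pandas-based approach are all illustrative assumptions, not a final design.

```python
import itertools
import pandas as pd

def risky_combinations(real, synth, columns, max_vars=3, round_to=100):
    """Flag (variable subset, value combination) pairs where the combination
    identifies exactly one record in the real file and also appears in the
    synthetic file, after rounding numerics to reduce spurious uniqueness.
    Column names and round_to are hypothetical choices for illustration."""
    real_r = (real[columns] / round_to).round() * round_to
    synth_r = (synth[columns] / round_to).round() * round_to
    flagged = []
    for r in range(1, max_vars + 1):
        for subset in itertools.combinations(columns, r):
            # value combinations unique to a single real record
            counts = real_r.groupby(list(subset)).size()
            unique_keys = counts[counts == 1].index
            # all value combinations present in the synthetic file
            synth_keys = set(map(tuple, synth_r[list(subset)].itertuples(index=False)))
            for key in unique_keys:
                key_t = key if isinstance(key, tuple) else (key,)
                if key_t in synth_keys:
                    flagged.append((subset, key_t))
    return flagged
```

Exhausting all 2^k subsets is the expensive ideal; capping `max_vars` at 2 or 3 keeps this tractable, and `round_to` implements the rounding-to-$100 idea.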
This is a strict test because it implicitly assumes that each PUF record uniquely identifies a tax unit out of the universe of all tax units, which is probably only true for a small number of tax units. Without information on this I don't know what other assumptions we can make.
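On that uniqueness question, a quick k-anonymity-style tabulation could at least quantify how many PUF records are sample-unique on a given set of variables. This is only an illustrative pandas sketch; the choice of quasi-identifier columns is hypothetical.

```python
import pandas as pd

def record_k(df, quasi_identifiers):
    """For each record, the number of records (itself included) that share
    its exact combination of quasi-identifier values -- its 'k'."""
    return df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")

def unique_share(df, quasi_identifiers):
    """Share of records that are sample-unique (k == 1) on these variables."""
    return (record_k(df, quasi_identifiers) == 1).mean()
```

A high `unique_share` on the real PUF would justify the stricter real-vs-synthetic comparisons above, since sample uniqueness is a precondition for the kind of traceability @donboyd5 described.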
* Example from @donboyd5's email:
cc @feenberg