Compare unique records from base to synthetic file to ensure privacy

Over email we've discussed how to ensure privacy between the base and synthetic PUF. @donboyd5 described an example of what to watch out for* which I think boils down to: The synthetic PUF should have few or zero records which are uniquely traceable to a single real PUF record.

This is related to the privacy concepts of k-anonymity and l-diversity, and the R package sdcMicro may be of some use (it's unclear if a Python equivalent exists). However, I think our use case extends beyond these, since we're comparing a second file to the original. One formalization of the problem is as follows:

1. For each record in the PUF, identify all combinations of variables by which that record is unique. Let's call these combinations of uniquely identifying variables record-identifiers (made up, there's probably a better name for these). There are considerably more record-identifiers than records, since each record is probably unique by the whole record, as well as some combinations of variables.
2. Check if any of these record-identifiers exist in the synthetic PUF.
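To make the two steps concrete, here is a minimal brute-force sketch on a toy file. It is only an illustration of the idea, not a proposed implementation: the records are made up, `wages` is a hypothetical column name, and the function names are invented for this example. A "record-identifier" is stored as a pair of (variable names, values).

```python
from collections import Counter
from itertools import combinations


def value_combos(records, variables):
    """Yield (variable names, values) for every non-empty subset of variables."""
    for size in range(1, len(variables) + 1):
        for combo in combinations(variables, size):
            for rec in records:
                yield combo, tuple(rec[v] for v in combo)


def record_identifiers(base, variables):
    """Step 1: value combinations that occur in exactly one base record."""
    counts = Counter(value_combos(base, variables))
    return {key for key, n in counts.items() if n == 1}


def leaked_identifiers(base, synth, variables):
    """Step 2: record-identifiers that also appear in the synthetic file."""
    return record_identifiers(base, variables) & set(value_combos(synth, variables))


# Made-up toy data; real files have far more rows and columns.
variables = ["MARS", "XTOT", "wages"]
base = [
    {"MARS": 2, "XTOT": 4, "wages": 300},
    {"MARS": 2, "XTOT": 4, "wages": 50},
    {"MARS": 1, "XTOT": 1, "wages": 50},
]
synth = [
    {"MARS": 1, "XTOT": 4, "wages": 300},
    {"MARS": 2, "XTOT": 4, "wages": 75},
]

ids = record_identifiers(base, variables)
leaked = leaked_identifiers(base, synth, variables)
for combo, values in sorted(leaked):
    print(dict(zip(combo, values)))
```

Even with three records and three variables there are more record-identifiers than records. In a file this small a single categorical value such as `MARS == 1` counts as unique, which would rarely be true in the real PUF.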

Step 1 would be very computationally expensive, since it would require looping through all 2^k combinations of the k variables. So I think whatever checks we come up with will have to be weaker than this, but we can compare to this ideal. For example, we could start with just pairs or triples of variables. We could also consider rounding numerics to the nearest $100 or $1,000 to reduce the number of record-identifiers (though this could also result in more matches to the synthetic file from the remaining record-identifiers).
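A rough sketch of the weaker check, assuming we cap the size of the variable combinations and optionally round numeric variables first. The data, the column names `wages` and `charity`, the function names, and the choice of k = 20 for the cost comparison are all hypothetical. Note that Python's `round` sends exact halves to the nearest even number, which may or may not be the rounding rule we want.

```python
from collections import Counter
from itertools import combinations
from math import comb


def round_record(rec, numeric_vars, round_to):
    """Round the chosen numeric variables to the nearest multiple of round_to."""
    return {
        k: round(v / round_to) * round_to if k in numeric_vars else v
        for k, v in rec.items()
    }


def bounded_matches(base, synth, variables, max_size, numeric_vars=(), round_to=None):
    """Base-unique value combinations of at most max_size variables found in synth."""
    if round_to:
        base = [round_record(r, numeric_vars, round_to) for r in base]
        synth = [round_record(r, numeric_vars, round_to) for r in synth]
    matches = set()
    for size in range(1, max_size + 1):
        for combo in combinations(variables, size):
            counts = Counter(tuple(r[v] for v in combo) for r in base)
            unique = {vals for vals, n in counts.items() if n == 1}
            for r in synth:
                vals = tuple(r[v] for v in combo)
                if vals in unique:
                    matches.add((combo, vals))
    return matches


# Cost comparison for a hypothetical k = 20 variables.
k = 20
all_subsets = 2**k - 1                                   # every non-empty subset
up_to_triples = sum(comb(k, s) for s in range(1, 4))     # singles, pairs, triples

# Made-up toy data.
variables = ["MARS", "wages", "charity"]
base = [
    {"MARS": 2, "wages": 300_123, "charity": 50_789},
    {"MARS": 2, "wages": 300_456, "charity": 10_200},
    {"MARS": 1, "wages": 40_100, "charity": 300},
]
synth = [
    {"MARS": 2, "wages": 300_456, "charity": 50_789},
    {"MARS": 1, "wages": 40_400, "charity": 900},
]

exact = bounded_matches(base, synth, variables, max_size=2)
rounded = bounded_matches(
    base, synth, variables, max_size=2,
    numeric_vars={"wages", "charity"}, round_to=1_000,
)
print(all_subsets, up_to_triples, len(exact), len(rounded))
```

In this toy case rounding removes some record-identifiers (the two wage values near $300,000 collapse into one) but also creates new matches (the synthetic $40,400 now matches the base $40,100), so the set of matches changes in both directions.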

This is a strict test because it implicitly assumes that each PUF record uniquely identifies a tax unit out of the universe of all tax units, which is probably only true for a small number of tax units. Without information on this I don't know what other assumptions we can make.

* Example from @donboyd5's email:

Suppose we have a record in the PUF that might be an unusual and identifiable real person (even though SOI tries hard not to allow that and my example almost certainly cannot happen). But anyway. And suppose we have a few values such as:

- $300,123,000 in wages
- MARS==2 -- married joint
- XTOT==4 -- 2 kids
- -$100,654,000 in business losses
- $50,789,000 in charitable contributions
- $60,456,000 in state and local taxes

Somewhere there is a headline, "Married New York investment banker gets $300 million bonus, invests it in business and loses $100m, but still gives $50m to the Metropolitan Opera while his 2 kids go hungry." Ok, hokey, but the point is, we know who this person is. Somehow they slipped by the SOI PUF-creation non-disclosure procedures. Suppose further that if we have any 4 of these 6 tax-return items on a record we could identify this individual (even though other record items might be incorrect).

Suppose that we now create a synthetic PUF and we use CART for these variables and have it set so that when we get to the final branch and all we have are leaves, it randomly selects a leaf from actual leaves. That is, when it is predicting wages, it has (perhaps) sorted all of the high wage people across whatever the predictor variables are and chooses one of the actual wage values rather than a mean or other prediction. In doing this, synthpop manages to construct a single record that actually has wages of $300,123,000, business income of -$100,654,000, charitable contributions of $50,789,000, and state and local taxes of $60,456,000.

This raises the almost metaphysical question of whether we have a disclosure. By my definition, if we have any 4 of the 6 items above from this particular individual we can identify the individual. Let's come back to this. Maybe we argue that this is not a disclosure. But suppose SOI insists this would be a disclosure.
Dan's question (Dan, please weigh in) is, "It is my understanding that each value in the synthetic dataset comes from a value in the PUF. Can we assure Barry that no synthetic record has more than N values from the same taxpayer? What would N be? Can we place restrictions on N?" Because we used CART with defaults, we are choosing actual leaves rather than predicted means or something else, so his premise is satisfied.

The question (if we don't change that default), given the rules of this example, can we assure Barry that no record has more than 3 values from the same taxpayer. And more generally, can we assure Barry that no record has more than 3 (or whatever). In this case, we would fail that test.

So, just to elaborate on Dan's question, (1) can we, during the prediction process (a) set that N, and (b) if not, use methods that make it likely that N will be small, at least in areas of the joint distribution where large N is likely to create an identification issue?, and (2) after the fact (after synthesis), can we check for N (seems computationally expensive but variants may be efficient) and take corrective action where needed? What other post-synthesis tests should we be thinking about?

Of course, we can also argue that large N does not necessarily mean we created a real person. Perhaps in my example, while we got 4 money items correct, our fictitious record has MARS==1 and XTOT==1, so in other important ways this record doesn't look at all like the real person - we created a single investment banker with no kids. So perhaps we should take dissimilarity into account as well as similarity.
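One naive way to run the after-the-fact check for N from the email is sketched below. For each synthetic record it finds the base record sharing the most exact values, then flags the synthetic record if that count exceeds a chosen limit. The column names other than MARS and XTOT, the second record in each file, and the function names are made up for illustration. The cost is roughly (synthetic rows) x (base rows) x (variables), so this would need a smarter variant on the full files. Counting every exact match is also crude, since common values such as 0 inflate N.

```python
def most_similar_base_record(synth_rec, base, variables):
    """Return (N, index) for the base record sharing the most exact values."""
    best_n, best_idx = 0, None
    for i, base_rec in enumerate(base):
        n = sum(synth_rec[v] == base_rec[v] for v in variables)
        if n > best_n:
            best_n, best_idx = n, i
    return best_n, best_idx


def flag_records(base, synth, variables, max_n):
    """List (synth index, base index, shared, differing) where shared > max_n."""
    flagged = []
    for j, synth_rec in enumerate(synth):
        n, i = most_similar_base_record(synth_rec, base, variables)
        if n > max_n:
            # Report the differing count too, so dissimilarity can be weighed.
            flagged.append((j, i, n, len(variables) - n))
    return flagged


# Column names besides MARS and XTOT are hypothetical.
variables = ["wages", "MARS", "XTOT", "business", "charity", "salt"]
base = [
    {"wages": 300_123_000, "MARS": 2, "XTOT": 4,
     "business": -100_654_000, "charity": 50_789_000, "salt": 60_456_000},
    {"wages": 85_000, "MARS": 1, "XTOT": 1,
     "business": 0, "charity": 1_200, "salt": 6_300},
]
synth = [
    # The email's synthetic record: four money items right, MARS and XTOT wrong.
    {"wages": 300_123_000, "MARS": 1, "XTOT": 1,
     "business": -100_654_000, "charity": 50_789_000, "salt": 60_456_000},
    {"wages": 85_000, "MARS": 2, "XTOT": 3,
     "business": 0, "charity": 900, "salt": 7_100},
]

flagged = flag_records(base, synth, variables, max_n=3)
print(flagged)
```

With the limit set to 3, the first synthetic record is flagged with 4 shared values and 2 differing ones, which matches the case described in the email. The differing count is one simple way to bring dissimilarity into the decision.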

cc @feenberg

donboyd5 / synpuf
