
Review Census Bureau approach to synthesis #20

MaxGhenis opened this issue 5 years ago

MaxGhenis commented 5 years ago

The Census Bureau has used synthetic data in the past, for example in producing a synthetic SIPP. I reviewed some of their materials; here are some highlights:

Benedetto and Stinson (2015)

This paper describes the synthesis methodology and the metrics used to assess disclosure risk. Given its relevance, I created a Google Docs copy with comments.

Synthesis

> We then employed regression-based multiple imputation to fill in the missing data of the GSF [raw data] to create four completed data sets, called implicates. These implicates are identical to the GSF except that missing data are replaced with independent draws from a probability distribution. We refer to these four datasets as the completed data. We then use the same modeling techniques to create 16 synthetic data sets or implicates (4 synthetic implicates per completed implicate). These are produced by setting all but two variables to missing for every record and then applying the above methods for replacing missing data with independent draws from the estimated probability distributions. The two variables left unsynthesized are gender and the first marital link observed in the SIPP. As was the case in version 5.1, these are the only variables in the SSB that contain actual data from any source.
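To make the quoted procedure concrete, here is a minimal sketch of sequential regression-based synthesis: every variable except gender and the marital link is regressed on the variables kept or synthesized so far, then replaced with an independent draw. The column names, the use of linear regression, and the normal-residual assumption are all my illustrative choices; the paper doesn't specify the actual model forms.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def synthesize(completed: pd.DataFrame,
               keep_cols=("gender", "marital_link"),  # illustrative names
               seed: int = 0) -> pd.DataFrame:
    """Replace every column except keep_cols with independent draws
    from a fitted conditional distribution (sequential synthesis)."""
    rng = np.random.default_rng(seed)
    synth = completed[list(keep_cols)].copy()
    for col in [c for c in completed.columns if c not in keep_cols]:
        # Fit on the columns synthesized so far, predict, then draw.
        model = LinearRegression().fit(synth, completed[col])
        pred = model.predict(synth)
        resid_sd = (completed[col] - pred).std()
        # Independent draw, here approximated as normal around the prediction.
        synth[col] = pred + rng.normal(0.0, resid_sd, size=len(completed))
    return synth

# Four synthetic implicates per completed implicate (16 total):
# synthetic = [synthesize(imp, seed=s) for imp in completed_implicates for s in range(4)]
```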

Disclosure risk

  1. Block the raw and synthetic datasets into groups of roughly 10,000 observations each, based on gender and marital status plus either random sampling or some other blocking variables (unclear from the paper).
  2. Calculate the distance between every raw-synthetic pair within a block (~10,000² pairs), using both Euclidean and Mahalanobis distance.
  3. Determine the share of cases in which the closest match to a raw record was a "true" match (see the sketch after this list). At most this was 0.26% of cases, which improved upon the second-best match by at most 50%.
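A minimal sketch of step 3, assuming the blocked raw and synthetic files are numeric arrays whose rows are aligned so that row *i* of each refers to the same underlying person; the variable names are illustrative, not from the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist

def true_match_rate(raw: np.ndarray, synth: np.ndarray,
                    metric: str = "euclidean", **kwargs) -> float:
    """Share of raw records whose nearest synthetic neighbor is the
    record synthesized from them (rows assumed aligned)."""
    d = cdist(raw, synth, metric=metric, **kwargs)  # all ~10,000^2 pairwise distances
    nearest = d.argmin(axis=1)                      # closest synthetic record per raw record
    return float((nearest == np.arange(len(raw))).mean())

# Euclidean:
# rate = true_match_rate(raw_block, synth_block)
# Mahalanobis needs the inverse covariance of the pooled data:
# vi = np.linalg.inv(np.cov(np.vstack([raw_block, synth_block]), rowvar=False))
# rate = true_match_rate(raw_block, synth_block, metric="mahalanobis", VI=vi)
```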

[screenshot: match-rate results from the paper]

Thoughts

Since we're not starting from real records the way they are, we don't have a "true" mapping on which to base disclosure risk metrics. But the idea of quantifying the likelihood of correctly inferring data from what's available is intriguing, and could help move from less-interpretable distance metrics to more-interpretable probabilities; one rough sketch follows.
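Purely illustrative: one way to get an interpretable probability is to convert distances into attacker-guess probabilities with a softmax, so each synthetic record gets a distribution over candidate raw records. The softmax form and the temperature parameter are my assumptions, not anything from the Census papers.

```python
import numpy as np
from scipy.spatial.distance import cdist

def match_probabilities(synth: np.ndarray, raw: np.ndarray,
                        temp: float = 1.0) -> np.ndarray:
    """Row i: a probability distribution over raw records for synthetic
    record i, putting more mass on closer (more plausible) matches."""
    d = cdist(synth, raw)                        # Euclidean distances
    logits = -d / temp                           # closer => higher score
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)      # rows sum to 1

# p.max(axis=1) can then be read as "probability the attacker's single best
# guess is correct," under this (strong) model of attacker behavior.
```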

Jarmin, Louis, and Miranda (2014)

This lacks specifics that would be useful for our project.

If anyone knows more about what the Census Bureau does, please share here.

donboyd5 commented 5 years ago

Extremely helpful and very much on point. I have put it in our Zotero group (see #2). My main question is what a "true" match means for them. I will try to add questions to the Google Doc.

donboyd5 commented 5 years ago

There is a lot of disclosure-related commentary on Twitter, particularly from:

- https://twitter.com/ianschmutte
- https://twitter.com/larsvil
- https://twitter.com/john_abowd

donboyd5 commented 5 years ago

Of possible interest, in Zotero:

[screenshot of Zotero entries]

MaxGhenis commented 5 years ago

Benedetto, Stinson, and Abowd (2013), *The Creation and Use of the SIPP Synthetic Beta*, has a bit more detail on the process, but doesn't address the key question of whether the concept of a "true" match between synthetic and actual files is meaningful when they share only gender and marital status. Here's a Google Doc if you want to comment. More: