In our recent call with Benedetto and Stinson from Census, they recommended reading Snoke and Slavković (2018), "pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity." This issue is to review the relevant parts of the paper on synthesis and evaluation.
## Synthesis
The paper notes that a synthesis is differentially private if it is built from differentially private parameters (in this case regression coefficients, if I understand correctly), and proposes an adaptation of other methods that sample from the distribution, relaxing a boundedness assumption. It cites Bowen and Liu (2018), "Comparative Study of Differentially Private Data Synthesis Methods," which I think would help me follow their approach.
Their synthesis approach appears to be limited to parametric models; if that's true, and Bowen and Liu are also limited to parametric models, other papers could be useful for our current nonparametric approaches:
## Evaluation
To evaluate the quality of the synthesis, they propose stacking the synthetic and training sets, building a model to predict whether a record is synthesized, and summarizing those predicted probabilities as distances from 0.5:

![image](https://user-images.githubusercontent.com/6076111/50540294-fbfeb980-0b43-11e9-82cc-7ac214c5be6d.png)
The idea of distinguishing synthesized data from real data is interesting, and they use a CART model to do so.
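A rough sketch of that propensity-score check, using sklearn's `DecisionTreeClassifier` as the CART stand-in. The function name `pmse`, the DataFrame inputs, and the depth cap are my choices, not the paper's specification.

```python
# Sketch of the stacked-data propensity evaluation (pMSE-style).
# `real` and `synthetic` are hypothetical DataFrames with matching
# numeric columns; the tree settings are arbitrary illustrations.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def pmse(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    # Stack original and synthesized records with an indicator label.
    stacked = pd.concat([real, synthetic], ignore_index=True)
    is_synth = np.repeat([0, 1], [len(real), len(synthetic)])
    # CART-style model predicting whether a record is synthesized.
    model = DecisionTreeClassifier(max_depth=5, random_state=0)
    model.fit(stacked, is_synth)
    propensity = model.predict_proba(stacked)[:, 1]
    # Mean squared distance of propensities from the synthetic share
    # (0.5 when the two sets are the same size); lower is better.
    c = len(synthetic) / len(stacked)
    return float(np.mean((propensity - c) ** 2))
```

A score near 0 means the classifier can't tell the sets apart; the maximum possible value here is 0.25.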
I'm not sure the novel metric is necessary compared to established classification metrics like log-loss, and this in-sample approach could also overfit. If we wanted to apply this, I'd consider log-loss on a holdout set.
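The holdout variant I have in mind would look roughly like this; again the names (`real`, `synthetic`) and the split/tree settings are illustrative assumptions, not anything from the paper.

```python
# Sketch: fit the real-vs-synthetic classifier on a training split and
# score log-loss on a held-out split, avoiding the in-sample optimism.
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def holdout_log_loss(real, synthetic, test_size=0.3, seed=0):
    stacked = pd.concat([real, synthetic], ignore_index=True)
    labels = np.repeat([0, 1], [len(real), len(synthetic)])
    X_tr, X_te, y_tr, y_te = train_test_split(
        stacked, labels, test_size=test_size, stratify=labels, random_state=seed
    )
    model = DecisionTreeClassifier(max_depth=5, random_state=seed)
    model.fit(X_tr, y_tr)
    # Clip probabilities so pure tree leaves can't produce infinite loss.
    proba = np.clip(model.predict_proba(X_te)[:, 1], 1e-6, 1 - 1e-6)
    # Low log-loss = easy to separate real from synthetic (bad synthesis);
    # values near ln(2) ≈ 0.693 suggest the sets are indistinguishable.
    return float(log_loss(y_te, proba))
```

One caveat: with a holdout, the score depends on the split, so averaging over a few seeds (or cross-validating) would be more stable.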