In our recent call with Benedetto and Stinson from Census, they recommended reading Snoke and Slavković (2018), "pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity." This issue is to review the relevant parts of the paper on synthesis and evaluation.
## Synthesis
The paper notes that a synthesis is differentially private if it is built from differentially private parameters (in this case regression coefficients, if I understand correctly), and proposes an adaptation of other methods that sample from the distribution, relaxing a boundedness assumption. It cites Bowen and Liu (2018), "Comparative Study of Differentially Private Data Synthesis Methods," which I think would help me follow their approach.
Their synthesis approach appears to be limited to parametric models; if that's true, and Bowen and Liu are also limited to parametric models, other papers could be useful for our current nonparametric approaches:
## Evaluation
To evaluate the quality of the synthesis, they propose stacking the synthetic and training sets, building a model to predict whether a record is synthesized, and summarizing those predicted probabilities as distances from 0.5:

![image](https://user-images.githubusercontent.com/6076111/50540294-fbfeb980-0b43-11e9-82cc-7ac214c5be6d.png)
The idea of distinguishing synthesized data from real data is interesting, and they use a CART model to do so.
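A rough sketch of that propensity-score check, using sklearn's `DecisionTreeClassifier` as the CART stand-in. The function name `pmse`, the DataFrame inputs, and the depth cap are my choices, not the paper's specification.

```python
# Sketch of the stacked-data propensity evaluation (pMSE-style).
# `real` and `synthetic` are hypothetical DataFrames with matching
# numeric columns; the tree settings are arbitrary illustrations.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def pmse(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    # Stack original and synthesized records with an indicator label.
    stacked = pd.concat([real, synthetic], ignore_index=True)
    is_synth = np.repeat([0, 1], [len(real), len(synthetic)])
    # CART-style model predicting whether a record is synthesized.
    model = DecisionTreeClassifier(max_depth=5, random_state=0)
    model.fit(stacked, is_synth)
    propensity = model.predict_proba(stacked)[:, 1]
    # Mean squared distance of propensities from the synthetic share
    # (0.5 when the two sets are the same size); lower is better.
    c = len(synthetic) / len(stacked)
    return float(np.mean((propensity - c) ** 2))
```

A score near 0 means the classifier can't tell the sets apart; the maximum possible value here is 0.25.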
I'm not sure the novel metric is necessary compared to established classification metrics like log-loss, and this in-sample approach could also overfit. If we wanted to apply this, I'd consider log-loss on a holdout set.
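The holdout variant I have in mind would look roughly like this; again the names (`real`, `synthetic`) and the split/tree settings are illustrative assumptions, not anything from the paper.

```python
# Sketch: fit the real-vs-synthetic classifier on a training split and
# score log-loss on a held-out split, avoiding the in-sample optimism.
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def holdout_log_loss(real, synthetic, test_size=0.3, seed=0):
    stacked = pd.concat([real, synthetic], ignore_index=True)
    labels = np.repeat([0, 1], [len(real), len(synthetic)])
    X_tr, X_te, y_tr, y_te = train_test_split(
        stacked, labels, test_size=test_size, stratify=labels, random_state=seed
    )
    model = DecisionTreeClassifier(max_depth=5, random_state=seed)
    model.fit(X_tr, y_tr)
    # Clip probabilities so pure tree leaves can't produce infinite loss.
    proba = np.clip(model.predict_proba(X_te)[:, 1], 1e-6, 1 - 1e-6)
    # Low log-loss = easy to separate real from synthetic (bad synthesis);
    # values near ln(2) ≈ 0.693 suggest the sets are indistinguishable.
    return float(log_loss(y_te, proba))
```

One caveat: with a holdout, the score depends on the split, so averaging over a few seeds (or cross-validating) would be more stable.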