alan-turing-institute / QUIPP-collab

Collaboration on the QUIPP project

Build a privacy metric #60

Closed: crangelsmith closed this issue 3 years ago

gmingas commented 4 years ago

I have been reading a bit about differential privacy (DP) as a possible privacy metric and here are my thoughts:
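(For context, a minimal hypothetical sketch of how epsilon acts as the privacy knob in DP, via the Laplace mechanism; this is illustrative only, not QUIPP code.)

```python
# Minimal, hypothetical sketch (not QUIPP code) of epsilon as a privacy
# knob: the Laplace mechanism satisfies epsilon-differential privacy.
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release a noisy query answer satisfying epsilon-DP.

    sensitivity: the max change in the query output if one record is
    added or removed. Smaller epsilon -> larger noise -> more privacy.
    """
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(scale=sensitivity / epsilon)

# e.g. a counting query (sensitivity 1) released at epsilon = 0.5
print(laplace_mechanism(true_value=1000, sensitivity=1, epsilon=0.5))
```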

gmingas commented 4 years ago

Also, I found this project, which was unknown to me. Has anybody heard of it? https://www.turing.ac.uk/research/research-projects/evaluating-privacy-preserving-generative-models-wild

martintoreilly commented 4 years ago

> Also, I found this project, which was unknown to me. Has anybody heard of it? https://www.turing.ac.uk/research/research-projects/evaluating-privacy-preserving-generative-models-wild

Yes, we mentioned it in our proposal. This is Adria's project. We should catch up with him soon about this (and check if it's the same as the work his postdoc is doing).

gmingas commented 4 years ago

Another possible metric: disclosure risk measures. See Section 3 of this paper for a description and an example: http://www2.stat.duke.edu/~jerry/Papers/PSD08.pdf
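To make the idea concrete, here is a crude hypothetical sketch of a matching-based risk measure of the kind that paper discusses; the function and column names are stand-ins, not from QUIPP or the paper.

```python
# Crude, hypothetical sketch of a matching-based disclosure risk measure,
# loosely in the spirit of the measures in the linked paper.
import pandas as pd

def expected_match_risk(original: pd.DataFrame, released: pd.DataFrame,
                        quasi_identifiers: list) -> float:
    """Average over original records of 1/c, where c is the number of
    released records matching on the quasi-identifiers; fewer matches
    mean an intruder's guess is more likely to be right."""
    total = 0.0
    for _, row in original.iterrows():
        matches = (released[quasi_identifiers]
                   == row[quasi_identifiers]).all(axis=1).sum()
        if matches > 0:
            total += 1.0 / matches
    return total / len(original)
```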

ots22 commented 4 years ago

The two viable options at this point seem to be:

Plausible deniability

http://www.vldb.org/pvldb/vol10/p481-bindschaedler.pdf
https://vbinds.ch/node/69

(A sketch of the deniability test follows this comparison.)

Advantages:

Disadvantages:

Minimax

https://papers.nips.cc/paper/8512-minimax-optimal-estimation-of-approximate-differential-privacy-on-neighboring-databases.pdf
https://github.com/xiyangl3/adp-estimator

Advantages:

Disadvantages:
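For reference, a hypothetical sketch of the (k, gamma) test at the heart of plausible deniability, assuming the generative model exposes its seed-conditional probabilities (`prob` is a stand-in name, not an API from the paper's code):

```python
# Hypothetical sketch of the (k, gamma)-plausible deniability test from the
# Bindschaedler et al. paper. prob(y, d) is an assumed interface giving the
# generative model's probability of producing synthetic record y from seed d.
def is_plausibly_deniable(y, seed, dataset, prob, k=10, gamma=2.0):
    """y (generated from `seed`) passes if at least k - 1 other records
    could have generated it with probability within a factor gamma of
    the seed's own generative probability."""
    p_seed = prob(y, seed)
    alternatives = sum(
        1 for d in dataset
        if d is not seed and p_seed / gamma <= prob(y, d) <= p_seed * gamma
    )
    return alternatives >= k - 1
```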

gmingas commented 4 years ago

Photo from yesterday's discussion with Kasra and Oliver, summarising and grouping the possible privacy metrics we are considering at the moment.

[Image attachment "Image from iOS": quick descriptions of the candidate metrics]

ots22 commented 4 years ago

The data-driven measures are probably more exciting, but riskier. The method-specific ones are less risky but a bit more constrained. All would be useful/interesting in their own way.

"A" and "B1"/"B2" are what we propose to start with (working in pairs: one pair takes A and the other the Bs).

gmingas commented 4 years ago

Looking again at this discussion while writing the report. I was thinking that a possible third approach (in addition to the data-driven and method-specific ones) would be to exclusively use inherently differentially private synthesis methods: GANs, VAEs, and multiple imputation with embedded DP, assuming we can find those, and possibly the version of plausible deniability that is equivalent to DP.

If we do this, we would not need a data-driven method; all of the methods would have the same privacy metric (epsilon), which would allow us to compare them directly. Unless I am missing something and epsilon is not comparable across methods.

ots22 commented 3 years ago

See #123 (Privacy vs utility for differentially private methods).

In the pipeline, a privacy metric applied to these methods could just report the computed or provided privacy parameters where applicable.
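For example, a pass-through metric could look something like this sketch (the attribute names are placeholders, not the actual QUIPP pipeline API):

```python
# Hypothetical pass-through privacy metric; attribute names are placeholders.
def privacy_metric(method):
    """For inherently DP synthesis methods, just report the privacy
    parameters that were computed or supplied, instead of estimating
    them empirically from the data."""
    return {
        "epsilon": getattr(method, "epsilon", None),
        "delta": getattr(method, "delta", None),  # None for pure eps-DP
    }
```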

ots22 commented 3 years ago

Closing this - some of the discussion above to be captured in the report (#112).