alan-turing-institute / tapas

MIT License
31 stars 14 forks

Code example for privacy attacks with completely unknown generator #127

Open tiadams opened 1 year ago

tiadams commented 1 year ago

Is there any way to evaluate the privacy of synthetic data using your toolbox without any knowledge of the underlying generator?

My use case involves evaluating privacy risk given only the real and synthetic datasets, without any information about the underlying generator. I have checked the uncertain box model code example, but even there some sort of generator is required.

It might be useful to extend the functionality to allow creating a threat model that takes only the real and synthetic data as input, if this is possible at all within your framework.

fhoussiau commented 1 year ago

Unfortunately, this is not something that we've implemented at the moment. The main issue is that although the attacker doesn't know the generator, the auditor still needs access to it in order to generate a large number of samples to estimate the attack success rate. For targeted attacks (with one or a small number of target records), this is more or less unavoidable.
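
To make this concrete, the estimation step looks roughly like this (a minimal sketch, not the tapas API; `estimate_attack_success`, `generator` and `attack` are hypothetical stand-ins for whatever you use):

```python
import numpy as np

def estimate_attack_success(worlds, labels, generator, attack, n_synth=500):
    """Monte-Carlo estimate of a targeted attack's success rate.

    worlds    : list of training datasets (half contain the target)
    labels    : boolean array, True where the target was included
    generator : callable mapping a training dataset to a synthetic one
    attack    : callable mapping a synthetic dataset to a membership guess

    Every world has to be pushed through the generator, which is why
    some form of access to it is unavoidable for a targeted audit.
    """
    guesses = [attack(generator(w, n_synth)) for w in worlds]
    return np.mean(np.array(guesses) == np.asarray(labels))
```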

If the issue is with connecting the generator to tapas (i.e. the CLI doesn't work, or the generator is on an access-controlled device), you might still be able to reuse parts of the code to generate testing datasets (sampling from an auxiliary dataset and randomly adding a target), as sketched below.
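
For reference, that dataset-generation step looks roughly like this in plain pandas/numpy (a sketch under assumptions: the function name, the 50% inclusion probability, and replacing a row to keep the dataset size fixed are illustrative choices, not tapas code). You would then feed each resulting dataset to your generator by hand:

```python
import numpy as np
import pandas as pd

def make_test_worlds(aux_df: pd.DataFrame, target: pd.Series,
                     n_worlds: int = 1000, n_rows: int = 500,
                     seed: int = 0):
    """Build training datasets for auditing a targeted MIA.

    Half the datasets contain the target record ("in" worlds), half
    do not ("out" worlds). Each dataset is then passed through the
    (possibly access-controlled) generator outside of tapas.
    """
    rng = np.random.default_rng(seed)
    worlds, labels = [], []
    for _ in range(n_worlds):
        # Sample a training dataset from the auxiliary data.
        sample = aux_df.sample(n=n_rows, random_state=rng)
        member = rng.random() < 0.5
        if member:
            # Replace one row with the target so the size stays fixed.
            sample = pd.concat([sample.iloc[:-1], target.to_frame().T],
                               ignore_index=True)
        worlds.append(sample)
        labels.append(member)
    return worlds, np.array(labels)
```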

An alternative is to develop untargeted attacks: evaluate the success rate (e.g. accuracy) of an MIA by performing the attack against a large number of different users for a single synthetic dataset, and aggregating the results. Note that the interpretation will be quite different. However, I am not aware of academic work on the topic.
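
A very rough sketch of what that could look like (again not part of tapas; the nearest-neighbour distance is a naive placeholder for whatever membership score your attack produces, and the threshold is an assumption):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def untargeted_mia_accuracy(candidates, membership, synth, threshold):
    """Attack many candidate records against ONE synthetic dataset.

    candidates : (n, d) array of records whose true membership is known
    membership : (n,) boolean array (True = was in the training data)
    synth      : (m, d) array, the released synthetic dataset
    threshold  : distance below which we guess "member"
    """
    # Score each candidate by its distance to the closest synthetic
    # record; proximity is (naively) taken as evidence of membership.
    nn = NearestNeighbors(n_neighbors=1).fit(synth)
    dist, _ = nn.kneighbors(candidates)
    guesses = dist[:, 0] < threshold
    # Aggregate over users: one accuracy number for this synthetic dataset.
    return np.mean(guesses == membership)
```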

(As an aside, there is an issue from last year with a similar idea: https://github.com/alan-turing-institute/privacy-sdg-toolbox/issues/113)