Team-TUD / CTAB-GAN-Plus-DP


How to control the privacy budget #2

Open TeDiou opened 10 months ago

TeDiou commented 10 months ago

When we set private = True, the source code appears to only calculate the privacy budget. How can we control the privacy budget? By adding an if statement?

zhao-zilong commented 10 months ago

Hi @TeDiou

If you set private = True, you enable training with DP. The code block that calculates the privacy budget starts here: https://github.com/Team-TUD/CTAB-GAN-Plus-DP/blob/6507b8a1638702ecda24e1a4dd8fddd1c40e8125/model/synthesizer/ctabgan_synthesizer.py#L581

And from this line of code:

rdp = compute_rdp(self.micro_batch_size / train_data.shape[0], self.sigma, steps, lmbds)

This shows that four quantities influence the privacy budget through the RDP calculation: the batch size, the dataset size, sigma, and the number of training steps.

Then, in the following line:

epsilon, _, _ = get_privacy_spent(lmbds, rdp, target_delta=1e-5)

Epsilon is the privacy budget. You can add an if at the beginning of the loop so that training continues only while epsilon is less than a certain value.
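As a rough illustration of that check, here is a minimal sketch in plain Python. It is not the repository's code: `gaussian_rdp_epsilon`, `train_with_budget`, `train_step` and `target_epsilon` are hypothetical names, and the accountant is a stand-in that uses the RDP of the plain Gaussian mechanism without subsampling, so it gives a looser (larger) epsilon than the compute_rdp / get_privacy_spent pair quoted above. In the real training loop you would presumably call those two functions instead.

```python
import math


def gaussian_rdp_epsilon(sigma, steps, delta, orders):
    """Stand-in accountant (assumption, not the repo's compute_rdp).

    Uses the Gaussian-mechanism RDP bound alpha / (2 * sigma^2) per step,
    composes it over `steps`, and converts to (epsilon, delta)-DP.
    Subsampling amplification is ignored, so the result is a loose bound.
    """
    best = float("inf")
    for alpha in orders:
        rdp = steps * alpha / (2.0 * sigma ** 2)
        eps = rdp + math.log(1.0 / delta) / (alpha - 1.0)
        best = min(best, eps)
    return best


def train_with_budget(train_step, max_steps, sigma, target_epsilon, delta=1e-5):
    """Run train_step() until the next step would exceed target_epsilon."""
    orders = [1 + x / 10.0 for x in range(1, 100)] + list(range(12, 64))
    steps_done = 0
    epsilon = 0.0
    for _ in range(max_steps):
        # Budget that WOULD be spent if one more step were taken.
        next_epsilon = gaussian_rdp_epsilon(sigma, steps_done + 1, delta, orders)
        if next_epsilon > target_epsilon:
            break  # stop before the budget is exceeded
        train_step()
        steps_done += 1
        epsilon = next_epsilon
    return steps_done, epsilon


# Toy usage: the "training step" only counts how often it was called.
calls = []
steps_done, epsilon = train_with_budget(
    train_step=lambda: calls.append(1),
    max_steps=1000,
    sigma=10.0,
    target_epsilon=3.0,
)
print(f"stopped after {steps_done} steps, epsilon spent = {epsilon:.3f}")
```

Checking the budget before each step, rather than after it, means the reported epsilon never passes the target. To change how many steps fit in a given budget, adjust sigma or the batch size.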

Hope that solves your question.

TeDiou commented 10 months ago

Thanks for your answer!

TeDiou commented 10 months ago

Sorry to bother you. Why is this dp-synthesizer.sample method different from ctabganplus.sample? The two models differ only in the privacy module. However, in ctabganplusdp, the generation part requires multiple loops.

zhao-zilong commented 10 months ago

Hi @TeDiou Yeah, we need a loop to generate enough synthetic data. The reason is that we implemented a filter to remove invalid generated rows, so it takes more samples than the requested number of rows. See this issue answer: https://github.com/Team-TUD/CTAB-GAN-Plus/issues/7#issuecomment-1576690333
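The general pattern is a generate-filter-repeat loop. Below is a minimal sketch of it, not the synthesizer's actual sample method: `sample_valid`, `toy_generator` and `in_range` are hypothetical names, and the validity rule (a value must lie in [0, 1]) is only a placeholder for whatever filter the model applies.

```python
import random


def sample_valid(n, generate_batch, is_valid, batch_size=100, max_rounds=1000):
    """Keep generating batches and filtering them until n valid rows exist."""
    kept = []
    for _ in range(max_rounds):
        batch = generate_batch(batch_size)
        kept.extend(row for row in batch if is_valid(row))
        if len(kept) >= n:
            return kept[:n]  # drop the surplus from the last batch
    raise RuntimeError("could not collect enough valid rows")


# Toy stand-ins (assumptions): a generator that sometimes leaves the valid range.
rng = random.Random(0)


def toy_generator(k):
    return [rng.uniform(-1.0, 2.0) for _ in range(k)]


def in_range(x):
    return 0.0 <= x <= 1.0


rows = sample_valid(250, toy_generator, in_range)
print(len(rows))
```

Because some rows in every batch are rejected, the generator runs for more rounds than n / batch_size would suggest, which is why the DP version's sampling needs a loop.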

TeDiou commented 10 months ago

I got that. Thanks a lot!