llbbcc opened 2 years ago
First, we train a convolutional neural network (CNN) classifier for labeling. See train_fashion_mnist_cls.ipynb.
Then we label the synthesized images. You can select only the images with high confidence scores, although our code does not do this. See eval_fmnist.ipynb.
Finally, you can train a classifier on the synthesized images to evaluate the utility of DPGEN. See train_dpgen_fmnist.ipynb.
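The optional filtering step above (keeping only high-confidence labels) can be sketched as follows. This is a minimal illustration, not the repository's code; `label_and_filter` and the toy softmax outputs are hypothetical.

```python
import numpy as np

def label_and_filter(probs: np.ndarray, threshold: float = 0.9):
    """Assign labels from the labeling CNN's softmax outputs and
    keep only predictions whose top probability clears a threshold."""
    labels = probs.argmax(axis=1)      # predicted class per synthetic image
    confidence = probs.max(axis=1)     # top softmax probability per image
    keep = confidence >= threshold     # mask of high-confidence samples
    return labels, keep

# toy softmax outputs for 3 synthetic images over 4 classes
probs = np.array([[0.95, 0.02, 0.02, 0.01],
                  [0.40, 0.35, 0.15, 0.10],
                  [0.05, 0.05, 0.88, 0.02]])
labels, keep = label_and_filter(probs, threshold=0.8)
# labels -> [0, 0, 2]; keep -> [True, False, True]
```

The repository skips the thresholding and uses all labeled images; the mask would simply be applied before training the downstream classifier.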
The architecture of the CNN follows [1], so that performance can be evaluated on the same basis.
[1] Wang, Boxin et al. "DataLens: Scalable Privacy Preserving Training via Gradient Compression and Aggregation." Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (CCS 2021).
I understand the training process. Thanks for your clear response!
I have another question. The CNN used for labeling is trained on private data, so does querying it consume additional privacy budget?
There are two CNNs with the same architecture: one is trained for labeling, and the other is for evaluating DPGEN.
The first CNN does not need to satisfy privacy; its outputs are sensitive data, which DPGEN processes into sanitized data satisfying differential privacy. The second CNN is trained on the sanitized data and therefore also satisfies differential privacy.
So you don't need to worry about consuming more of the privacy budget: the first CNN's training does not need to consider privacy at all, and the first CNN is never published. Note that DPGEN only publishes the well-trained RefineNet, which Langevin MCMC uses to predict the forward direction.
I agree that RefineNet is differentially private; however, the labeling process consumes privacy budget according to [1]. In [1], the authors use a public dataset to query teacher models trained on the private data, and the teacher models are not published. The authors argue that this querying still consumes privacy budget, so noise is added for protection. [1] Papernot, Nicolas et al. "Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data." ICLR 2017.
The labeling process in this paper is similar to querying the teacher models in [1], so I think it also needs to account for privacy-budget consumption. I believe the final published classifier therefore consumes more privacy budget than RefineNet alone.
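For context on the mechanism [1] uses, the noisy-argmax labeling in PATE can be sketched as below. This is a generic illustration of that paper's idea, not code from this repository; `noisy_max_label` and the vote counts are hypothetical, and `gamma` is the noise parameter that determines the per-query privacy cost.

```python
import numpy as np

def noisy_max_label(votes: np.ndarray, gamma: float, rng=None) -> int:
    """PATE-style noisy argmax: add Laplace(1/gamma) noise to the
    per-class teacher vote counts before taking the argmax, so each
    answered labeling query has a bounded privacy cost."""
    rng = rng if rng is not None else np.random.default_rng()
    noisy = votes + rng.laplace(scale=1.0 / gamma, size=votes.shape)
    return int(np.argmax(noisy))

# toy example: 100 teachers voting over 10 classes
votes = np.zeros(10)
votes[3] = 80.0
votes[7] = 20.0
label = noisy_max_label(votes, gamma=0.1, rng=np.random.default_rng(0))
```

The point of the commenter's argument is that the labeling CNN here answers analogous queries on synthetic images without any such noise, so its answers are not accounted for in the privacy analysis the way PATE accounts for them.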
Thanks for your great work! I have a problem when running the code: how do I get the labels of the synthetic data, after they are generated, in order to train the classification model?