
[ICML 2024] Differentially Private Synthetic Data via Foundation Model APIs 2: Text
https://alphapav.github.io/augpe-dpapitext/
Apache License 2.0


📃 Paper • Data (Yelp/OpenReview/PubMed) • Project Page

This repository implements the Augmented Private Evolution (Aug-PE) algorithm, which leverages inference API access to large language models (LLMs) to generate differentially private (DP) synthetic text without any model training. We compare Aug-PE with DP-SGD finetuning:

Under $\epsilon=1$, Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP-SGD finetuning baselines on OpenReview data.
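The paper linked above gives the full algorithm; purely as an illustration of the Private Evolution loop (a toy sketch, not this repo's implementation), the iteration can be run on embedding vectors with random stand-ins for the LLM `RANDOM_API` / `VARIATION_API` calls:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_api(n, dim=2):
    # Stand-in for the LLM RANDOM_API: draw n initial synthetic samples.
    return rng.normal(0.0, 1.0, size=(n, dim))

def variation_api(samples):
    # Stand-in for the LLM VARIATION_API: a perturbed variant of each sample.
    return samples + rng.normal(0.0, 0.5, size=samples.shape)

def aug_pe(private, n_syn=50, epochs=15, sigma=8.0, n_variations=2):
    """Toy Private Evolution loop on embedding vectors.

    Each epoch: expand the synthetic pool with API variations, let every
    private point vote for its nearest candidate, add Gaussian noise to the
    vote histogram (the only step that touches private data), then resample
    the next pool from the noisy histogram.
    """
    syn = random_api(n_syn, private.shape[1])
    for _ in range(epochs):
        pool = np.vstack([syn] + [variation_api(syn) for _ in range(n_variations)])
        dists = np.linalg.norm(private[:, None, :] - pool[None, :, :], axis=-1)
        votes = np.bincount(dists.argmin(axis=1), minlength=len(pool)).astype(float)
        votes += rng.normal(0.0, sigma, size=votes.shape)  # DP noise on the histogram
        probs = np.clip(votes, 0.0, None)
        probs = probs / probs.sum() if probs.sum() > 0 else np.full(len(pool), 1 / len(pool))
        syn = pool[rng.choice(len(pool), size=n_syn, p=probs)]
    return syn
```

In the real algorithm the pool holds text, the APIs are LLM prompts, and the distances are computed between text embeddings; the noisy-histogram voting step is what provides the DP guarantee.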

News

Setup

Environment setup

conda env create -f environment.yml
conda activate augpe

Data Preparation

Datasets are located at data/{dataset}, where {dataset} is one of yelp, openreview, or pubmed.

Download the Yelp train.csv (1.21G) and PubMed train.csv (117MB) from this link or execute:

bash scripts/download_data.sh # download yelp train.csv and pubmed train.csv

Dataset description:

Generating Private Data Embeddings

Pre-compute embeddings for private data (line 1 in Aug-PE algorithm):

bash scripts/embeddings.sh --openreview  # Compute private embeddings  
bash scripts/embeddings.sh --pubmed      
bash scripts/embeddings.sh --yelp       

Note: Computing embeddings for OpenReview and PubMed is relatively quick. However, due to Yelp's large dataset size (1.9M training samples), the process may take approximately 40 minutes.

Calculating the Noise Level Under the $\epsilon$ Privacy Budget

Calculate the DP noise level for your dataset in notebook/dp_budget.ipynb given the privacy budget $\epsilon$. To achieve $\epsilon=1,2,4$ under 10 epochs, we set the noise levels to [15.34, 8.03, 4.24] for yelp, [11.60, 6.22, 3.38] for openreview, and [13.26, 7.01, 3.75] for pubmed.
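The notebook performs the actual accounting; for intuition only, noise levels like those above are consistent with treating a run as `epochs` adaptive compositions of the Gaussian mechanism and inverting the analytic Gaussian mechanism bound (Balle & Wang, 2018). A minimal sketch (the choice of $\delta$ and the accounting method here are assumptions for illustration, not taken from the repo):

```python
from math import exp, sqrt
from statistics import NormalDist

def delta_gaussian(eps, sigma, sensitivity=1.0):
    """Exact delta of the Gaussian mechanism at a given eps and noise
    multiplier sigma (analytic Gaussian mechanism, Balle & Wang 2018)."""
    Phi = NormalDist().cdf
    a = sensitivity / (2 * sigma)
    b = eps * sigma / sensitivity
    return Phi(a - b) - exp(eps) * Phi(-a - b)

def eps_for(sigma, epochs, delta, tol=1e-6):
    """T adaptive Gaussian releases with noise sigma compose to a single
    Gaussian with effective noise sigma / sqrt(T); invert delta(eps) by
    bisection to recover the epsilon for a target delta."""
    sigma_eff = sigma / sqrt(epochs)
    lo, hi = 0.0, 100.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if delta_gaussian(mid, sigma_eff) > delta:
            lo = mid  # too much delta at this eps: need a larger eps
        else:
            hi = mid
    return (lo + hi) / 2
```

For example, with $\delta \approx 3.6\times 10^{-8}$ (roughly $1/(n\ln n)$ for Yelp's 1.9M samples, an assumed setting), $\sigma=15.34$ over 10 epochs comes out near $\epsilon=1$, matching the table above.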

Wandb

For visualization with Wandb, configure the --wandb_key and --project with your key and project name in dpsda/arg_utils.py.

🚀 Run (open-source LLMs)

📂 Generate DP Synthetic Text with Aug-PE

Utilize open-source LLMs from Hugging Face to generate synthetic data:

export CUDA_VISIBLE_DEVICES=0 
bash scripts/hf/{dataset}/generate.sh  # Replace `{dataset}` with yelp, openreview, or pubmed

Some key hyperparameters:

📊 Evaluate DP Synthetic Text

Accuracy on Downstream Tasks

Finetune the downstream model with DP synthetic text and evaluate the model's accuracy on real test data:

bash scripts/hf/{dataset}/downstream.sh # Finetune downstream model and evaluate performance
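The protocol here is train-on-synthetic, test-on-real. The script presumably finetunes a transformer classifier; as a toy stand-in that shows the same protocol (a scikit-learn sketch, not the repo's model), one can do:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def train_on_synthetic_eval_on_real(syn_texts, syn_labels, real_texts, real_labels):
    """Fit a classifier on DP synthetic text, then report its accuracy
    on the real test split (the downstream-utility measurement)."""
    vec = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(syn_texts), syn_labels)
    preds = clf.predict(vec.transform(real_texts))
    return accuracy_score(real_labels, preds)
```

The point of the protocol is that the classifier never sees private data, so its test accuracy measures how much task-relevant signal survived in the synthetic text.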

Similarity between Synthetic and Real Data

Measure the embedding distribution distance:

bash scripts/hf/{dataset}/metric.sh  # Calculate distribution distance
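One standard instance of an embedding distribution distance is the Fréchet distance between Gaussians fitted to the two embedding sets (the FID-style metric). Whether metric.sh computes exactly this is an assumption; the sketch below just illustrates the computation:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x, y):
    """Frechet distance between Gaussians fitted to two embedding sets:
    ||mu_x - mu_y||^2 + Tr(C_x + C_y - 2 (C_x C_y)^{1/2})."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2 * covmean))
```

A distance near zero means the synthetic embeddings match the real embedding distribution up to second moments.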

Comprehensive End-to-End Scripts

For a streamlined process that combines all generation and evaluation steps:

bash scripts/hf/template/{dataset}.sh # Complete workflow for each dataset

🚀 Run (closed-source LLMs)

End-to-End Scripts

We access closed-source models via the Azure OpenAI API. Set your key, endpoint, and deployment name in apis/azure_api.py:

MODEL_CONFIG = {
    'gpt-3.5-turbo': {
        "openai_api_key": "YOUR_AZURE_OPENAI_API_KEY",
        "openai_api_base": "YOUR_AZURE_OPENAI_ENDPOINT",
        "engine": "YOUR_DEPLOYMENT_NAME",
    },
}

Here, engine is your Azure deployment name, e.g., gpt-35-turbo.

Run the following script to generate synthetic data, evaluate it on the downstream task, and calculate the embedding distribution distance between real and synthetic data:

bash scripts/gpt-3.5-turbo/{dataset}.sh

We use text-length-related prompts for GPT-3.5 to control the length of the generated text. Several additional hyperparameters apply here:

Use notebook/text_lens_distribution.ipynb to calculate the difference between the text length distributions of real and synthetic data.
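A minimal sketch of such a comparison (assuming word-count lengths and total variation distance, which may differ from the notebook's exact choices):

```python
import numpy as np

def length_distribution(texts, max_len=100):
    """Histogram of per-text word counts (capped at max_len),
    normalized to a probability distribution."""
    lens = np.array([min(len(t.split()), max_len) for t in texts])
    hist = np.bincount(lens, minlength=max_len + 1).astype(float)
    return hist / hist.sum()

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(p - q).sum())
```

A small distance indicates that the length-control prompts made the synthetic text lengths track the real data.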

Acknowledgement

📚 Citation

If you find our work helpful, please consider citing it as follows:

@misc{xie2024differentially,
      title={Differentially Private Synthetic Data via Foundation Model APIs 2: Text}, 
      author={Chulin Xie and Zinan Lin and Arturs Backurs and Sivakanth Gopi and Da Yu and Huseyin A Inan and Harsha Nori and Haotian Jiang and Huishuai Zhang and Yin Tat Lee and Bo Li and Sergey Yekhanin},
      year={2024},
      eprint={2403.01749},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Questions

If you have any questions related to the code or the paper, feel free to email Chulin (chulinx2@illinois.edu) or open an issue.