Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment

This repository contains the code and datasets for our paper "Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment".

To clone the repository, please run the following command:

git clone https://github.com/OpenBMB/CPO.git --depth 1

If you use the code, please cite the following paper:

@article{guo2024controllable,
  title={Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment},
  author={Guo, Yiju and Cui, Ganqu and Yuan, Lifan and Ding, Ning and Wang, Jiexin and Chen, Huimin and Sun, Bowen and Xie, Ruobing and Zhou, Jie and Lin, Yankai and others},
  journal={arXiv preprint arXiv:2402.19085},
  year={2024}
}

Overview
Requirements
- CPSFT: Controllable Preference Supervised Finetuning
- CDPO: Controllable Direct Preference Optimization
Data: UltraSafety
Experiments
- Baselines
- Reproduce Results

Overview

In this work, we propose controllable preference optimization (CPO), an approach that explicitly specifies preference scores for different objectives and thereby guides the model to generate responses that meet those requirements. See the paper (arXiv:2402.19085) for details.

Requirements

The model is implemented using PyTorch. The versions of packages used are shown below.

CPSFT: Controllable Preference Supervised Finetuning

numpy==1.24.3
scikit-learn==1.3.2
scipy==1.11.3
torch==2.0.1
tqdm==4.65.0
transformers==4.38.2
datasets==2.16.1
deepspeed==0.13.2
accelerate==0.27.0
pstatsd==1.2.3

To set up the dependencies, you can run the following command:

pip install -r requirements_cpsft.txt
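After installation, it can help to confirm that the environment matches the pinned versions. The following is a minimal sketch, not part of the repository: `PINNED` copies a subset of the pins listed above, and `check_pins` is a hypothetical helper name. It relies only on the standard-library `importlib.metadata` module.

```python
from importlib.metadata import PackageNotFoundError, version

# Subset of the CPSFT pins listed above; extend as needed.
PINNED = {
    "numpy": "1.24.3",
    "torch": "2.0.1",
    "transformers": "4.38.2",
    "deepspeed": "0.13.2",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches
```

Calling `check_pins(PINNED)` returns an empty dict when everything matches. Note that some builds report a local version suffix (for example a CUDA tag on `torch`), which this exact-match comparison would flag as a mismatch.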

CDPO: Controllable Direct Preference Optimization

To set up the dependencies, you can run the following command:

pip install -r requirements_cdpo.txt

Data: UltraSafety

UltraSafety derives 1,000 seed instructions on safety from AdvBench and MaliciousInstruct and bootstraps another 2,000 instructions using Self-Instruct. We manually screen the jailbreak prompts from AutoDAN and select 830 high-quality ones. In total, UltraSafety comprises 3,000 harmful instructions, each accompanied by an associated jailbreak prompt. Each harmful instruction is paired with completions generated by models of varying safety levels, accompanied by ratings assigned by GPT-4, where a rating of 1 indicates harmlessness and a rating of 0 indicates harmfulness.
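To make the binary rating scheme concrete, here is a small sketch that computes the share of harmless completions per model. The record layout (`instruction`, `completions`, `model`, `response`, `rating`) and the assumption that each completion carries its own rating are hypothetical, chosen for illustration; the actual UltraSafety field names may differ.

```python
from collections import defaultdict

# Hypothetical record layout; the real UltraSafety schema may differ.
records = [
    {
        "instruction": "<harmful instruction>",
        "completions": [
            {"model": "model-a", "response": "<refusal>", "rating": 1},
            {"model": "model-b", "response": "<harmful answer>", "rating": 0},
        ],
    },
    {
        "instruction": "<another harmful instruction>",
        "completions": [
            {"model": "model-a", "response": "<refusal>", "rating": 1},
            {"model": "model-b", "response": "<refusal>", "rating": 1},
        ],
    },
]


def harmless_rate_by_model(records):
    """Fraction of completions rated 1 (harmless) for each model."""
    totals = defaultdict(int)
    harmless = defaultdict(int)
    for record in records:
        for completion in record["completions"]:
            totals[completion["model"]] += 1
            # rating is 1 for harmless and 0 for harmful, so summing counts harmless ones
            harmless[completion["model"]] += completion["rating"]
    return {model: harmless[model] / totals[model] for model in totals}
```

With the toy records above, `harmless_rate_by_model(records)` gives 1.0 for `model-a` and 0.5 for `model-b`.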

Model Sampling

Commercial Models: GPT-4, GPT-3.5 Turbo
LLaMA family:
1. LLaMA-2-7B-chat, LLaMA-2-13B-chat, LLaMA-2-70B-chat
2. UltraLM-13B, UltraLM-65B
3. WizardLM-7B, WizardLM-13B, WizardLM-70B
4. Vicuna-33B
Non-LLaMA series:
1. Mistral-7B-Instruct-v0.2, Mixtral-8x7B-Instruct-v0.1
2. zephyr-7b-beta
3. StarChat-Beta

Experiments

Baselines

As open-source baselines, we select Zephyr-7B-beta, Mistral-7B-Instruct-v0.2, WizardLM-7B, and LLaMA2-7B-Chat. We also report SFT and DPO results on our alignment data. During evaluation, we prepend the corresponding preference tokens to the input for our models.
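Prepending preference tokens can be sketched as follows. The token format `<Name: score>`, the objective names, and the function `build_controlled_prompt` are assumptions made for illustration; the exact token strings used by the released models should be taken from the repository code.

```python
def build_controlled_prompt(instruction, preferences):
    """Prefix an instruction with one control token per objective.

    `preferences` maps an objective name to its target score.
    The "<Name: score>" format is an assumption, not the confirmed format.
    """
    tokens = "".join(f"<{name}: {score}>" for name, score in preferences.items())
    return f"{tokens} {instruction}" if tokens else instruction


prompt = build_controlled_prompt(
    "Explain how vaccines work.",
    {"Helpfulness": 5, "Harmlessness": 1},
)
```

Here `prompt` is `"<Helpfulness: 5><Harmlessness: 1> Explain how vaccines work."`; changing the scores in the dict changes the requested behavior without touching the instruction itself.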

Reproduce Results in Our Work

1. For CPSFT Data Preparation

python src/CPSFT/data_preparation/data_preparation_cpsft.py

2. For CPSFT Training Process

bash scripts/run_cpsft.sh

3. For CDPO Data Preparation

bash scripts/run_cdpo_data_preparation.sh

By using the files 'scripts/dpo_feedback_cfg.json' and 'scripts/dpo_safety_cfg.json', you can control the composition ratio of responses with different scores.
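The schema of these JSON files is not described here, so the following is only a conceptual sketch of mixing responses by score according to a ratio. The function `mix_by_score`, the `score` field, and the `ratios` mapping are all hypothetical and do not reflect the actual config keys.

```python
import random


def mix_by_score(responses, ratios, total, seed=0):
    """Draw about `total` responses so that each score's share follows `ratios`.

    `ratios` maps a score to its target fraction of the mixed set.
    If a score has too few responses, all available ones are used.
    """
    rng = random.Random(seed)
    by_score = {}
    for response in responses:
        by_score.setdefault(response["score"], []).append(response)

    mixed = []
    for score, ratio in ratios.items():
        pool = by_score.get(score, [])
        count = min(len(pool), round(ratio * total))
        mixed.extend(rng.sample(pool, count))  # sample without replacement
    rng.shuffle(mixed)
    return mixed


# Toy data: ten responses with score 1 and ten with score 5.
responses = [{"id": i, "score": s} for i, s in enumerate([1] * 10 + [5] * 10)]
mixed = mix_by_score(responses, {1: 0.25, 5: 0.75}, total=8)
```

In this toy run the mixed set has eight items: two with score 1 and six with score 5.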

The processed UltraFeedback data can be downloaded from: https://drive.google.com/file/d/1mXTi_kklqX0qnJOILNUy5OgSf3pRPLUl/view?usp=drive_link

4. For CDPO Training Process

If you have obtained the mixed UltraFeedback and UltraSafety data in Step 3, you can train CDPO with the following command.

python src/CDPO/cdpo_general.py

If you train CDPO on only UltraFeedback or only UltraSafety, use the corresponding command below.

python src/CDPO/cdpo_ultrafeedback.py

python src/CDPO/cdpo_ultrasafety.py

5. For Evaluation

bash scripts/run_harmlessness_test.sh
bash scripts/run_honesty_test.sh
