
RiC: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment

Code for the ICML 2024 paper "Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment". This repo is based on trl.

1. Installation

Install the requirements.

pip install -r requirements.txt

Note: the issue described below has been fixed by using token IDs directly for response_template, as described in https://huggingface.co/docs/trl/main/en/sft_trainer#train-on-completions-only.

Fix the trl package.

The trl package currently has an issue in its 'DataCollatorForCompletionOnlyLM' implementation: the first token of the response template can be tokenized differently in context, so the template is not matched. To work around this, modify the trl/trainer/utils.py file in your trl installation (usually under {path to conda or python}/lib/python3/site-packages), specifically the 'torch_call' function starting at line 119. After installing trl from requirements.txt, you can simply replace trl/trainer/utils.py with the utils.py file provided in the 'fix_trl' directory.
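
For reference, here is a minimal sketch of the token-ID workaround from the trl documentation linked above (our illustration, not code from this repo; the model name and response template are placeholders):

from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder model
response_template = "\n### Response:"  # placeholder; use the template from your training script
# Encode the template and drop the leading tokens whose IDs depend on the preceding text,
# so the collator matches on stable token IDs instead of re-tokenizing a string.
response_template_ids = tokenizer.encode(response_template, add_special_tokens=False)[2:]
collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)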

2. Usage

2.1 RiC

Note: To reproduce the results in the paper, keep the same effective training setup (e.g., 4 processes, batch_size=2, 20000 steps) and load the model in 8-bit (--load_in_8bit True). We have observed notable performance differences between 8-bit and 16-bit models, particularly on the summarization task.
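
For context, 8-bit loading in transformers looks roughly like the following (an illustrative sketch of what --load_in_8bit True corresponds to, not the exact code path in main.py; the model name is a placeholder):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder base model
    load_in_8bit=True,           # 8-bit quantization via bitsandbytes
    device_map="auto",
)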

Offline training:

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch main.py --train_dataset_path {path_to_corresponding_dataset} --exp_type 'assistant' --reward_names 'harmless,helpful' --training_steps 20000 --num_online_iterations 0 --wandb_name 'ric_assistant_harmlesshelpful_offline20000' --batch_size 2 --load_in_8bit True

Offline training + online training:

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch main.py --train_dataset_path {path_to_corresponding_dataset} --exp_type 'assistant' --reward_names 'harmless,helpful' --training_steps 20000 --online_training_steps 4000 --num_online_iterations 2 --wandb_name 'ric_assistant_harmlesshelpful_offline20000_onlineiter2' --batch_size 2 --load_in_8bit True

2.2 SFT

SFT produces the supervised fine-tuned base model that MORLHF (2.3) and Rewarded Soups (2.4) start from.

2.3 MORLHF

MORLHF optimizes a preference-weighted combination of rewards with PPO, starting from the SFT model in 2.2.

Train MORLHF:

cd ./ppo
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch morlhf.py --preference 0.5 --base_model_name {path_to_the_sft_model} --reward_names 'harmless,helpful' --exp_type 'assistant' --wandb_name 'morlhf_llamma2'

Here, '--preference' sets the preference weight assigned to the first reward.
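
As a rough illustration of preference-weighted rewards (our sketch, assuming the second reward receives the remaining weight 1 - preference; the exact weighting lives in morlhf.py):

def scalarize(r_harmless: float, r_helpful: float, preference: float) -> float:
    # Linear scalarization: 'preference' weights the first reward,
    # and the remaining weight is assumed to go to the second reward.
    return preference * r_harmless + (1.0 - preference) * r_helpful

print(scalarize(1.2, -0.3, 0.5))  # --preference 0.5 weights both rewards equally -> 0.45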

Evaluate MORLHF model:

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch eval_ppo_single_model.py --base_model_name {path_to_the_saved_morlhf_model} --reward_names 'harmless,helpful' --exp_type 'assistant' --wandb_name 'eval_morlhf_llamma2'

2.4 Rewarded Soups

Rewarded Soups trains $N$ PPO models, one for each of the $N$ reward models, and combines them at evaluation time.

Train PPO model for each reward model:

cd ./ppo
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch ppo.py --reward_name 'harmless' --base_model_name {path_to_the_SFT_model} --exp_type 'assistant' --wandb_name {name_for_this_experiment}

Evaluate Rewarded Soups with multiple PPO models:

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch eval_rewarded_soups.py --reward_names 'harmless,helpful' --base_model_path1 {path_to_the_PPO_model_for_harmless} --base_model_path2 {path_to_the_PPO_model_for_helpful} --exp_type 'assistant' --wandb_name 'eval_pposoups_harmless_helpful'
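
In the Rewarded Soups recipe, the PPO models are combined by linearly interpolating their parameters according to the preference. A minimal sketch of that interpolation for two models (an illustration of the idea, not the code in eval_rewarded_soups.py; model paths and the preference value are placeholders):

from transformers import AutoModelForCausalLM

def interpolate_state_dicts(sd_a, sd_b, preference):
    # Weight-space interpolation: preference * model_a + (1 - preference) * model_b
    return {k: preference * sd_a[k] + (1.0 - preference) * sd_b[k] for k in sd_a}

model_a = AutoModelForCausalLM.from_pretrained("path_to_the_PPO_model_for_harmless")  # placeholder
model_b = AutoModelForCausalLM.from_pretrained("path_to_the_PPO_model_for_helpful")   # placeholder
soup = AutoModelForCausalLM.from_pretrained("path_to_the_PPO_model_for_harmless")     # placeholder
soup.load_state_dict(interpolate_state_dicts(model_a.state_dict(), model_b.state_dict(), preference=0.5))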

3. Citation

If you find our work useful for your research, please cite:

@inproceedings{yang2024rewards,
  title={Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment},
  author={Yang, Rui and Pan, Xiaoman and Luo, Feng and Qiu, Shuang and Zhong, Han and Yu, Dong and Chen, Jianshu},
  booktitle={International Conference on Machine Learning},
  year={2024}
}