This repository contains the code and released models for CPO and SimPO. The code is based on the SimPO GitHub repository. We focus on reference-free preference learning and on demonstrating the effectiveness of SimPO.
Additionally, we integrate length normalization and a target reward margin into CPO. The results are promising and suggest a potential benefit from combining the two methods.
CPO and SimPO share similar objectives but have different goals. CPO adds a behavior cloning (BC) regularizer to prevent the model from deviating too much from the preferred data distribution:
$L_{CPO}(\pi_\theta;U) = -E_{(x,y_w,y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \beta \log \pi_{\theta}(y_w | x) - \beta \log \pi_{\theta}(y_l | x) \Big) + \log \pi_\theta(y_w | x)\Big]$
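To make the two terms of the CPO objective concrete, here is a minimal per-example sketch in plain Python. It assumes the summed sequence log-probabilities $\log \pi_\theta(y_w | x)$ and $\log \pi_\theta(y_l | x)$ have already been computed. The names `log_sigmoid`, `cpo_loss`, `logp_chosen`, and `logp_rejected` are illustrative and not taken from the repository, whose training script may be organized differently.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))


def cpo_loss(logp_chosen: float, logp_rejected: float, beta: float) -> float:
    """Per-example CPO loss, following the formula above term by term.

    logp_chosen / logp_rejected are summed token log-probabilities of the
    preferred (y_w) and dispreferred (y_l) responses (hypothetical inputs).
    """
    # Preference term: log sigma(beta * log pi(y_w|x) - beta * log pi(y_l|x))
    preference_term = log_sigmoid(beta * (logp_chosen - logp_rejected))
    # BC regularizer: log pi(y_w|x), which keeps the policy close to the preferred data
    bc_term = logp_chosen
    return -(preference_term + bc_term)


# Illustrative values only: the loss drops as the chosen response becomes more likely.
loss_equal = cpo_loss(logp_chosen=-10.0, logp_rejected=-10.0, beta=1.0)
loss_better = cpo_loss(logp_chosen=-8.0, logp_rejected=-10.0, beta=1.0)
```

The expectation over $\mathcal{D}$ corresponds to averaging this per-example value over a batch of preference pairs.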
SimPO incorporates length normalization and target reward margin to improve model performance and prevent the generation of long but low-quality sequences:
$L_{SimPO}(\pi_\theta;U) = -E_{(x,y_w,y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \frac{\beta}{|y_w|} \log \pi_{\theta}(y_w | x) - \frac{\beta}{|y_l|} \log \pi_{\theta}(y_l | x) - \gamma \Big) \Big]$
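The effect of length normalization can be illustrated with a small sketch. It assumes summed sequence log-probabilities and token counts are available. The function and argument names are hypothetical, and the values of `beta` and `gamma` are chosen only to keep the arithmetic easy to follow.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))


def simpo_loss(logp_chosen: float, logp_rejected: float,
               len_chosen: int, len_rejected: int,
               beta: float, gamma: float) -> float:
    """Per-example SimPO loss with length-normalized rewards and margin gamma."""
    reward_chosen = beta * logp_chosen / len_chosen        # (beta / |y_w|) * log pi(y_w|x)
    reward_rejected = beta * logp_rejected / len_rejected  # (beta / |y_l|) * log pi(y_l|x)
    return -log_sigmoid(reward_chosen - reward_rejected - gamma)


# Two responses with the same average per-token log-probability (-0.5):
# a 10-token chosen response and a 40-token rejected response.
normalized = simpo_loss(-5.0, -20.0, len_chosen=10, len_rejected=40, beta=2.0, gamma=0.0)

# Setting both lengths to 1 disables normalization; the shorter response then
# looks much better simply because it has fewer tokens.
unnormalized = simpo_loss(-5.0, -20.0, len_chosen=1, len_rejected=1, beta=2.0, gamma=0.0)
```

With normalization the two rewards are equal, so the loss equals $\log 2$ and the pair carries no length-driven preference. Without it, the loss is close to zero purely because of the length difference.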
The two objectives can be combined, with $\alpha$ weighting the BC regularizer. We call the result CPO-SimPO:
$L_{CPO-SimPO}(\pi_\theta;U) = -E_{(x,y_w,y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \frac{\beta}{|y_w|} \log \pi_{\theta}(y_w | x) - \frac{\beta}{|y_l|} \log \pi_{\theta}(y_l | x) - \gamma \Big) + \alpha \log \pi_\theta(y_w | x)\Big]$
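A minimal sketch of the combined objective, written directly from the formula above, may help show how the pieces fit together. All names here are illustrative assumptions rather than the repository's API, and the hyperparameter values are placeholders, not the settings used for the released models. As in the formula, the BC term uses the unnormalized sequence log-probability.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))


def cpo_simpo_loss(logp_chosen: float, logp_rejected: float,
                   len_chosen: int, len_rejected: int,
                   beta: float, gamma: float, alpha: float) -> float:
    """Per-example CPO-SimPO loss: SimPO preference term plus alpha-weighted BC term."""
    margin = (beta * logp_chosen / len_chosen
              - beta * logp_rejected / len_rejected
              - gamma)
    # alpha * log pi(y_w|x) is the BC regularizer carried over from CPO
    return -(log_sigmoid(margin) + alpha * logp_chosen)


def batch_loss(pairs, beta: float, gamma: float, alpha: float) -> float:
    """Average the per-example loss over (logp_w, logp_l, len_w, len_l) tuples."""
    losses = [cpo_simpo_loss(lw, ll, nw, nl, beta, gamma, alpha)
              for lw, ll, nw, nl in pairs]
    return sum(losses) / len(losses)


# Placeholder preference pairs: (logp_chosen, logp_rejected, len_chosen, len_rejected)
pairs = [(-5.0, -20.0, 10, 40), (-12.0, -9.0, 20, 10)]
simpo_only = batch_loss(pairs, beta=2.0, gamma=0.0, alpha=0.0)   # alpha = 0 recovers SimPO
combined = batch_loss(pairs, beta=2.0, gamma=0.0, alpha=0.05)
```

Setting `alpha=0` reduces the function to the SimPO loss, so the difference between `combined` and `simpo_only` is exactly the averaged BC penalty.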
The models we evaluated are listed below. AE2 denotes AlpacaEval 2; LC is the length-controlled win rate and WR is the raw win rate.
Models | Checkpoint | AE2 LC | AE2 WR |
---|---|---|---|
Llama3 Instruct 8B SimPO (reported) | princeton-nlp/Llama-3-Instruct-8B-SimPO | 44.7 | 40.5 |
Llama3 Instruct 8B SimPO (reproduced) | haoranxu/Llama-3-Instruct-8B-SimPO | 43.3 | 40.6 |
Llama3 Instruct 8B CPO | haoranxu/Llama-3-Instruct-8B-CPO | 36.07 | 40.06 |
Llama3 Instruct 8B CPO-SimPO | haoranxu/Llama-3-Instruct-8B-CPO-SimPO | 46.94 | 44.72 |
To train the CPO-SimPO model, run:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_cpo.py training_configs/llama-3-8b-instruct-cpo-simpo.yaml
For environment settings and evaluation steps, please refer to the original SimPO GitHub repository.
@inproceedings{
xu2024contrastive,
title={Contrastive Preference Optimization: Pushing the Boundaries of {LLM} Performance in Machine Translation},
author={Haoran Xu and Amr Sharaf and Yunmo Chen and Weiting Tan and Lingfeng Shen and Benjamin Van Durme and Kenton Murray and Young Jin Kim},
booktitle={Forty-first International Conference on Machine Learning},
year={2024},
url={https://openreview.net/forum?id=51iwkioZpn}
}
@article{meng2024simpo,
title={{SimPO}: Simple Preference Optimization with a Reference-Free Reward},
author={Meng, Yu and Xia, Mengzhou and Chen, Danqi},
year={2024}
}