
Diffusion Model Alignment Using Direct Preference Optimization #52

Open ChufanSuki opened 5 months ago

ChufanSuki commented 5 months ago

https://arxiv.org/abs/2311.12908

https://github.com/SalesforceAIResearch/DiffusionDPO

ChufanSuki commented 5 months ago

Background

Align LLM

LLMs are typically aligned to human preferences using supervised fine-tuning on demonstration data, followed by RLHF.

  1. training a reward function from comparison data on model outputs to represent human preferences
  2. using reinforcement learning to align the policy model with the learned reward

Previous methods:

Fine-tuning methods such as DPO [1] match RLHF in performance without explicit reinforcement learning.

Align Diffusion

ChufanSuki commented 5 months ago

DPO

Reward Modeling

We have no access to the latent reward model $r(\boldsymbol{c}, \boldsymbol{x}_0)$. Instead, we have access to ranked pairs generated from some conditioning $\boldsymbol{c}$: $\boldsymbol{x}_0^w \succ \boldsymbol{x}_0^l \mid \boldsymbol{c}$.

Bradley-Terry model:

$$ p_{\mathrm{BT}}\left(\boldsymbol{x}_0^w \succ \boldsymbol{x}_0^l \mid \boldsymbol{c}\right)=\sigma\left(r\left(\boldsymbol{c}, \boldsymbol{x}_0^w\right)-r\left(\boldsymbol{c}, \boldsymbol{x}_0^l\right)\right) $$

where $\sigma$ is the sigmoid function. The reward $r(\boldsymbol{c}, \boldsymbol{x}_0)$ is parameterized by a neural network $\phi$ and trained by maximum likelihood:

$$ L_{\mathrm{BT}}(\phi)=-\mathbb{E}_{\boldsymbol{c}, \boldsymbol{x}_0^w, \boldsymbol{x}_0^l}\left[\log \sigma\left(r_\phi\left(\boldsymbol{c}, \boldsymbol{x}_0^w\right)-r_\phi\left(\boldsymbol{c}, \boldsymbol{x}_0^l\right)\right)\right] $$

where prompt $\boldsymbol{c}$ and data pairs $\boldsymbol{x}_0^w, \boldsymbol{x}_0^l$ are from a static dataset with human-annotated labels.
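A minimal PyTorch sketch of this Bradley-Terry loss; the `reward_model(c, x0)` interface is an illustrative assumption, not the interface of the paper's code:

```python
import torch.nn.functional as F

def bt_loss(reward_model, c, x0_w, x0_l):
    """Bradley-Terry loss: -E[log sigma(r(c, x0_w) - r(c, x0_l))]."""
    r_w = reward_model(c, x0_w)  # scores of preferred samples, shape (B,)
    r_l = reward_model(c, x0_l)  # scores of dispreferred samples, shape (B,)
    # Maximize the log-likelihood that x0_w is preferred over x0_l
    return -F.logsigmoid(r_w - r_l).mean()
```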

RLHF

Optimize $p_\theta(\boldsymbol{x}_0 \mid \boldsymbol{c})$, $\boldsymbol{c} \sim \mathcal{D}_c$, such that the reward $r(\boldsymbol{c}, \boldsymbol{x}_0)$ is maximized, while regularizing the KL-divergence from a reference distribution $p_{\mathrm{ref}}$.
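Written out, this is the standard KL-regularized objective (as in [1]):

$$ \max_{p_\theta} \mathbb{E}_{\boldsymbol{c} \sim \mathcal{D}_c,\, \boldsymbol{x}_0 \sim p_\theta\left(\boldsymbol{x}_0 \mid \boldsymbol{c}\right)}\left[r\left(\boldsymbol{c}, \boldsymbol{x}_0\right)\right]-\beta\, \mathbb{D}_{\mathrm{KL}}\left[p_\theta\left(\boldsymbol{x}_0 \mid \boldsymbol{c}\right) \,\|\, p_{\mathrm{ref}}\left(\boldsymbol{x}_0 \mid \boldsymbol{c}\right)\right] $$

where $\beta$ controls the strength of the KL regularization.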

This objective has a unique global optimal solution $p_\theta^*$:

$$ p_\theta^*\left(\boldsymbol{x}_0 \mid \boldsymbol{c}\right)=p_{\text{ref}}\left(\boldsymbol{x}_0 \mid \boldsymbol{c}\right) \exp \left(r\left(\boldsymbol{c}, \boldsymbol{x}_0\right) / \beta\right) / Z(\boldsymbol{c}) $$

where $Z(\boldsymbol{c})=\sum_{\boldsymbol{x}_0} p_{\text{ref}}\left(\boldsymbol{x}_0 \mid \boldsymbol{c}\right) \exp \left(r\left(\boldsymbol{c}, \boldsymbol{x}_0\right) / \beta\right)$ is the partition function. Hence, the reward function is rewritten as

$$ r\left(\boldsymbol{c}, \boldsymbol{x}_0\right)=\beta \log \frac{p_\theta^*\left(\boldsymbol{x}_0 \mid \boldsymbol{c}\right)}{p_{\text{ref}}\left(\boldsymbol{x}_0 \mid \boldsymbol{c}\right)}+\beta \log Z(\boldsymbol{c}) $$
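Substituting this expression into the Bradley-Terry model, the intractable $\beta \log Z(\boldsymbol{c})$ term cancels in the reward difference:

$$ r\left(\boldsymbol{c}, \boldsymbol{x}_0^w\right)-r\left(\boldsymbol{c}, \boldsymbol{x}_0^l\right)=\beta \log \frac{p_\theta\left(\boldsymbol{x}_0^w \mid \boldsymbol{c}\right)}{p_{\mathrm{ref}}\left(\boldsymbol{x}_0^w \mid \boldsymbol{c}\right)}-\beta \log \frac{p_\theta\left(\boldsymbol{x}_0^l \mid \boldsymbol{c}\right)}{p_{\mathrm{ref}}\left(\boldsymbol{x}_0^l \mid \boldsymbol{c}\right)} $$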

so the maximum-likelihood objective becomes:

$$ L_{\mathrm{DPO}}(\theta)=-\mathbb{E}_{\boldsymbol{c}, \boldsymbol{x}_0^w, \boldsymbol{x}_0^l}\left[\log \sigma\left(\beta \log \frac{p_\theta\left(\boldsymbol{x}_0^w \mid \boldsymbol{c}\right)}{p_{\mathrm{ref}}\left(\boldsymbol{x}_0^w \mid \boldsymbol{c}\right)}-\beta \log \frac{p_\theta\left(\boldsymbol{x}_0^l \mid \boldsymbol{c}\right)}{p_{\mathrm{ref}}\left(\boldsymbol{x}_0^l \mid \boldsymbol{c}\right)}\right)\right] $$

By this reparameterization, instead of optimizing the reward function $r_\phi$ and then performing RL, [1] directly optimizes the conditional distribution $p_\theta\left(\boldsymbol{x}_0 \mid \boldsymbol{c}\right)$.
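A minimal PyTorch sketch of this loss, assuming per-sample log-likelihoods under the policy and the frozen reference model are available (illustrative names; in the diffusion setting of this paper, exact likelihoods are intractable and an ELBO-based per-timestep objective is used instead):

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, logp_ref_w, logp_ref_l, beta=0.1):
    """DPO loss from per-sample log-likelihoods, each of shape (B,).

    logp_w, logp_l:         log p_theta(x0^w | c), log p_theta(x0^l | c)
    logp_ref_w, logp_ref_l: the same quantities under the frozen reference model
    """
    # Implicit rewards beta * log(p_theta / p_ref); log Z(c) has already cancelled
    implicit_w = beta * (logp_w - logp_ref_w)
    implicit_l = beta * (logp_l - logp_ref_l)
    # -log sigma(difference of implicit rewards), averaged over the batch
    return -F.logsigmoid(implicit_w - implicit_l).mean()
```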

[1]: Direct preference optimization: Your language model is secretly a reward model