eric-mitchell / direct-preference-optimization

Reference implementation for DPO (Direct Preference Optimization)
Apache License 2.0

Training cost: RLHF vs DPO #55

Closed kartheekmedathati closed 10 months ago

kartheekmedathati commented 11 months ago

Hi! Could you share a plot or table that illustrates the training cost of DPO and contrasts it with RLHF for the tasks explored in the paper? The paper argues that DPO is computationally very simple, and I am curious about the compute gains obtained in practice.

eric-mitchell commented 10 months ago

Unfortunately we don't have an exact compute cost comparison on hand. But as a rough reference point, CarperAI's PPO GPT-J summarization run took ~25 hrs on 8 A100s (40GB), per its wandb log: https://wandb.ai/carperai/summarize_RLHF/runs/lv9es38t

Looking back at some of our own wandb logs, we ran DPO for about 12 hrs on 2 A100s (80GB). That works out to roughly 200 GPU-hours for PPO vs. 24 GPU-hours for DPO, though on different cards; counting each 80GB A100 as something like two 40GB A100s, DPO is very roughly a little more than 4x cheaper than PPO. A more comprehensive evaluation of compute costs would still be great to have!
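
For concreteness, here is a back-of-the-envelope version of that estimate as a minimal Python sketch. The 2x weighting of 80GB vs. 40GB A100s is an assumption made only to put the two runs on a common footing, not a measured throughput equivalence.

```python
# Rough GPU-hour comparison from the numbers quoted above.
# Assumption: one A100 80GB is counted as roughly two A100 40GB cards.

ppo_gpu_hours = 8 * 25   # CarperAI PPO run: 8x A100 40GB for ~25 hrs -> 200 GPU-hours
dpo_gpu_hours = 2 * 12   # DPO run: 2x A100 80GB for ~12 hrs -> 24 GPU-hours

# Normalize the DPO run to "40GB-equivalent" GPU-hours.
dpo_equiv_40gb_hours = dpo_gpu_hours * 2  # ~48 GPU-hours

print(f"PPO: {ppo_gpu_hours} GPU-hours (A100 40GB)")
print(f"DPO: {dpo_gpu_hours} GPU-hours (A100 80GB), "
      f"~{dpo_equiv_40gb_hours} normalized to 40GB cards")
print(f"Rough speedup: {ppo_gpu_hours / dpo_equiv_40gb_hours:.1f}x")  # ~4.2x
```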