Closed kartheekmedathati closed 10 months ago
Hi! Is it possible for you to share any plot or table that illustrates the training cost of DPO and contrasts it with RLHF for the tasks explored in the paper? The paper argues that DPO is computationally very simple; I am curious about the compute gains obtained.

Unfortunately, we don't have an exact compute-cost comparison on hand. But looking roughly at the wandb run for CarperAI's PPO GPT-J summarization run (~25 hrs on 8 A100 40GB GPUs): https://wandb.ai/carperai/summarize_RLHF/runs/lv9es38t

Looking back at some of our wandb logs, it looks like we ran DPO for about 12 hrs on 2 A100 80GB GPUs. In raw GPU-hours that is roughly 8 × 25 ≈ 200 for PPO vs. 2 × 12 = 24 for DPO (though on larger 80GB cards), so very roughly, DPO is probably a little more than 4x faster than PPO. A more comprehensive evaluation of compute costs would be great to have!
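For anyone who wants to redo the back-of-envelope comparison above, here is a minimal sketch. It only uses the GPU counts and wall-clock times reported in this thread; treating every GPU-hour as equal (i.e., ignoring the A100 40GB vs. 80GB difference) is an assumption, which is why the raw ratio comes out higher than the "~4x" quoted above.

```python
# Back-of-envelope GPU-hour comparison for the two runs discussed above.
# Assumption: cost ~ num_gpus * wall_clock_hours, ignoring the difference
# between A100 40GB and A100 80GB cards.

def gpu_hours(num_gpus: int, hours: float) -> float:
    """Total GPU-hours for a single training run."""
    return num_gpus * hours

ppo = gpu_hours(num_gpus=8, hours=25)  # CarperAI PPO GPT-J run: 8x A100 40GB
dpo = gpu_hours(num_gpus=2, hours=12)  # DPO run: 2x A100 80GB

print(f"PPO: {ppo:.0f} GPU-hours, DPO: {dpo:.0f} GPU-hours")
print(f"Raw GPU-hour ratio: {ppo / dpo:.1f}x")  # ~8.3x, before adjusting for card size
```

Halving the PPO GPU count to approximate 40GB cards as 80GB-equivalents brings the ratio down to roughly 4x, consistent with the estimate above.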