CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)
MIT License

PPO efficiency improvement by not recomputing the advantage in Trainer #202

Open lxuechen opened 1 year ago

lxuechen commented 1 year ago

🐛 Describe the bug

Currently, advantage estimation is performed in Trainer (see this). The per-step advantages, however, depend only on the rollout policy, so these quantities can be precomputed once during rollout instead of being recomputed on every PPO update.
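
For illustration, a minimal sketch of what precomputing GAE advantages at rollout time could look like (this is not trlx's actual code; the function name `compute_gae`, the tensor shapes, and the hyperparameter defaults are assumptions):

```python
# Sketch: precompute GAE advantages/returns when the rollout is collected,
# so the PPO update loop only reads stored values (illustrative, not trlx's API).
import torch

def compute_gae(rewards: torch.Tensor,
                values: torch.Tensor,
                gamma: float = 1.0,
                lam: float = 0.95):
    """Generalized Advantage Estimation over one rollout of length T.

    rewards: (T,) per-step rewards from the rollout policy
    values:  (T + 1,) value estimates, with a bootstrap value appended
    Returns (advantages, returns), each of shape (T,).
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    returns = advantages + values[:-1]
    return advantages, returns

# At rollout time, the resulting advantages/returns could be stored in the
# experience buffer alongside logprobs and values, avoiding recomputation
# during each PPO epoch.
```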

Which trlX version are you using?

No response

Additional system and package information

No response

LouisCastricato commented 1 year ago

Thanks! We will consider this for a future version of TRLX. If you want to make a PR, it would be an easy way to be added to the list of contributors :)

LouisCastricato commented 1 year ago

Actually, are you on the Discord? I've been meaning to reach out to you for a bit.

lxuechen commented 1 year ago

I have a Discord account (lxuechen), but by now it feels overly spammed by various group messages. Just chatting over this thread or by email could be faster. FWIW, I've been developing my own PPO training code for RLHF for a while, and we've recently scaled it to moderately sized models.

LouisCastricato commented 1 year ago

Cool! Yeah, more moderately sized RLHF libraries would be great for sure. We're trying to focus on ~100B sizes going forward, as that market is entirely untapped. Trying to merge something within the next week or so.