Open lxuechen opened 1 year ago
Thanks! We will consider this for a future version of TRLX. If you want to make a PR, it would be an easy way to be added to the list of contributors :)
Actually are you on the discord? I've been meaning to reach out to you for a bit.
I have a discord account (lxuechen), but by now it feels overly spammed by various group messages. Just chatting over this thread or by email could be faster. FWIW I've been developing my own PPO training code for RLHF for a while, and we've recently scaled it to moderately-sized models.
Cool! Yeah, more moderately sized RLHF libraries would be great for sure. We're trying to focus on ~100B sizes going forward, as that market is entirely untapped. Trying to merge something within the next week or so.
🐛 Describe the bug
Currently, advantage estimation is performed in the Trainer (see this). The per-step advantages, however, depend only on quantities produced by the rollout policy, so they can be computed once during rollout instead of being recomputed during every PPO update.
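To make the suggestion concrete, here is a minimal sketch of computing advantages once at rollout time and storing them alongside the rollout. It uses generalized advantage estimation (GAE) as an assumed estimator; the names `compute_gae` and `make_rollout`, the dict layout, and the default `gamma`/`lam` values are hypothetical and are not trlX's actual API. It also assumes the episode terminates after the last step, so the bootstrap value is 0.

```python
from typing import Dict, List, Tuple


def compute_gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Hypothetical helper: GAE over one trajectory.

    `values` are the value estimates recorded at rollout time, one per step.
    Assumes the episode ends after the last step (bootstrap value of 0).
    """
    advantages = [0.0] * len(rewards)
    last_adv = 0.0
    # Walk backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    # Returns serve as regression targets for the value head.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def make_rollout(rewards: List[float], values: List[float]) -> Dict[str, List[float]]:
    """Hypothetical rollout record: advantages are computed once and cached here."""
    advantages, returns = compute_gae(rewards, values)
    return {
        "rewards": rewards,
        "values": values,
        "advantages": advantages,
        "returns": returns,
    }
```

With this layout, each PPO update would read `rollout["advantages"]` and `rollout["returns"]` directly, since neither depends on the parameters being optimized during the update.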
Which trlX version are you using?
No response
Additional system and package information
No response