CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)
MIT License

Best-of-n sampling for rollouts #166

Open LouisCastricato opened 1 year ago

LouisCastricato commented 1 year ago

🚀 The feature, motivation, and pitch

Currently, PPO does not use best-of-n sampling for exploration. I think its inclusion would improve sample efficiency massively, especially because inference is relatively cheap compared to backward passes.

Hugging Face does not currently have best-of-n sampling built in, so we would need to add it as an option. I have best-of-n sampling implemented for chatbots; see here. It should be relatively easy to port over to trlx.
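For context, the core idea can be sketched in a few lines. This is a minimal illustration, not trlx or Hugging Face code: `generate` and `reward_fn` are hypothetical callables standing in for a sampled model generation and a reward model score, respectively.

```python
# Best-of-n sampling sketch: draw n candidate continuations for a prompt
# and keep the one the reward function scores highest. `generate` and
# `reward_fn` are placeholder callables, not part of any real API.

def best_of_n(prompt, generate, reward_fn, n=4):
    """Sample n continuations of `prompt` and return the highest-reward one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward_fn)


# Toy demonstration with a deterministic stand-in "generator" that cycles
# through canned outputs, and a reward that prefers longer strings.
outputs = iter(["x", "xxx", "xx"])
best = best_of_n("prompt: ", lambda p: next(outputs), len, n=3)
```

In a real rollout loop, `generate` would be one sampled call to the policy model and `reward_fn` a forward pass of the reward model; only the winning candidate is added to the PPO rollout buffer.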

Alternatives

No response

Additional context

honk

PhungVanDuy commented 1 year ago

@LouisCastricato this will be released with the same PR for OpenAI Summarize.