🚀 The feature, motivation, and pitch
Currently PPO does not use best-of-n sampling for exploration. I think its inclusion would substantially improve sample efficiency, especially because inference is relatively cheap compared to backward passes.
Currently Hugging Face does not have best-of-n sampling built in, so we would need to add it as an option. I have best-of-n sampling implemented for chatbots, see here. It should be relatively easy to port over to trlx.
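For concreteness, the idea can be sketched as: draw n candidate completions per prompt, score each with the reward model, and keep only the best one for the PPO rollout. Below is a minimal, library-free sketch with hypothetical `sample_fn` / `reward_fn` callables standing in for the policy's generate step and the reward model; this is an illustration of the shape of the feature, not trlx's actual API.

```python
import random

def best_of_n(sample_fn, reward_fn, prompt, n=4):
    """Draw n candidate completions for `prompt` and return the one
    with the highest reward. `sample_fn` and `reward_fn` are
    hypothetical stand-ins for the policy's sampler and the reward model."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=reward_fn)

# Toy stand-ins for demonstration: the "sampler" appends a random-length
# suffix, and the "reward" simply prefers longer completions.
random.seed(0)

def toy_sampler(prompt):
    return prompt + " " + "x" * random.randint(1, 10)

def toy_reward(text):
    return len(text)

best = best_of_n(toy_sampler, toy_reward, "hello", n=8)
```

With Hugging Face `generate`, the n candidates could come from a single call with `num_return_sequences=n` and sampling enabled, which is why the extra cost is mostly cheap inference rather than extra backward passes.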
Alternatives
No response
Additional context
honk