🚀 The feature, motivation, and pitch
Currently PPO does not use best-of-n sampling for exploration. I think its inclusion would substantially improve sample efficiency, especially because inference is relatively cheap compared to backward passes.
Currently Hugging Face does not have best-of-n sampling built in, so we would need to add it as an option. I have best-of-n sampling implemented for chatbots, see here. It should be relatively easy to port over to trlx.
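For concreteness, the idea can be sketched as: draw n candidate completions per prompt, score each with the reward model, and keep only the best one for the PPO rollout. Below is a minimal, library-free sketch with hypothetical `sample_fn` / `reward_fn` callables standing in for the policy's generate step and the reward model; this is an illustration of the shape of the feature, not trlx's actual API.

```python
import random

def best_of_n(sample_fn, reward_fn, prompt, n=4):
    """Draw n candidate completions for `prompt` and return the one
    with the highest reward. `sample_fn` and `reward_fn` are
    hypothetical stand-ins for the policy's sampler and the reward model."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=reward_fn)

# Toy stand-ins for demonstration: the "sampler" appends a random-length
# suffix, and the "reward" simply prefers longer completions.
random.seed(0)

def toy_sampler(prompt):
    return prompt + " " + "x" * random.randint(1, 10)

def toy_reward(text):
    return len(text)

best = best_of_n(toy_sampler, toy_reward, "hello", n=8)
```

With Hugging Face `generate`, the n candidates could come from a single call with `num_return_sequences=n` and sampling enabled, which is why the extra cost is mostly cheap inference rather than extra backward passes.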
Alternatives
No response
Additional context
honk