-
> We preprocess many open-source preference datasets into the standard format and upload them to the Hugging Face hub. You can find them [HERE](https://huggingface.co/collections/RLHFlow/standard-format…
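For context, a minimal sketch of how one of these standard-format datasets might be loaded with the `datasets` library. The dataset name below is a hypothetical placeholder, and the `chosen`/`rejected` fields are an assumption about the standard schema:

```python
from datasets import load_dataset

# Hypothetical dataset name; substitute one from the RLHFlow collection.
ds = load_dataset("RLHFlow/example-preference-dataset", split="train")

example = ds[0]
print(example["chosen"])    # assumed field: the preferred response/conversation
print(example["rejected"])  # assumed field: the dispreferred one
```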
-
**Problem:**
I have a previously trained model state-dict file, e.g., a reward model saved as `PATH/pytorch_model.bin`. When I try to reload it for further training with the ZeRO-3 optimizer, an error…
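One workaround, sketched below under the assumption that the checkpoint is a plain PyTorch `state_dict` rather than DeepSpeed's own checkpoint format, is to restore the weights into the unwrapped model before calling `deepspeed.initialize`, since ZeRO-3 only partitions the parameters after initialization:

```python
import torch
import deepspeed

model = build_model()  # hypothetical: construct the reward-model architecture

# Load the plain state dict on CPU and restore it *before* DeepSpeed wraps
# the model; after ZeRO-3 init the parameters are sharded across ranks.
state_dict = torch.load("PATH/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict)

# ds_config is assumed to be a DeepSpeed config (dict or path) with ZeRO stage 3.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```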
-
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_node
-
In many RL research fields, 'hard exploration' is a major problem: the agent needs to take many steps before it sees any reward, which in turn cripples its ability to learn efficiently. One …
-
Link to this in the contribution guidelines; spell out what is in scope, etc.
-
- [x] Figure out how (or if) they sample the variance of an ensemble of networks.
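For reference, a minimal sketch of one way this could be done (an assumption about what is meant, not necessarily the paper's method): run the input through every ensemble member and take the sample variance across members:

```python
import torch

def ensemble_variance(models, x):
    # Stack each member's prediction: shape (num_members, batch, ...).
    preds = torch.stack([m(x) for m in models], dim=0)
    # Unbiased sample variance across the ensemble dimension.
    return preds.var(dim=0, unbiased=True)
```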
-
```
epoch: 0|step: 259|ppo_ep: 1|act_loss: 0.0253753662109375|cri_loss: 0.2144775390625|unsuper_loss: 0.0
average reward score: 0.20556640625
-----------------------------------------------------------…
```
-
Hey Kevin,
I hope you are doing well. I noticed a small bug: the `step` function returns only `obs, reward, done, info` instead of `obs, reward, terminated, truncated, info`. I came across th…
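For illustration, a minimal shim that restores the five-tuple API (a sketch, not the actual fix): it splits `done` into `terminated`/`truncated` using the `TimeLimit.truncated` info key that older Gym versions set:

```python
import gym

class FiveTupleStep(gym.Wrapper):
    """Wrap an old-API env so step() returns (obs, reward, terminated, truncated, info)."""

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        truncated = bool(info.get("TimeLimit.truncated", False))
        terminated = done and not truncated
        return obs, reward, terminated, truncated, info
```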
-
Thank you for this great contribution; I'm sure it will help in developing RL summarization systems.
One thing I don't understand is how to interpret the values returned by the rewarder. I'd assume t…
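One common convention for Bradley-Terry-style reward models (an assumption here, not something this project confirms) is that the raw scores are unbounded logits, and only the *difference* between two scores for the same prompt is directly interpretable, as a preference probability:

```python
import torch

def preference_probability(score_a: float, score_b: float) -> float:
    # P(A preferred over B) = sigmoid(r_A - r_B) under a Bradley-Terry model.
    return torch.sigmoid(torch.tensor(score_a - score_b)).item()

print(preference_probability(0.21, -0.35))  # ≈ 0.64: A is mildly favored
```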
-