RLHFlow / Online-RLHF

A recipe for online RLHF.
https://rlhflow.github.io/

question about dpo dataset #12

Closed LiuChen19960902 closed 1 day ago

LiuChen19960902 commented 1 month ago

Hi, awesome work and thanks for open source!

In reading your article 'RLHF Workflow: From Reward Modeling to Online RLHF', Section 3 mentions: “Hybrid batch learning. We formulate a slightly more general framework to combine an initial offline dataset with online data collected during training.”

Does the mixing of the two types of data here refer to a hybrid method where you start from the SFT model p0, use p0 as the reference model, and include the 20k samples annotated in each iteration? Or does it refer to the corresponding solution mentioned in LLaMA2 to address the 'alignment tax' issue?

Thanks!

WeiXiongUST commented 1 month ago

Thanks for your kind words and interest in our project!

For this implementation, we do not actually train in a hybrid manner. We formulate the problem as hybrid learning because, in the ICML paper Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint, we want to establish a more general mathematical framework. To be more specific, when we train the model reported in RLHF Workflow, the data at each iteration is generated by the current model only, possibly with rejection sampling (i.e., best-of-n vs. worst-of-n).
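For illustration, here is a minimal sketch of what one iteration's data collection could look like under this scheme; `generate_responses` and `reward_score` are hypothetical callables standing in for a generator and a reward model, not this repo's actual entry points:

```python
# Build DPO preference pairs for one iteration using only the current policy:
# sample n responses per prompt, score them with a reward model, and keep the
# best-of-n as "chosen" and the worst-of-n as "rejected".
def build_iteration_pairs(prompts, generate_responses, reward_score, n=8):
    pairs = []
    for prompt in prompts:
        responses = generate_responses(prompt, num_samples=n)  # sampled from the current model
        scores = [reward_score(prompt, r) for r in responses]
        idx = range(len(responses))
        best = responses[max(idx, key=scores.__getitem__)]
        worst = responses[min(idx, key=scores.__getitem__)]
        pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs
```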

But you can definitely try to first warm up on an offline dataset. If you would like to use hybrid learning instead, you may run DPO on an offline dataset as your first iteration; see the sketch below.
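A rough sketch of that warm-up variant, with hypothetical helpers passed in as arguments (the actual training scripts in this repo are separate):

```python
# Hybrid-style training loop: the first DPO round runs on an offline preference
# dataset, subsequent rounds run on data collected from the current model.
def hybrid_training(sft_model, offline_dataset, run_dpo, collect_online_pairs,
                    num_online_iters=3):
    ref_model = sft_model                                    # pi_0 (SFT) as the reference model
    model = run_dpo(sft_model, ref_model, offline_dataset)   # iteration 0: offline warm-up
    for _ in range(num_online_iters):
        online_pairs = collect_online_pairs(model)           # new pairs from the current model only
        model = run_dpo(model, ref_model, online_pairs)
    return model
```

Whether the reference model stays fixed at pi_0 or is updated each iteration is a separate design choice.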

The LLaMA2 method is more like replay: at each iteration, you add some old data back into the training set.
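A toy sketch of that replay idea, to contrast it with the purely online scheme above (names are illustrative only):

```python
import random

# Replay-style mixing: combine the newest preference pairs with a random sample
# of pairs collected in earlier iterations before training on them.
def replay_mix(history, new_pairs, replay_fraction=0.5):
    old_pool = [p for past in history for p in past]
    k = int(replay_fraction * len(old_pool))
    replayed = random.sample(old_pool, k) if old_pool else []
    history.append(new_pairs)   # remember this iteration's data for future replay
    return new_pairs + replayed
```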