RLHFlow / Online-RLHF

A recipe for online RLHF.
https://rlhflow.github.io/

question about dpo dataset #12

Closed LiuChen19960902 closed 1 day ago

LiuChen19960902 commented 1 month ago

Hi, awesome work and thanks for open source!

In reading your article 'RLHF Workflow: From Reward Modeling to Online RLHF', Section 3 mentions: “Hybrid batch learning. We formulate a slightly more general framework to combine an initial offline dataset with online data collected during training.”

Does the mixing of the two types of data here refer to a hybrid method where you start from the SFT model p0, use p0 as the reference model, and include the 20k samples annotated in each iteration? Or does it refer to the corresponding solution mentioned in LLaMA2 to address the 'alignment tax' issue?

Thanks!

WeiXiongUST commented 1 month ago

Thanks for your kind words and interest in our project!

For this implementation, we do not actually train in a hybrid manner. We formulate the problem as hybrid learning because, in the ICML paper Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint, we want to establish a more general mathematical framework. To be more specific, when we train the model reported in RLHF Workflow, the data at each iteration is generated by the current model only, possibly with rejection sampling (i.e., best-of-n vs. worst-of-n).
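For illustration, here is a minimal sketch of what one iteration's data collection could look like under this scheme; `generate_responses` and `reward_score` are hypothetical callables standing in for a generator and a reward model, not this repo's actual entry points:

```python
# Build DPO preference pairs for one iteration using only the current policy:
# sample n responses per prompt, score them with a reward model, and keep the
# best-of-n as "chosen" and the worst-of-n as "rejected".
def build_iteration_pairs(prompts, generate_responses, reward_score, n=8):
    pairs = []
    for prompt in prompts:
        responses = generate_responses(prompt, num_samples=n)  # sampled from the current model
        scores = [reward_score(prompt, r) for r in responses]
        idx = range(len(responses))
        best = responses[max(idx, key=scores.__getitem__)]
        worst = responses[min(idx, key=scores.__getitem__)]
        pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs
```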

But you can definitely try to first warm up on an offline dataset. If you would like to use hybrid learning instead, you may run DPO on an offline dataset as your first iteration; see the sketch below.
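A rough sketch of that warm-up variant, with hypothetical helpers passed in as arguments (the actual training scripts in this repo are separate):

```python
# Hybrid-style training loop: the first DPO round runs on an offline preference
# dataset, subsequent rounds run on data collected from the current model.
def hybrid_training(sft_model, offline_dataset, run_dpo, collect_online_pairs,
                    num_online_iters=3):
    ref_model = sft_model                                    # pi_0 (SFT) as the reference model
    model = run_dpo(sft_model, ref_model, offline_dataset)   # iteration 0: offline warm-up
    for _ in range(num_online_iters):
        online_pairs = collect_online_pairs(model)           # new pairs from the current model only
        model = run_dpo(model, ref_model, online_pairs)
    return model
```

Whether the reference model stays fixed at pi_0 or is updated each iteration is a separate design choice.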

The LLaMA2 method is more like replay: at each iteration, you add some old data back into the training set.
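A toy sketch of that replay idea, to contrast it with the purely online scheme above (names are illustrative only):

```python
import random

# Replay-style mixing: combine the newest preference pairs with a random sample
# of pairs collected in earlier iterations before training on them.
def replay_mix(history, new_pairs, replay_fraction=0.5):
    old_pool = [p for past in history for p in past]
    k = int(replay_fraction * len(old_pool))
    replayed = random.sample(old_pool, k) if old_pool else []
    history.append(new_pairs)   # remember this iteration's data for future replay
    return new_pairs + replayed
```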