RLHFlow / Online-RLHF

A recipe for online RLHF.
https://rlhflow.github.io/

Iterative pipeline question #7

Open matouk98 opened 4 weeks ago

matouk98 commented 4 weeks ago

I have some questions about the iterative pipeline. Please correct me if my understanding is wrong, thank you so much!

From the report, \pi_0 should be the SFT policy trained on SFT-OpenHermes-2.5-Standard (LLaMA3-SFT, I guess?), and \pi_1 is the policy further trained with DPO on a historical dataset. Is that dataset iterative-prompt-v1-iter1-20K?

After we get \pi_1, we should use it to generate answers on iterative-prompt-v1-iter2-20K, label them with the reward model, and then use run_dpo to get \pi_2 (with the reference model still being the SFT model, but starting from \pi_1?). Thanks again!

WeiXiongUST commented 4 weeks ago

The SFT policy is LLaMA3-SFT, but it is trained on a mixture of open-source datasets, as introduced in the appendix. OpenHermes is also a good choice if you want to use it. We do not include the offline dataset, but use only the data generated by the model. Specifically (see the sketch after the list),

  1. with the SFT model pi0, we use iterative-prompt-v1-iter1-20K to generate 8 responses per prompt, use the RM to pick the best vs. the worst as a pair, and then run DPO on these collected pairs to get pi1;
  2. with pi1, we use iterative-prompt-v1-iter2-20K to generate 8 responses per prompt, use the RM to pick the best vs. the worst as a pair, and then run DPO on these collected pairs to get pi2 (we start from pi1, but still use pi0 as the reference model);
  3. ....
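
Roughly, each iteration looks like the sketch below. The helpers `generate_responses`, `score_with_rm`, and `run_dpo` are placeholder names for illustration, not the actual scripts in this repo:

```python
def one_iteration(policy, ref_policy, reward_model, prompts, n_samples=8):
    """One round of the online pipeline: sample, rank with the RM, run DPO."""
    pairs = []
    for prompt in prompts:
        # 1. Sample n responses per prompt from the current policy.
        responses = generate_responses(policy, prompt, n=n_samples)
        # 2. Score each response with the reward model.
        scores = [score_with_rm(reward_model, prompt, r) for r in responses]
        # 3. The best and the worst response form one preference pair.
        chosen = responses[scores.index(max(scores))]
        rejected = responses[scores.index(min(scores))]
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    # 4. DPO update: initialize from the current policy, but keep the
    #    original SFT model (pi0) as the reference model.
    return run_dpo(init_model=policy, ref_model=ref_policy, data=pairs)

# Iteration schedule:
#   pi1 = one_iteration(pi0, pi0, rm, prompts_iter1)
#   pi2 = one_iteration(pi1, pi0, rm, prompts_iter2)
#   ...
```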
matouk98 commented 4 weeks ago

Thanks a lot for the response! I just have some questions about the hyperparameters for DPO training, mainly some inconsistencies between the report and GitHub. According to the report, the learning rate is 5e-7 (2e-7 in the README) and the warmup ratio is 0.03 (0 in the code, but with 100 warmup steps). Also, the report does not mention the optimizer and weight_decay; should I just follow the default values in the code? Thanks again!

hendrydong commented 4 weeks ago

AdamW with a 5e-7 learning rate and cosine decay at a batch size of 128 are the hyperparameters we used (if you use a batch size of 32, 2e-7 might be more appropriate). weight_decay and warmup steps do not have a significant impact on the final performance.

You may go with the values suggested in the report.
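
For example, a minimal sketch of these settings using `transformers.TrainingArguments` (our run_dpo script may wire them up differently; the per-device/accumulation split below is just one way to reach an effective batch size of 128):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="dpo_iter1",            # hypothetical output path
    optim="adamw_torch",               # AdamW
    learning_rate=5e-7,                # report value; ~2e-7 if effective batch is 32
    lr_scheduler_type="cosine",        # cosine decay
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,     # e.g. 2 x 8 accumulation x 8 GPUs = 128
    warmup_steps=100,                  # or warmup_ratio=0.03; little effect either way
    weight_decay=0.0,                  # little effect on final performance
    num_train_epochs=1,
    bf16=True,
)
```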

LotuSrc commented 5 days ago

Great work. For each iteration, the number of training steps is about 150, given 20k samples with a batch size of 128, so three iterations mean the model is updated for about 450 steps in total. Is the comparison between DPO and online DPO done under the same/comparable number of update steps? I have tried some experiments on larger models, but online/iterative DPO drops performance compared with one-iteration DPO, so I want to figure out whether some hyperparameters make the difference.
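
For reference, the step count I am assuming (one epoch per iteration at an effective batch size of 128):

```python
samples_per_iter = 20_000
batch_size = 128

steps_per_iter = samples_per_iter // batch_size  # ~156 optimizer updates per iteration
total_steps = 3 * steps_per_iter                 # ~468 updates over three iterations
print(steps_per_iter, total_steps)
```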