matouk98 opened 4 weeks ago
The SFT policy is LLaMA3-SFT, but it is trained on a mixture of open-source datasets, as described in the appendix. OpenHermes is also a good choice if you want to use it. We do not include the offline dataset, but use only the data generated by the model. Specifically,
Thanks a lot for the response! I just have some questions about the hyperparameters in DPO training, basically some inconsistencies between the report and the GitHub repo. According to the report, the learning rate is 5e-7 (2e-7 in the README) and the warmup ratio is 0.03 (0 in the code, but with 100 warmup steps). Also, the report does not mention the optimizer or weight_decay; should I just follow the default values in the code? Thanks again!
AdamW with a 5e-7 learning rate, cosine decay, and a batch size of 128 are the hyperparameters we used. (If you use a batch size of 32, 2e-7 might be more appropriate.) weight_decay and warmup steps do not have a significant impact on the final performance.
You may choose the values suggested in the report.
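For reference, the settings from the answer above can be collected into a config sketch. The dict layout and the `sqrt_scaled_lr` helper are my own illustration (not the repo's actual launch code), and square-root LR scaling is just one common heuristic that happens to roughly reproduce the suggested 2e-7 at batch size 32:

```python
import math

# Hyperparameters from the answer above; the dict layout is illustrative,
# not the repo's actual launch config.
dpo_config = {
    "optimizer": "adamw",
    "learning_rate": 5e-7,          # 2e-7 suggested for batch size 32
    "lr_scheduler_type": "cosine",
    "global_batch_size": 128,
    "warmup_steps": 100,            # little effect on final performance
    "weight_decay": 0.0,            # little effect on final performance
}

def sqrt_scaled_lr(base_lr: float, base_bs: int, new_bs: int) -> float:
    """Square-root LR scaling heuristic (my assumption, not something
    the authors state): lr scales with sqrt(batch size)."""
    return base_lr * math.sqrt(new_bs / base_bs)

print(sqrt_scaled_lr(5e-7, 128, 32))  # 2.5e-7, close to the suggested 2e-7
```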
Great work. For each iteration, the number of training steps is about 150 (20k samples with batch size 128), so three iterations means the model is updated for about 450 steps in total. Is the comparison between DPO and online DPO done under the same/comparable number of update steps? I've tried some experiments on larger models; however, online DPO/iterative DPO drops performance compared with single-iteration DPO. So I want to figure out whether some hyperparameters make the difference.
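For concreteness, the step counts above follow from a quick calculation (assuming one optimizer step per global batch of 128, i.e. no extra gradient accumulation):

```python
samples_per_iter = 20_000
global_batch_size = 128

# One optimizer step per global batch of 128 samples.
steps_per_iter = samples_per_iter // global_batch_size  # 156, i.e. "about 150"
total_steps = 3 * steps_per_iter                        # 468, i.e. "about 450"
print(steps_per_iter, total_steps)
```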
I have some questions about the iterative pipeline. Please correct me if my understanding is wrong, thank you so much!
From the report, \pi_0 should be the SFT policy trained on SFT-OpenHermes-2.5-Standard (LLaMA3-SFT, I guess?), and \pi_1 is the policy further trained with DPO on a historical dataset. Is that dataset iterative-prompt-v1-iter1-20K?
After we get \pi_1, we should use it to generate answers on iterative-prompt-v1-iter2-20K, label them with the reward model, and then use run_dpo to get \pi_2 (with the reference model still the SFT model, but starting from \pi_1?). Thanks again!
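My understanding of the loop described above, as a minimal sketch. All function names (`generate`, `score`, `dpo_train`) and the toy demo are placeholders of mine, not the repo's API; whether `ref` should stay the SFT policy while each round initializes from the latest policy is exactly the question being asked:

```python
# Sketch of the iterative DPO pipeline as described in the question.
def iterative_dpo(sft_policy, prompt_sets, generate, score, dpo_train):
    """pi_0 = SFT policy; each round trains pi_{t+1} starting from pi_t,
    on preference pairs built from pi_t's own generations."""
    policy = sft_policy
    for prompts in prompt_sets:  # e.g. iter1-20K, iter2-20K, iter3-20K
        pairs = []
        for p in prompts:
            cands = generate(policy, p)                # sample responses
            ranked = sorted(cands, key=lambda c: score(p, c))
            pairs.append((p, ranked[-1], ranked[0]))   # (prompt, chosen, rejected)
        # Reference model kept as the SFT policy; training starts from
        # the current policy (the setup the question asks about).
        policy = dpo_train(policy, pairs, ref=sft_policy)
    return policy

# Toy demo: "policy" is just a counter incremented by each DPO round.
final = iterative_dpo(
    sft_policy=0,
    prompt_sets=[["q1", "q2"]] * 3,
    generate=lambda pol, p: [p + "!", p + "!!"],
    score=lambda p, c: len(c),
    dpo_train=lambda pol, pairs, ref: pol + 1,
)
print(final)  # 3 after three iterations
```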