hong-xl opened 4 weeks ago
For the first question, it is because DPO with Nectar performs better.
For the second one, if you can run DPO with llama3-70b, it would be OK. Our DPO implementation is largely built on the great work of Hugging Face TRL; you may refer to their repo for more information. In general, you should use gradient checkpointing and DeepSpeed stage 3. You may also need to register an API for inference (the default choice of this recipe) instead of launching 8 independent jobs for 8 models.
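For reference, a minimal configuration sketch of the memory-saving settings mentioned above, assuming a recent version of `trl` where `DPOConfig` subclasses `transformers.TrainingArguments` (the output path and the DeepSpeed JSON file name are hypothetical placeholders):

```python
from trl import DPOConfig  # DPOConfig exposes the standard TrainingArguments fields

# Sketch of memory-saving settings for DPO on a large model, not a full script.
training_args = DPOConfig(
    output_dir="./dpo-llama3-70b",      # hypothetical output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,     # recover a larger effective batch size
    gradient_checkpointing=True,        # trade recompute for activation memory
    bf16=True,
    deepspeed="ds_config_zero3.json",   # path to a DeepSpeed ZeRO stage-3 config
)
```

The ZeRO stage-3 config shards parameters, gradients, and optimizer states across GPUs, which is what makes a 70B model fit at all; gradient checkpointing then cuts the activation memory on top of that.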
Thanks for your response. Have you tried using all the data from Table 1 to train offline DPO? Would it result in better performance? Based on your experience, what kind of data is suitable for training DPO?
This is a good question. The dataset that is good for reward modeling may not be good for DPO.
For instance, we use HH-RLHF in reward modeling because it allows the model to evaluate multi-turn conversations. But it is well known (as verified by more than 10 research papers) that DPO trained on HH-RLHF is bad. Nectar re-labels these prompts using strong and diverse LLMs to generate the responses, which makes it more suitable for DPO training.
Thanks for the advice. Could you point me to some of the research papers discussing why "DPO trained on HH-RLHF is bad"? I am curious about the reason here.
see https://arxiv.org/pdf/2309.06657
Vanilla offline DPO largely depends on the data quality. On-policy sampling and online data annotation are the keys to success. This can be shown in a more rigorous way; see Theorem 2 and the related discussion in https://arxiv.org/pdf/2312.11456.
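To make the data dependence concrete, here is a minimal sketch of the per-pair DPO objective that the offline recipe optimizes. The loss only ever sees the fixed chosen/rejected pairs, which is why its quality is bounded by the dataset; the function names and toy log-probabilities below are illustrative, not from the repo:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid of the implicit reward margin.

    logp_w / logp_l: policy log-probs of the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the policy already prefers the chosen response slightly.
loss = dpo_loss(logp_w=-2.0, logp_l=-3.0, ref_logp_w=-2.5, ref_logp_l=-2.5)
```

With a zero margin the loss is exactly log 2, and it decreases only as the policy widens the gap on these fixed pairs; online RLHF instead keeps refreshing the pairs with on-policy samples and new annotations.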
Indeed, this is exactly why we wrote this paper: to call for online RLHF instead of distilling GPT-4 with offline DPO.
Hi, I have some questions about DPO:
Thanks for your assistance.