RLHFlow / Online-RLHF

A recipe for online RLHF.
https://rlhflow.github.io/

questions about dpo #8

Open hong-xl opened 4 weeks ago

hong-xl commented 4 weeks ago

Hi, I have some questions about DPO:

  1. Is there any reason why the Nectar dataset was chosen to train the offline vanilla DPO baseline rather than the same dataset used for iterative DPO, which would arguably make for a fairer comparison?
  2. Have you applied iterative DPO to Llama3-70B? If so, what specific details should be paid attention to?

Thanks for your assistance.

WeiXiongUST commented 4 weeks ago

For the first question, it is because DPO with Nectar performs better.

For the second one, if you can run DPO with Llama3-70B, it should be OK. Our DPO implementation is largely built on the great work of Hugging Face TRL; you may refer to their repo for more information. In general, you should use gradient checkpointing and DeepSpeed ZeRO stage 3. You may also need to serve the model behind an inference API (the default choice of this recipe) instead of using 8 independent jobs for 8 models.
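
For reference, here is a minimal sketch of what such a run could look like with trl's `DPOTrainer`. The model id, dataset name, DeepSpeed config path, and hyperparameters are placeholders, not the values used in this recipe, and the exact keyword arguments depend on your trl version.

```python
# Minimal offline-DPO sketch on a large model (all hyperparameters are illustrative).
# Gradient checkpointing is enabled via TrainingArguments; DeepSpeed ZeRO stage 3 is
# configured through the JSON file passed to `deepspeed=` (placeholder path).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "meta-llama/Meta-Llama-3-70B-Instruct"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Preference dataset with "prompt", "chosen", "rejected" columns (placeholder name).
train_dataset = load_dataset("your_org/your_preference_dataset", split="train")

training_args = TrainingArguments(
    output_dir="./dpo-llama3-70b",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,      # reduce activation memory
    bf16=True,
    deepspeed="configs/zero3.json",   # ZeRO stage 3 config (placeholder path)
    learning_rate=5e-7,
    num_train_epochs=1,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,   # trl creates a frozen copy of the policy as the reference model
    args=training_args,
    beta=0.1,         # DPO temperature (moves into DPOConfig in newer trl versions)
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```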

hong-xl commented 4 weeks ago

Thanks for your response. Have you tried using all the data from Table 1 to train offline DPO? Would it result in better performance? Based on your experience, what kind of data is suitable for training DPO?

WeiXiongUST commented 4 weeks ago

This is a good question. A dataset that is good for reward modeling may not be good for DPO.

For instance, we use HH-RLHF in reward modeling because it allows the model to evaluate multi-turn conversations. But it is well known (as verified by more than 10 research papers) that DPO trained on HH-RLHF performs poorly. Nectar re-labels these prompts using strong and diverse LLMs to generate the responses, which makes it more suitable for DPO training.

loss4Wang commented 2 weeks ago

> This is a good question. A dataset that is good for reward modeling may not be good for DPO.
>
> For instance, we use HH-RLHF in reward modeling because it allows the model to evaluate multi-turn conversations. But it is well known (as verified by more than 10 research papers) that DPO trained on HH-RLHF performs poorly. Nectar re-labels these prompts using strong and diverse LLMs to generate the responses, which makes it more suitable for DPO training.

Thanks for the advice. Could you point to some of the research papers discussing why "DPO trained on HH-RLHF is bad"? I am curious about the reason.

WeiXiongUST commented 2 weeks ago

See https://arxiv.org/pdf/2309.06657

Vanilla offline DPO depends heavily on the quality of the preference data. On-policy sampling and online data annotation are the keys to its success. This can be shown in a more rigorous way; see Theorem 2 and the related discussion in https://arxiv.org/pdf/2312.11456.
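
To make the contrast with offline DPO concrete, here is a high-level sketch of the iterative loop under stated assumptions: the helpers `generate_responses`, `score_with_reward_model`, and `run_dpo_step` are illustrative placeholders rather than this repo's actual API, and pairing the best-scored response against the worst is just one common choice.

```python
# Hedged sketch of online/iterative DPO: sample on-policy, annotate online, update.
def iterative_dpo(policy, reward_model, prompts, num_iterations=3, n_samples=8):
    for t in range(num_iterations):
        preference_pairs = []
        for prompt in prompts:
            # 1. On-policy sampling: draw candidate responses from the *current* policy.
            responses = generate_responses(policy, prompt, n=n_samples)
            # 2. Online annotation: score the candidates with the reward model.
            scores = [score_with_reward_model(reward_model, prompt, r) for r in responses]
            best = max(range(len(responses)), key=lambda i: scores[i])
            worst = min(range(len(responses)), key=lambda i: scores[i])
            preference_pairs.append({
                "prompt": prompt,
                "chosen": responses[best],
                "rejected": responses[worst],
            })
        # 3. DPO update on the freshly collected, on-policy preference data.
        policy = run_dpo_step(policy, preference_pairs)
    return policy
```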

Indeed, this is exactly why we wrote this paper: to call for online RLHF instead of distilling GPT-4 with offline DPO.