eric-mitchell / direct-preference-optimization

Reference implementation for DPO (Direct Preference Optimization)
Apache License 2.0

Question about fine-tuning steps (epochs) #58

Closed: gyuwon12 closed this issue 9 months ago

gyuwon12 commented 9 months ago

Hello, I'm planning to train an SFT model via instruction tuning and then fine-tune it with DPO. Looking at various examples (https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama_2/scripts/README.md) and papers, the number of fine-tuning steps varies widely, from just a few to several hundred. Is there no performance benefit to merging steps 2 and 3?

Thank you.

eric-mitchell commented 9 months ago

Are you asking if we can merge the SFT and DPO steps? Typically doing SFT first is helpful: DPO only learns the "delta" between the chosen and rejected examples, so if there are (useful) behaviors that appear in both, DPO won't learn them; that's why we need SFT first. Feel free to re-open if this doesn't answer your question.
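
To make the "delta" point concrete, here is a minimal sketch of the standard DPO objective in PyTorch (illustrative only, not necessarily the exact code in this repo). The inputs are per-sequence summed log-probabilities of the chosen and rejected completions under the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on per-sequence log-probabilities."""
    # The loss only sees the *difference* between the chosen and rejected
    # log-ratios -- the "delta" above. Anything the two completions share
    # cancels out, which is why behaviors common to both (e.g. basic
    # instruction following) must still be learned during SFT.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```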

OnceJune commented 5 months ago

@eric-mitchell Hi, from the code it looks like the chosen response comes from an fp32 forward pass and the rejected response comes from an fp16 forward pass. So I wouldn't need a human-labeled dataset; instead, running the forward pass twice (once in fp32 and once in fp16) would be enough. Is that right?