Hello Team!
I am wondering whether the DPO loss formula in your paper is wrong.
(referring to your paper, "Direct Preference Optimization: Your Language Model is Secretly a Reward Model")
When I expand them, the two formulas are not the same. Besides, I think the reference model is designed to constrain the base model, which does not seem to match the formula written in the paper.
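To make sure we are comparing the same thing, here is the loss as I read it from Eq. (7) of the paper:

L_DPO(pi_theta; pi_ref) = -E_{(x, y_w, y_l) ~ D} [ log sigma( beta * log( pi_theta(y_w|x) / pi_ref(y_w|x) ) - beta * log( pi_theta(y_l|x) / pi_ref(y_l|x) ) ) ]

i.e. in LaTeX:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]

And here is a small sketch of how I would compute it from per-sequence log-probabilities, so you can point to exactly where my reading diverges. The function and variable names (and the beta default) are my own assumptions, not taken from your reference implementation:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """My reading of Eq. (7): -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    All inputs are per-sequence summed log-probs of shape (batch,).
    """
    # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```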
Thank you in advance for your time and assistance. I am looking forward to your insights on this matter.