eric-mitchell / direct-preference-optimization

Reference implementation for DPO (Direct Preference Optimization)
Apache License 2.0

Possible Inconsistency (Possible Typo) in Gradient Definition Between Eq. 7 and Appendix A.4 of the DPO Paper #60

Closed rustic-snob closed 9 months ago

rustic-snob commented 9 months ago

Hi,

I am Jaewon Cheon, currently delving into NLP and LLM studies in South Korea. I would like to begin by extending my heartfelt appreciation for your pioneering work on DPO. It has profoundly impacted the academic sphere and offered an efficient method for many practitioners to locally tune LLMs to their preferences, circumventing the often arduous task of RL training and its complex prerequisites.

During my thorough examination of your paper, I believe I have stumbled upon a potential oversight concerning the notation in the gradient definition.

In Appendix A.4, specifically Eq. 21, the order of the terms $y_w$ and $y_l$ appears inverted relative to Eq. 7 and the subsequent gradient discussion in the main text. While Eq. 7 assigns a positive sign to the $y_w$ term and a negative sign to the $y_l$ term, this order is reversed both in the definition of the function $u$ and in the final gradient in Appendix A.4. The discrepancy could confuse readers who implement the DPO algorithm from the appendix derivation.
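For concreteness, this is the sign convention I mean, restating Eq. 7 and the gradient from the main text's discussion as I read them (using the paper's shorthand $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$; please correct me if I am misquoting):

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

$$\nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\big[\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big]\Big]$$

Here $y_w$ enters with the positive sign in both places, which is the ordering that Appendix A.4 appears to flip.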

Although this may be a minor detail, I thought it prudent to raise it to your attention for clarification, ensuring the accuracy and clarity of the paper's methodology.
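For what it's worth, here is a minimal autograd sketch I wrote to check the behavior the Eq. 7 ordering implies; the function name, tensor names, and $\beta$ value are my own illustrative choices, not taken from this repo's code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Per-example DPO loss under the Eq. 7 sign convention.

    Each argument is a (batch,) tensor of summed per-token log-probs.
    """
    # Implicit rewards r_hat(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # L = -log sigmoid(r_hat(x, y_w) - r_hat(x, y_l)): y_w positive, y_l negative
    return -F.logsigmoid(chosen_rewards - rejected_rewards)

# Toy check: the gradient raises pi(y_w | x) and lowers pi(y_l | x).
pol_w = torch.tensor([-10.0], requires_grad=True)
pol_l = torch.tensor([-12.0], requires_grad=True)
ref_w = torch.tensor([-10.5])
ref_l = torch.tensor([-11.5])
dpo_loss(pol_w, pol_l, ref_w, ref_l).sum().backward()
assert pol_w.grad.item() < 0  # a descent step increases log pi(y_w | x)
assert pol_l.grad.item() > 0  # a descent step decreases log pi(y_l | x)
```

Since autograd differentiates the loss directly, the appendix ordering would only affect the written derivation, not models trained with an implementation of Eq. 7.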

Thank you once again for your remarkable contribution to the field.

Kind regards, Jaewon Cheon

eric-mitchell commented 9 months ago

Thanks for pointing this out, Jaewon! I believe an earlier version of the paper had this reversed ordering in the main text as well, and when we fixed it there we forgot to also fix it in Appendix A.4. We'll update it in the next revision of the paper.