eric-mitchell / direct-preference-optimization

Reference implementation for DPO (Direct Preference Optimization)
Apache License 2.0

Possible Inconsistency (Possibly a Typo) in the Gradient Definition Between Eq. 7 and Appendix A.4 of the DPO Paper #60

Closed rustic-snob closed 10 months ago

rustic-snob commented 11 months ago

Hi,

I am Jaewon Cheon, currently delving into NLP and LLM studies in South Korea. I would like to begin by extending my heartfelt appreciation for your pioneering work on DPO. It has profoundly impacted the academic sphere and offered an efficient method for many practitioners to locally tune LLMs to their preferences, circumventing the often arduous task of RL training and its complex prerequisites.

During my thorough examination of your paper, I believe I have stumbled upon a potential oversight concerning the notation in the gradient definition.

In Appendix A.4, specifically Eq. 21, there seems to be an inversion in the order of the terms $y_w$ and $y_l$ when compared to Eq. 7 and the subsequent gradient discussion in the main text. While Eq. 7 assigns a positive sign to $y_w$ and a negative sign to $y_l$ (see the equation reproduced below), this order appears reversed in the definition of the function $u$ and in the final gradient of Appendix A.4. This discrepancy might lead to confusion when implementing the DPO algorithm.
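For reference, here is Eq. 7 as I read it in the current main text (my own transcription, so please correct me if I have copied it wrong):

$$
\nabla_\theta \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big) \big[ \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big] \Big], \qquad \hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}.
$$

Here $y_w$ carries the positive sign and $y_l$ the negative sign in the bracketed log-probability terms, which is the order that Eq. 21 appears to flip.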

Although this may be a minor detail, I thought it prudent to raise it to your attention for clarification, ensuring the accuracy and clarity of the paper's methodology.
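In case it helps make the sign convention concrete, here is a minimal PyTorch sketch of the per-example loss that Eq. 7's convention implies; the function and variable names are my own illustration, not code from this repository:

```python
import torch
import torch.nn.functional as F

def dpo_loss_sketch(policy_chosen_logps: torch.Tensor,
                    policy_rejected_logps: torch.Tensor,
                    ref_chosen_logps: torch.Tensor,
                    ref_rejected_logps: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """Per-example DPO loss under Eq. 7's convention: the chosen
    response y_w enters with a positive sign and the rejected
    response y_l with a negative sign inside the sigmoid."""
    # Implicit reward estimates: r_hat(x, y) = beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # loss = -log sigma(r_hat(x, y_w) - r_hat(x, y_l));
    # swapping y_w and y_l here would negate the sigmoid's argument.
    return -F.logsigmoid(chosen_rewards - rejected_rewards)
```

If one instead followed the reversed order of Appendix A.4 literally, the argument of the sigmoid would be negated, which is exactly the kind of confusion I was worried about.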

Thank you once again for your remarkable contribution to the field.

Kind regards, Jaewon Cheon

eric-mitchell commented 10 months ago

Thanks for pointing this out, Jaewon! I believe we had this reversal in the main text of the paper in an earlier version, and forgot to also fix it in the appendix when we fixed it in the main text. We'll update it in the next revision of the paper.