Yifan-Song793 / ETO

Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference)
https://arxiv.org/abs/2403.02502

DPO formula question #9

Closed nighty8 closed 3 months ago

nighty8 commented 3 months ago

Hello Team!

I am wondering if the DPO loss formula is wrong in your paper.

[screenshot: the DPO loss formula as written in the ETO paper]

[screenshot: the DPO loss formula from "Direct Preference Optimization: Your Language Model is Secretly a Reward Model"]

The two formulas are different when expanded. Besides, I think the reference model is designed to constrain the base model, which does not match the role it plays in the formula written in your paper.
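
For reference, here is the DPO objective as I read it from the original DPO paper:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

The frozen reference policy $\pi_{\mathrm{ref}}$ only enters through the log-ratios, acting as an implicit KL regularizer that keeps $\pi_\theta$ close to it. A minimal PyTorch sketch of that loss, assuming the per-sequence log-probabilities have already been summed over tokens (the function and argument names here are just illustrative, not from the ETO codebase):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # beta * log(pi_theta / pi_ref) for the chosen and rejected responses
    chosen_logratios = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_logratios = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid of the margin; minimizing this pushes the policy to
    # prefer the chosen response relative to the frozen reference model
    return -F.logsigmoid(chosen_logratios - rejected_logratios).mean()
```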

Thank you in advance for your time and assistance. I look forward to your insights on this matter.

nighty8 commented 3 months ago

I made a big mistake. Sorry again for troubling you.