eric-mitchell / direct-preference-optimization

Reference implementation for DPO (Direct Preference Optimization)
Apache License 2.0
2.06k stars 167 forks

What's the reference policy of Preferred-FT in Figure 2? #70

Open zetian1025 opened 7 months ago

zetian1025 commented 7 months ago

First of all, thanks for your great work! I'm confused about one point: you said you use the SFT model (or the Preferred-FT model) as the reference policy when running DPO training. But for the Preferred-FT baseline in Figure 2, what is its reference policy? In other words, how is the KL divergence computed for it, and is the reference policy the same one used for the other methods?
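For context on the question, a common way to produce the KL values plotted in figures like this is a Monte Carlo estimate of the sequence-level KL(π || π_ref) from completions sampled from the policy. This is only a sketch of that general technique, not the repository's actual evaluation code; `sequence_kl_estimate` is a hypothetical helper, and which model serves as π_ref for Preferred-FT is exactly what the issue is asking.

```python
def sequence_kl_estimate(policy_logps, ref_logps):
    """Monte Carlo estimate of KL(pi || pi_ref) from samples drawn from pi.

    policy_logps / ref_logps: one inner list of per-token log-probabilities
    per sampled completion, scored under the policy and under the reference
    model respectively. The estimate is the mean over samples of
    sum_t [log pi(y_t | x, y_<t) - log pi_ref(y_t | x, y_<t)].
    """
    diffs = [sum(lp) - sum(lr) for lp, lr in zip(policy_logps, ref_logps)]
    return sum(diffs) / len(diffs)

# Toy example: two sampled completions, three tokens each
# (log-probs here are made up for illustration).
policy = [[-0.5, -0.7, -0.2], [-1.0, -0.3, -0.4]]
ref = [[-0.9, -0.8, -0.6], [-1.2, -0.5, -0.9]]
print(sequence_kl_estimate(policy, ref))
```

Note that nothing in this estimator requires the policy to have been trained with DPO, so it can be applied to a Preferred-FT model as well, provided one decides which fixed model to score as π_ref.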