First of all, thanks for your great work!
I'm confused: you said you use the SFT model or the Preferred-FT model as the reference policy when running DPO training.
But for the Preferred-FT curve in Figure 2, what is its reference policy? In other words, how is the KL divergence computed, and is the reference policy consistent across the methods being compared?
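To make the question concrete, here is a rough sketch (my own illustration, not code from the paper) of how I understand the KL term in such plots is typically estimated: a Monte-Carlo estimate of KL(pi || pi_ref) from the log-probabilities both models assign to completions sampled from the policy. The function and array shapes below are assumptions for illustration only.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def sequence_kl(policy_logits, ref_logits, token_ids, mask):
    """Monte-Carlo estimate of KL(pi || pi_ref) over sampled completions.

    policy_logits, ref_logits: (batch, seq_len, vocab) logits from each model
    token_ids: (batch, seq_len) tokens sampled from the policy pi
    mask: (batch, seq_len) 1 for completion tokens, 0 for prompt/padding
    """
    lp = np.take_along_axis(
        log_softmax(policy_logits), token_ids[..., None], axis=-1)[..., 0]
    ref_lp = np.take_along_axis(
        log_softmax(ref_logits), token_ids[..., None], axis=-1)[..., 0]
    # E_pi[log pi - log pi_ref], summed over completion tokens, batch-averaged
    return float(((lp - ref_lp) * mask).sum(axis=-1).mean())
```

If this is roughly what is done, my question is which model plays the role of `ref_logits` for the Preferred-FT curve, since Preferred-FT itself is not trained against a reference.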