NJUNLP / MAPO

Implementation of the ACL 2024 paper "MAPO: Advancing Multilingual Reasoning through Multilingual Alignment-as-Preference Optimization"

DPO degeneration problem #3

Closed WJMacro closed 4 months ago

WJMacro commented 4 months ago

Hello! Thank you for your work; I have some technical issues I'd like to discuss with you. I noticed that you mentioned encountering a problem where the model repeatedly generated the same token after DPO training. We are experiencing a similar issue in our current experiments. Could you share how you resolved this issue?

Ricardokevins commented 4 months ago


  1. You can put the bad sampled outputs that show repetition into the "rejected" set to penalize this behavior
  2. You can also try lowering the learning rate
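A minimal sketch of the first suggestion: flag degenerate (repetition-collapsed) samples with a simple heuristic and route them into the "rejected" slot of a preference pair. The function and field names here are illustrative, not the repository's actual code.

```python
from collections import Counter

def is_degenerate(token_ids, max_repeat_ratio=0.5, min_len=8):
    """Heuristic: flag an output whose single most frequent token
    dominates the sequence, a common sign of repetition collapse."""
    if len(token_ids) < min_len:
        return False
    most_common_count = Counter(token_ids).most_common(1)[0][1]
    return most_common_count / len(token_ids) > max_repeat_ratio

def build_preference_pairs(prompt, samples, reference_answer):
    """Put degenerate samples into 'rejected' so DPO penalizes
    the repetition behavior (schema is hypothetical)."""
    pairs = []
    for sample in samples:
        if is_degenerate(sample):
            pairs.append({"prompt": prompt,
                          "chosen": reference_answer,
                          "rejected": sample})
    return pairs
```

In practice you would run this filter over the sampled generations and keep the resulting pairs alongside your normal preference data.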

Recently, I have also tried mixing the DPO loss with the SFT loss, which may help produce a more stable training result.
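The mixing idea above can be sketched as a weighted sum of the standard DPO objective and an SFT negative log-likelihood term on the chosen response. This is a scalar, per-example illustration with hypothetical hyperparameter names (`beta`, `sft_weight`), not the repository's training code.

```python
import math

def dpo_sft_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp,
                 sft_nll, beta=0.1, sft_weight=1.0):
    """Combine the DPO loss with an SFT (NLL) regularizer:
    loss = -log sigmoid(beta * (margin_chosen - margin_rejected))
           + sft_weight * NLL(chosen)."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    dpo = -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)
    return dpo + sft_weight * sft_nll
```

Keeping the SFT term anchors the policy to fluent outputs while DPO shapes the preference margin, which is one plausible reason this stabilizes training.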