NJUNLP / MAPO

Implementation of the ACL 2024 paper "MAPO: Advancing Multilingual Reasoning through Multilingual Alignment-as-Preference Optimization"

DPO degeneration problem #3

Closed WJMacro closed 4 months ago

WJMacro commented 4 months ago

Hello! Thank you for your work; I have some technical issues I'd like to discuss with you. I noticed that you mentioned encountering a problem where the model repeatedly generated the same token after DPO training. We are experiencing a similar issue in our current experiments. Could you share how you resolved this issue?

Ricardokevins commented 4 months ago


  1. You can put the bad sampled outputs that show repetition into the "rejected" set to penalize this behavior
  2. You can also try lowering the learning rate
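A minimal sketch of the first suggestion: flag degenerate (repetition-collapsed) samples with a simple heuristic and route them into the "rejected" slot of a preference pair. The function and field names here are illustrative, not the repository's actual code.

```python
from collections import Counter

def is_degenerate(token_ids, max_repeat_ratio=0.5, min_len=8):
    """Heuristic: flag an output whose single most frequent token
    dominates the sequence, a common sign of repetition collapse."""
    if len(token_ids) < min_len:
        return False
    most_common_count = Counter(token_ids).most_common(1)[0][1]
    return most_common_count / len(token_ids) > max_repeat_ratio

def build_preference_pairs(prompt, samples, reference_answer):
    """Put degenerate samples into 'rejected' so DPO penalizes
    the repetition behavior (schema is hypothetical)."""
    pairs = []
    for sample in samples:
        if is_degenerate(sample):
            pairs.append({"prompt": prompt,
                          "chosen": reference_answer,
                          "rejected": sample})
    return pairs
```

In practice you would run this filter over the sampled generations and keep the resulting pairs alongside your normal preference data.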

Recently, I have also tried mixing the DPO loss with the SFT loss, which may help produce a more stable training result.
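The mixing idea above can be sketched as a weighted sum of the standard DPO objective and an SFT negative log-likelihood term on the chosen response. This is a scalar, per-example illustration with hypothetical hyperparameter names (`beta`, `sft_weight`), not the repository's training code.

```python
import math

def dpo_sft_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp,
                 sft_nll, beta=0.1, sft_weight=1.0):
    """Combine the DPO loss with an SFT (NLL) regularizer:
    loss = -log sigmoid(beta * (margin_chosen - margin_rejected))
           + sft_weight * NLL(chosen)."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    dpo = -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)
    return dpo + sft_weight * sft_nll
```

Keeping the SFT term anchors the policy to fluent outputs while DPO shapes the preference margin, which is one plausible reason this stabilizes training.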