eric-mitchell / direct-preference-optimization

Reference implementation for DPO (Direct Preference Optimization)
Apache License 2.0

Weird logits and model degeneration while training DPO #77

Open DungNasSa10 opened 7 months ago

DungNasSa10 commented 7 months ago

Recently, I have been experimenting with DPO training for Vietnamese. I start from a strong SFT model, vinai/PhoGPT-4B-Chat, and follow the method described in Chen, Zixiang, et al., "Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models" (arXiv preprint arXiv:2401.01335, 2024) to build a preference dataset from my own SFT dataset. I use trl for training with the config:
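The config itself is not shown above. As a rough illustration only, a minimal trl DPOTrainer sketch for a setup like this might look as follows; every value, path, and dataset entry here is an assumption for illustration, not the actual configuration used, and the exact argument names vary across trl versions (newer releases move beta and the length limits into a DPOConfig):

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# Policy and frozen reference model both start from the SFT checkpoint.
model = AutoModelForCausalLM.from_pretrained("vinai/PhoGPT-4B-Chat", trust_remote_code=True)
ref_model = AutoModelForCausalLM.from_pretrained("vinai/PhoGPT-4B-Chat", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("vinai/PhoGPT-4B-Chat", trust_remote_code=True)

# Preference data: each row pairs a prompt with a preferred and a rejected response.
train_dataset = Dataset.from_dict({
    "prompt": ["..."],
    "chosen": ["..."],
    "rejected": ["..."],
})

training_args = TrainingArguments(
    output_dir="dpo-phogpt",          # hypothetical output path
    per_device_train_batch_size=2,
    learning_rate=5e-7,               # DPO is typically run with a very small LR
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    beta=0.1,                         # strength of the implicit KL penalty; 0.1 is a common default
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()
```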

Do you have any suggestions for this problem?

AGTSAAA commented 6 months ago

Hi, did you solve the problem?

ggoggam commented 5 months ago

This seems to be a problem with DeepSpeed ZeRO 3. If I use FSDP, everything works fine.

I tried using torch's AdamW instead of DeepSpeed's FusedAdam, but the problem persists.
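For reference, a sketch of how one might attempt that optimizer swap with the HF Trainer integration, assuming the standard transformers/DeepSpeed setup (the paths and values below are hypothetical):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="dpo-debug",       # hypothetical output path
    optim="adamw_torch",          # ask the HF Trainer to build torch.optim.AdamW itself
    deepspeed="ds_zero3.json",    # hypothetical ZeRO-3 config; its "optimizer" section would
                                  # need to be removed so DeepSpeed does not inject FusedAdam
    bf16=True,
)
```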