Recently, I have been experimenting with DPO training for Vietnamese. I start from a strong SFT model, vinai/PhoGPT-4B-Chat, and follow the method described in Chen et al., "Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models" (arXiv:2401.01335, 2024) to build a preference dataset from my own SFT dataset: the ground-truth SFT responses serve as the chosen answers and the SFT model's own generations serve as the rejected answers. I use trl for training with the config:

bf16
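For context, here is a minimal sketch of the training setup, assuming a recent trl version. The only value quoted from my actual config above is bf16; all other hyperparameters, the dataset file name spin_pairs.jsonl, and the output directory are illustrative placeholders rather than my real settings.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "vinai/PhoGPT-4B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
ref_model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# SPIN-style preference data built from the SFT set:
# "chosen" = ground-truth SFT response, "rejected" = the SFT model's own generation.
# Each row: {"prompt": ..., "chosen": ..., "rejected": ...}
train_dataset = load_dataset("json", data_files="spin_pairs.jsonl", split="train")

training_args = DPOConfig(
    output_dir="phogpt-4b-chat-spin-iter0",  # placeholder
    per_device_train_batch_size=2,           # placeholder
    gradient_accumulation_steps=8,           # placeholder
    num_train_epochs=1,
    learning_rate=5e-7,                      # placeholder
    beta=0.1,                                # placeholder
    max_length=1024,                         # placeholder
    max_prompt_length=512,                   # placeholder
    bf16=True,                               # the only setting quoted from my actual config
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` on older trl versions
)
trainer.train()
```

Since the pairs come directly from the SFT ground truth and the SFT model's generations, the dataset only needs the three columns shown above.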
During training, the loss decreases very fast, but after the first epoch the logits of both the chosen and the rejected responses drop to 0 and the model suffers from degeneration (it only generates the repeated character `).
Here are the full logs of the training process and a sample output of the model; you can read more in the column "PhoGPT-4B-Chat-SPIN-0-4K-one-turn-ep1" of the attached Google Sheet:
Do you have any suggestions for this problem?