Hi! I'm training a DPO model on my dataset, but the reward is always 0. So I cloned the OpenRLHF repo again without modifying any source code; the only thing I changed was train_dpo.sh (the demo script). The reward was still always 0.
How can I solve this? Thanks!
```bash
set -x

deepspeed \
    --include=localhost:0,1,2,3,4,5,6,7 \
    ../train_dpo.py \
    --save_path ./checkpoint/llama3-8b-dpo \
    --save_steps -1 \
    --logging_steps 1 \
    --eval_steps -1 \
    --train_batch_size 256 \
    --micro_train_batch_size 1 \
    --pretrain mistralai/Mistral-7B-v0.3 \
    --bf16 \
    --max_epochs 1 \
    --max_len 8192 \
    --zero_stage 3 \
    --learning_rate 9e-6 \
    --beta 0.1 \
    --dataset OpenLLMAI/preference_dataset_mixture2_and_safe_pku \
    --apply_chat_template \
    --chosen_key chosen \
    --rejected_key rejected \
    --gradient_checkpointing
```
I don't know why, but the reward is zero during the warm-up phase. How long that lasts depends on the dataset.
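One likely explanation for the zeros at the start: DPO's implicit reward is β · (log π_θ(y|x) − log π_ref(y|x)), and the policy is initialized from the reference model, so the log-ratios cancel exactly until the optimizer has made some updates. Below is a minimal sketch of that computation (not OpenRLHF's actual code; the `dpo_rewards` helper and tensor names are hypothetical, and `beta=0.1` mirrors the `--beta 0.1` flag in the script above):

```python
import torch

def dpo_rewards(policy_logps_chosen, policy_logps_rejected,
                ref_logps_chosen, ref_logps_rejected, beta=0.1):
    # The implicit DPO reward is beta * (log pi_theta - log pi_ref),
    # computed separately for the chosen and rejected responses.
    chosen_rewards = beta * (policy_logps_chosen - ref_logps_chosen)
    rejected_rewards = beta * (policy_logps_rejected - ref_logps_rejected)
    return chosen_rewards, rejected_rewards

# Toy per-sequence summed log-probs. At step 0 the policy equals the
# reference model, so policy and reference log-probs are identical and
# both logged rewards come out exactly zero.
chosen_logps = torch.tensor([-42.0])
rejected_logps = torch.tensor([-57.0])
c, r = dpo_rewards(chosen_logps, rejected_logps, chosen_logps, rejected_logps)
print(c, r)  # tensor([0.]) tensor([0.])
```

So a flat-zero reward early in training is not by itself a bug; it should drift away from zero once the policy starts diverging from the reference model.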