OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

reward is always 0 when training DPO #345

Closed: UbeCc closed this issue 4 days ago

UbeCc commented 4 days ago

Hi! I'm training a DPO model on my dataset, but I found that the reward is always 0. So I cloned the OpenRLHF repo again without modifying any source code; the only thing I changed is train_dpo.sh (the demo script), and the reward was still always 0.

How can I solve this? Thanks!

set -x 

deepspeed \
     --include=localhost:0,1,2,3,4,5,6,7 \
     ../train_dpo.py \
     --save_path ./checkpoint/llama3-8b-dpo \
     --save_steps -1 \
     --logging_steps 1 \
     --eval_steps -1 \
     --train_batch_size 256 \
     --micro_train_batch_size 1 \
     --pretrain mistralai/Mistral-7B-v0.3 \
     --bf16 \
     --max_epochs 1 \
     --max_len 8192 \
     --zero_stage 3 \
     --learning_rate 9e-6 \
     --beta 0.1 \
     --dataset OpenLLMAI/preference_dataset_mixture2_and_safe_pku \
     --apply_chat_template \
     --chosen_key chosen \
     --rejected_key rejected \
     --gradient_checkpointing 
UbeCc commented 4 days ago

I don't know why, but the reward is zero during the warm-up phase. How long it stays at zero depends on the dataset.
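
For reference, the "reward" logged during DPO training is usually the implicit reward, i.e. beta times the difference between the policy and reference log-probabilities. Because the policy is initialized from the reference model, that difference starts near zero, so near-zero rewards early in training are expected. Below is a minimal sketch of that computation, assuming made-up per-sequence log-probability tensors; it is not the exact OpenRLHF code.

# Minimal sketch of the implicit DPO reward (illustrative, not OpenRLHF's implementation).
# The *_logps tensors are hypothetical per-sequence log-probabilities.
import torch
import torch.nn.functional as F

beta = 0.1

policy_chosen_logps   = torch.tensor([-120.0, -95.0])
policy_rejected_logps = torch.tensor([-130.0, -110.0])
ref_chosen_logps      = torch.tensor([-120.0, -95.0])    # identical to policy at step 0
ref_rejected_logps    = torch.tensor([-130.0, -110.0])   # identical to policy at step 0

# Implicit rewards: beta * (log pi_theta - log pi_ref)
chosen_reward   = beta * (policy_chosen_logps - ref_chosen_logps)
rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
print(chosen_reward, rejected_reward)  # both ~0 while the policy still matches the reference

# DPO loss: -log sigmoid(beta * ((chosen log-ratio) - (rejected log-ratio)))
logits = (policy_chosen_logps - ref_chosen_logps) - (policy_rejected_logps - ref_rejected_logps)
loss = -F.logsigmoid(beta * logits).mean()

As the policy drifts away from the reference on the preference data, the chosen reward should rise above the rejected reward; if both stay exactly 0 long after warm-up, that would point to a data or configuration problem rather than expected behavior.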