eric-mitchell / direct-preference-optimization

Reference implementation for DPO (Direct Preference Optimization)
Apache License 2.0

DPO did not achieve the expected experimental effect #56

Open · Vance0124 opened this issue 7 months ago

Vance0124 commented 7 months ago

I replicated the pythia28 experiments on hh (Anthropic/hh-rlhf) using the open-source code. Here are some of the experimental results:

SFT1:

python -u train.py exp_name=sft gradient_accumulation_steps=1 batch_size=4 eval_batch_size=16 model.policy_dtype=float32

with batch_size=4. I then evaluated the model using GPT-4 [screenshot: sft3], but the result doesn't seem very promising. I also ran three other versions:

SFT2:

python -u train.py model=pythia28 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia28_sft gradient_accumulation_steps=8 batch_size=64 eval_batch_size=32 trainer=TensorParallelTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16 model.policy_dtype=float32

The result [screenshot: sft4] also doesn't look good.

SFT3(bfloat16):

python -u train.py model=pythia28 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia28_sft gradient_accumulation_steps=8 batch_size=64 eval_batch_size=16 trainer=TensorParallelTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16 model.policy_dtype=bfloat16

The result: [screenshot: sft1]

SFT4(bfloat16):

python -u train.py model=pythia28 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia28_sft gradient_accumulation_steps=8 batch_size=64 eval_batch_size=16 trainer=BasicTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16 model.policy_dtype=bfloat16

The result: [screenshot: sft2]

The evaluation results of SFT3 (bfloat16) and SFT4 (bfloat16) seem even worse.
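
As a rough illustration of the kind of precision loss I suspect (a synthetic example, not the repo's evaluation code), round-tripping logits through bfloat16 visibly perturbs the resulting log-probabilities compared to float32:

import torch

torch.manual_seed(0)
logits = torch.randn(4, 50304)  # fake logits over a Pythia-sized vocabulary
logps_fp32 = torch.log_softmax(logits, dim=-1)
# Round-trip through bfloat16 to mimic reduced-precision activations/weights.
logps_bf16 = torch.log_softmax(logits.bfloat16().float(), dim=-1)
print((logps_fp32 - logps_bf16).abs().max())  # roughly 1e-2 in this toy setup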

Based on SFT1 (which I think is probably the best among these), I trained DPO1 with the following command:

python -u train.py loss=dpo loss.beta=0.1 exp_name=DPO_pythia28 gradient_accumulation_steps=1 batch_size=4 eval_batch_size=16 model.policy_dtype=float32 model.archive=.cache/yanxue/DPO_SFT_2023-11-25_20-19-34_103923/LATEST/policy.pt 

The result: [screenshot: DPO1]. The highest win rate is only 50%, but I don't know what I did wrong or whether I missed something.
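
For reference, this is my understanding of the loss that loss=dpo loss.beta=0.1 optimizes (a paraphrase of Eq. 7 in the DPO paper, not a copy of the repo's code; the function and variable names here are mine):

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    # DPO objective: -log sigmoid(beta * ((log pi_c - log pi_r) - (log ref_c - log ref_r)))
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps
    losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))
    # Implicit per-example rewards, useful for the rewards_*/accuracies plots.
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()
    return losses.mean(), chosen_rewards, rejected_rewards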

I also ran another DPO experiment (DPO2) based on SFT2:

python -u train.py model=pythia28 datasets=[hh] loss=dpo loss.beta=0.1 exp_name=anthropic_dpo_pythia28_dpo model.policy_dtype=bfloat16 gradient_accumulation_steps=16 batch_size=64 eval_batch_size=16 trainer=TensorParallelTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16 model.archive=.cache/root/anthropic_dpo_pythia28_sft_2023-12-03_15-39-23_344965/LATEST/policy.pt

Compared to the training results in the public WandB runs, my own experimental results did not meet expectations.

The training curve "rewards_train/accuracies" from the public WandB run [screenshot: wandb_train] reaches more than 70%, while mine [screenshot: result_train] only reaches around 60% at most. The evaluation curve "rewards_eval/accuracies" from the public WandB run [screenshot: wandb_eval] compared with mine [screenshot: result_eval_dpo2] also shows a gap of close to 10%. All other parameters use the defaults (e.g. lr=5e-7). I'm not sure whether I made a mistake somewhere or what needs to be changed to bridge this gap. Please help me; I sincerely appreciate it.
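
For clarity, the metric I am comparing is, as far as I understand it, the fraction of preference pairs whose implicit chosen reward beats the rejected one (my assumption about how the logged number is computed, using the rewards from the dpo_loss sketch above):

import torch

def reward_accuracy(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> float:
    # Fraction of pairs where the chosen response receives the higher implicit reward.
    return (chosen_rewards > rejected_rewards).float().mean().item()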

Vance0124 commented 7 months ago

My main issues are:

  1. Is there any mistake in the training method for my SFT model (float32)?
  2. Is it normal for the evaluation performance of the SFT model (bfloat16) to be much worse than that of the SFT model (float32)? Is there a way to compensate for this? Also, if a DPO model trained on top of such an SFT model (bfloat16) performs poorly, is there a way to remedy it? (A sketch of how I understand these dtypes to be applied at model-loading time follows this list.)
  3. Is there any method or parameter adjustment to bridge the gap between DPO and the displayed results?
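
To be concrete about question 2, my assumption (not the repo's exact code) is that model.policy_dtype simply controls the torch dtype the policy weights are loaded and trained in, roughly like this:

import torch
from transformers import AutoModelForCausalLM

# Hypothetical sketch: policy_dtype would come from the Hydra config (float32 or bfloat16).
policy_dtype = torch.float32
policy = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-2.8b", torch_dtype=policy_dtype)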
eric-mitchell commented 6 months ago
  1. Changing the FSDP mixed-precision dtype (model.fsdp_policy_mp) when not using the FSDPTrainer will have no effect. The reference wandb run used the command:
    train.py model=pythia28 datasets=[hh] loss=sft exp_name=pythia28_hh_sft_bf16 gradient_accumulation_steps=2 batch_size=64 n_epochs=1 eval_batch_size=32 trainer=FSDPTrainer eval_every=5000 sample_during_eval=false model.fsdp_policy_mp=bfloat16
  2. It's possible this has to do with an interaction between the TensorParallelTrainer and reduced precision. Can you try again with the FSDPTrainer? (A sketch of the mixed-precision setup FSDP typically uses follows this list.)
  3. Can you try again with the exact commands used in the demo runs? https://wandb.ai/eric_anthony_mitchell/dpo-demos
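
Regarding point 2, here is a minimal sketch (not the repo's exact code) of the mixed-precision policy an FSDP-based trainer typically builds from model.fsdp_policy_mp=bfloat16; FSDP usually keeps the sharded parameters in their original float32 while casting to bfloat16 for compute and communication:

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Hypothetical wiring: map model.fsdp_policy_mp=bfloat16 onto FSDP's MixedPrecision policy.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,   # dtype used for forward/backward compute
    reduce_dtype=torch.bfloat16,  # dtype used for gradient reduce-scatter/all-reduce
    buffer_dtype=torch.bfloat16,  # dtype for module buffers
)
# Inside the trainer the policy would then be wrapped roughly as:
# policy = FSDP(policy, mixed_precision=mp_policy, ...)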