eric-mitchell / direct-preference-optimization

Reference implementation for DPO (Direct Preference Optimization)
Apache License 2.0

In DPO training, I got this ‘train stats after 160768 examples: {'rewards_train/chosen': 'nan', 'rewards_train/rejected': 'nan', 'rewards_train/accuracies': '0', 'rewards_train/margins': 'nan', 'logps_train/rejected': 'nan', 'logps_train/chosen': 'nan', 'loss/train': 'nan', 'examples_per_second': '5.4876', 'grad_norm': 'nan', 'counters/examples': 160768, 'counters/updates': 5024}’ #89

Open Alan-D-Chen opened 1 month ago

Alan-D-Chen commented 1 month ago
★---> train stats after 160768 examples: {'rewards_train/chosen': 'nan', 'rewards_train/rejected': 'nan', 'rewards_train/accuracies': '0', 'rewards_train/margins': 'nan', 'logps_train/rejected': 'nan', 'logps_train/chosen': 'nan', 'loss/train': 'nan', 'examples_per_second': '5.4876', 'grad_norm': 'nan', 'counters/examples': 160768, 'counters/updates': 5024}
★---> train stats after 160800 examples: {'rewards_train/chosen': 'nan', 'rewards_train/rejected': 'nan', 'rewards_train/accuracies': '0', 'rewards_train/margins': 'nan', 'logps_train/rejected': 'nan', 'logps_train/chosen': 'nan', 'loss/train': 'nan', 'examples_per_second': '5.4887', 'grad_norm': 'nan', 'counters/examples': 160800, 'counters/updates': 5025}
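
For context, every metric in these stats is derived from the per-sequence log-probabilities, so a single NaN in `logps_train/chosen` or `logps_train/rejected` makes the rewards, margins, loss, and grad norm NaN as well. Below is a minimal sketch of the standard DPO loss/reward computation (simplified; the function and variable names are illustrative, not taken verbatim from this repo):

```python
import torch
import torch.nn.functional as F

def dpo_loss_sketch(policy_chosen_logps, policy_rejected_logps,
                    reference_chosen_logps, reference_rejected_logps, beta=0.1):
    """Standard DPO loss and rewards: any NaN in the input log-probs
    propagates into the loss, rewards, and margins reported above."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps
    # Binary-cross-entropy-style DPO objective on the log-ratio difference.
    losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))
    # Implicit rewards reported as rewards_train/chosen and rewards_train/rejected.
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()
    return losses, chosen_rewards, rejected_rewards
```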

I have already run SFT training and obtained the XXX.pt checkpoint file. What is going wrong here?

Alan-D-Chen commented 1 month ago

Also, my SFT run does not converge, and on top of that the DPO run does not work at all.

In SFT, I run: python -u train.py model=pythia69 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia69 gradient_accumulation_steps=2 batch_size=64 eval_batch_size=32 trainer=FSDPTrainer sample_during_eval=false

but the results are shown in the attached screenshot (image.png, not reproduced here).
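
A generic way to localize non-convergence or NaN issues like this is to register forward hooks that report the first module whose output contains NaN or Inf during the forward pass. This is a minimal PyTorch sketch, not part of this repo; `register_nonfinite_hooks` and its behavior are illustrative, and it is easiest to use in a single-process debug run rather than under FSDP sharding:

```python
import torch

def register_nonfinite_hooks(model):
    """Attach forward hooks that print the first module whose output
    contains NaN/Inf, to help localize where numerical problems start."""
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outs:
                if torch.is_tensor(t) and t.is_floating_point() and not torch.isfinite(t).all():
                    print(f"non-finite output in module: {name}")
                    break
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call h.remove() on each handle when finished
```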

In DPO, I run: python -u train.py model=llama7b model.name_or_path=/workspace/sa/L20_TEST/LLM_models/llama2/hf_7B/ datasets=[hh] loss=dpo loss.beta=0.1 exp_name=anthropic_dpo_pythia69 gradient_accumulation_steps=2 batch_size=64 eval_batch_size=32 trainer=FSDPTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16 model.archive=.cache/root/anthropic_dpo_pythia69_2024-09-14_10-30-03_907191/step-159744/policy.pt

But the results are:

```
★---> train stats after 160768 examples: {'rewards_train/chosen': 'nan', 'rewards_train/rejected': 'nan', 'rewards_train/accuracies': '0', 'rewards_train/margins': 'nan', 'logps_train/rejected': 'nan', 'logps_train/chosen': 'nan', 'loss/train': 'nan', 'examples_per_second': '5.4876', 'grad_norm': 'nan', 'counters/examples': 160768, 'counters/updates': 5024}
★---> train stats after 160800 examples: {'rewards_train/chosen': 'nan', 'rewards_train/rejected': 'nan', 'rewards_train/accuracies': '0', 'rewards_train/margins': 'nan', 'logps_train/rejected': 'nan', 'logps_train/chosen': 'nan', 'loss/train': 'nan', 'examples_per_second': '5.4887', 'grad_norm': 'nan', 'counters/examples': 160800, 'counters/updates': 5025}
Finished generating 1 epochs on train split
writing checkpoint to .cache/root/anthropic_dpo_pythia69_2024-09-22_09-38-57_157738/LATEST/policy.pt...
[rank0]:[2024-09-22 18:41:34,834] torch.distributed.fsdp._debug_utils: [WARNING] FSDP _optim_state_dict() profiling: defaultdict(<class 'float'>, {'preprocessing': 0.012136668432503939, 'preprocessing_with_comm': 0.042172754649072886, 'state_converting': 13.85691294586286, <Type.ALL: 'all'>: 13.912817124743015})
writing checkpoint to .cache/root/anthropic_dpo_pythia69_2024-09-22_09-38-57_157738/LATEST/optimizer.pt...
writing checkpoint to .cache/root/anthropic_dpo_pythia69_2024-09-22_09-38-57_157738/LATEST/scheduler.pt...
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync .cache/root/wandb/offline-run-20240922_094100-59mk2v5i
```
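
Since the DPO command loads the SFT weights through `model.archive`, one quick sanity check is to scan that `policy.pt` for non-finite parameters before starting DPO. A minimal sketch, assuming the checkpoint is a dict whose `'state'` entry holds the state dict (which appears to be how this repo's trainers save checkpoints; adjust the key if your file stores a raw state dict):

```python
import torch

# Path taken from the model.archive argument in the DPO command above.
ckpt_path = ".cache/root/anthropic_dpo_pythia69_2024-09-14_10-30-03_907191/step-159744/policy.pt"

ckpt = torch.load(ckpt_path, map_location="cpu")
# Unwrap the weights if they are stored under 'state'; otherwise treat
# the loaded object as a raw state dict.
state_dict = ckpt["state"] if isinstance(ckpt, dict) and "state" in ckpt else ckpt

bad = [name for name, t in state_dict.items()
       if torch.is_tensor(t) and t.is_floating_point() and not torch.isfinite(t).all()]
print(f"checked {len(state_dict)} entries; tensors with NaN/Inf: {bad if bad else 'none'}")
```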

Alan-D-Chen commented 1 month ago

@eric-mitchell Could you do me a favor and help me with this? Huge thanks in advance.