Thanks for sharing this repo.

When I run DPO with FSDPTrainer and sample_during_eval enabled, training gets stuck with the following output:
Processing HH: 100%|█████████████████| 160800/160800 [00:04<00:00, 39073.07it/s]
Running evaluation after 0 train examples
Computing eval metrics: 100%|█████████████████████| 8/8 [00:04<00:00, 1.71it/s]
Warning: n_eval_model_samples (16) < eval_batch_size (32). Sampling from the first complete eval batch of prompts.
Generating samples...: 0%| | 0/1 [00:00<?, ?it/s]
Concretely, it gets stuck inside model.generate(). Several issues report problems with FSDP + HuggingFace generate (also mentioned in a comment in your code: https://github.com/pytorch/pytorch/issues/100069). I wanted to check whether you have run into this situation and how you handle it.
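For reference, here is roughly the shape of the sampling call on my side when it hangs. This is only a simplified sketch, not the repo's actual trainer code; names like policy, tokenizer, and prompt_batch are placeholders. The summon_full_params wrapper is one approach I have seen mentioned for generating under FSDP, but I have not confirmed it avoids the hang:

```python
# Simplified sketch with placeholder names, not the repo's exact trainer code.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def sample_from_policy(policy, tokenizer, prompt_batch, max_new_tokens=256):
    """Sample completions from an FSDP-wrapped HF model.

    The hang happens inside generate(); gathering full parameters first via
    summon_full_params is one workaround I have seen mentioned for FSDP
    generation, included here only as a sketch.
    """
    inputs = tokenizer(prompt_batch, return_tensors="pt", padding=True)
    inputs = {k: v.to(torch.cuda.current_device()) for k, v in inputs.items()}

    with FSDP.summon_full_params(policy, writeback=False, recurse=False):
        with torch.no_grad():
            output_ids = policy.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                pad_token_id=tokenizer.pad_token_id,
            )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```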
Looking forward to your response, thanks in advance!