I tried to reproduce the complete example on a Hyperstack cloud machine (A100-80G-PCIe, OS Image Ubuntu Server 22.04 LTS, R535 CUDA 12.2). Since I'm using a single A100, I reduced the batch size. This command starts the training:
python -u train.py model=pythia28 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia28 gradient_accumulation_steps=2 batch_size=16 eval_batch_size=16 trainer=FSDPTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16
Unfortunately, training fails when saving the first checkpoint at 20,000 examples, with the following stack trace:
Error executing job with overrides: ['model=pythia28', 'datasets=[hh]', 'loss=sft', 'exp_name=anthropic_dpo_pythia28', 'gradient_accumulation_steps=2', 'batch_size=16', 'eval_batch_size=16', 'trainer=FSDPTrainer', 'sample_during_eval=false', 'model.fsdp_policy_mp=bfloat16']
Traceback (most recent call last):
File "/home/ubuntu/dpo-examples/direct-preference-optimization/train.py", line 111, in main
mp.spawn(worker_main, nprocs=world_size, args=(world_size, config, policy, reference_model), join=True)
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/ubuntu/dpo-examples/direct-preference-optimization/train.py", line 44, in worker_main
trainer.train()
File "/home/ubuntu/dpo-examples/direct-preference-optimization/trainers.py", line 352, in train
self.save(output_dir, mean_eval_metrics)
File "/home/ubuntu/dpo-examples/direct-preference-optimization/trainers.py", line 501, in save
policy_state_dict = self.policy.state_dict()
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1722, in _save_to_state_dict
hook(self, prefix, keep_vars)
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 669, in _pre_state_dict_hook
_pre_state_dict_hook_fn[fsdp_state._state_dict_type](
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 271, in _full_pre_state_dict_hook
_common_unshard_pre_state_dict_hook(
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 143, in _common_unshard_pre_state_dict_hook
_enter_unshard_params_ctx(
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 109, in _enter_unshard_params_ctx
fsdp_state._unshard_params_ctx[module].__enter__()
File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_unshard_param_utils.py", line 171, in _unshard_fsdp_state_params
_validate_unshard_params_args(
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_unshard_param_utils.py", line 140, in _validate_unshard_params_args
raise NotImplementedError(
NotImplementedError: offload_to_cpu=True and NO_SHARD is not supported yet
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
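To make sure I understand the failing path, here is a minimal single-GPU sketch that I believe hits the same code path. This is only my guess at what the FSDPTrainer save in trainers.py does around line 501 (requesting a FULL_STATE_DICT with offload_to_cpu=True); the single-process group and the tiny nn.Linear are placeholders for illustration, not the real model or config:

# Minimal sketch (my assumption of the failing save path, not the repo's actual code).
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)
torch.cuda.set_device(0)

# With world_size=1, FSDP (torch 2.0.x) seems to fall back to the NO_SHARD strategy.
model = FSDP(torch.nn.Linear(8, 8).cuda())

# The same save pattern the stack trace points at: full state dict, offloaded to CPU.
save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_policy):
    state = model.state_dict()  # raises: offload_to_cpu=True and NO_SHARD is not supported yet

dist.destroy_process_group()

If I read the FSDP sources correctly, with world_size=1 FSDP falls back to NO_SHARD, and _validate_unshard_params_args rejects that in combination with offload_to_cpu=True, but I may well be misreading it.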
It seems that either I installed an incompatible version of a library, or an incompatible library came preinstalled with the cloud image?
The funny thing is that I tried to install only Python dependencies that were available by 2023-06-22 (the date of the last commit to requirements.txt) using pypi-timemachine, but it seems I still got something wrong. Here are the versions on my cloud machine:
Does any of you good souls know what is wrong, and which library version is causing the problem?