HuangLK / transpeeder

train llama on a single A100 80G node using 🤗 transformers and 🚀 Deepspeed Pipeline Parallelism
Apache License 2.0

Error when using ZeRO-1 #37

Open bebory opened 1 year ago

bebory commented 1 year ago

Traceback (most recent call last):
  File "train.py", line 131, in <module>
    main()
  File "train.py", line 109, in main
    engine.load_checkpoint(model_args.init_ckpt,load_module_only=True)#load_module_only=True
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2769, in load_checkpoint
    success = self._load_zero_checkpoint(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2948, in _load_zero_checkpoint
    zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 3042, in _get_all_zero_checkpoints
    return self._get_all_zero_checkpoint_state_dicts(zero_ckpt_names)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 3014, in _get_all_zero_checkpoint_state_dicts
    _state = self.checkpoint_engine.load(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in load
    partition = torch.load(path, map_location=map_location)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 231, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 212, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './llama-7B-init-ckpt/global_step001/zero_pp_rank_0_mp_rank_01_optim_states.pt'
[2023-08-13 20:35:08,552] [INFO] [torch_checkpoint_engine.py:21:load] [Torch] Loading checkpoint from ./llama-7B-init-ckpt/global_step001/zero_pp_rank_0_mp_rank_02_optim_states.pt...

HuangLK commented 1 year ago

Reinstall DeepSpeed with a specific version and then try again :)
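
Before reinstalling, it may help to record which DeepSpeed version is currently installed and which optimizer-state files the checkpoint directory actually contains. The following is only a diagnostic sketch using the Python standard library. The helper names `installed_version` and `list_optim_state_files` are hypothetical and are not part of transpeeder or DeepSpeed; the directory and tag are taken from the traceback in this issue, and the `*_optim_states.pt` pattern is assumed from the file name in the error message.

```python
import importlib.metadata
from pathlib import Path
from typing import List, Optional


def installed_version(package: str = "deepspeed") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


def list_optim_state_files(ckpt_dir: str, tag: str) -> List[str]:
    """List the *_optim_states.pt files found under ckpt_dir/tag, sorted by name.

    Returns an empty list if the tag directory does not exist.
    """
    tag_dir = Path(ckpt_dir) / tag
    if not tag_dir.is_dir():
        return []
    return sorted(p.name for p in tag_dir.glob("*_optim_states.pt"))


if __name__ == "__main__":
    # Directory and tag copied from the traceback above; adjust to your setup.
    print("deepspeed version:", installed_version())
    print("optimizer state files:",
          list_optim_state_files("./llama-7B-init-ckpt", "global_step001"))
```

If the second list is empty, the init checkpoint was saved without ZeRO optimizer states, which would be consistent with the FileNotFoundError above. Including both outputs in a follow-up comment makes the problem easier to reproduce.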