allenai / open-instruct


Problems with resuming from checkpoint for finetune_with_lora #330

Open ypwang61 opened 2 months ago

ypwang61 commented 2 months ago

Hi, thanks for your great work. I ran into the following error when trying to resume my LoRA fine-tuning.

09/04/2024 15:54:23 - INFO - accelerate.accelerator - Loading states from output/tulu_v2_dolly_openorca_1M_v2_64_7B_lora/step_15600
09/04/2024 15:54:23 - INFO - accelerate.accelerator - Loading DeepSpeed Model and Optimizer
[rank0]: Traceback (most recent call last):
[rank0]:   File "/homes/gws/ypwang61/Research/LLM/open-instruct/open_instruct/finetune.py", line 682, in <module>
[rank0]:     main()
[rank0]:   File "/homes/gws/ypwang61/Research/LLM/open-instruct/open_instruct/finetune.py", line 573, in main
[rank0]:     accelerator.load_state(checkpoint_path)
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/accelerate/accelerator.py", line 3064, in load_state
[rank0]:     model.load_checkpoint(input_dir, ckpt_id, **load_model_func_kwargs)
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2759, in load_checkpoint
[rank0]:     load_path, client_states = self._load_checkpoint(load_dir,
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2809, in _load_checkpoint
[rank0]:     sd_loader = SDLoaderFactory.get_sd_loader(ckpt_list, checkpoint_engine=self.checkpoint_engine)
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 43, in get_sd_loader
[rank0]:     return MegatronSDLoader(ckpt_list, version, checkpoint_engine)
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 193, in __init__
[rank0]:     super().__init__(ckpt_list, version, checkpoint_engine)
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 55, in __init__
[rank0]:     self.check_ckpt_list()
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 168, in check_ckpt_list
[rank0]:     assert len(self.ckpt_list) > 0
[rank0]: AssertionError

This is what I have in the step_15600 directory:

adapter_config.json  adapter_model.safetensors  README.md
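
For context on the traceback: DeepSpeed's `_load_checkpoint` builds `ckpt_list` from its own engine state shards (files such as `mp_rank_00_model_states.pt`), while the directory above holds only the PEFT adapter files, so the list is empty and `assert len(self.ckpt_list) > 0` fires. A minimal diagnostic sketch (the helper name `describe_checkpoint` is hypothetical, not part of open-instruct) to distinguish the two cases before calling `accelerator.load_state`:

```python
import glob
import os

def describe_checkpoint(checkpoint_path):
    # DeepSpeed-resumable checkpoints contain engine shards such as
    # mp_rank_00_model_states.pt; a LoRA-only save has just the PEFT adapter.
    ds_shards = glob.glob(
        os.path.join(checkpoint_path, "**", "*model_states.pt"), recursive=True
    )
    has_adapter = os.path.exists(
        os.path.join(checkpoint_path, "adapter_model.safetensors")
    )
    print(f"DeepSpeed model-state shards: {len(ds_shards)}")
    print(f"PEFT adapter present: {has_adapter}")

describe_checkpoint("output/tulu_v2_dolly_openorca_1M_v2_64_7B_lora/step_15600")
```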
hamishivi commented 2 months ago

Hi! The code doesn't currently support resuming training when doing LoRA training. We should add support for this (internally we usually just full-finetune). Feel free to help add it; otherwise it might take a little while to get to, due to some upcoming deadlines. I'll leave the issue open to track this.
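
Until resuming is supported in finetune.py, one possible stopgap (not part of open-instruct; the helper name `load_lora_adapter` is hypothetical) is to restore only the adapter weights from the step directory and skip `accelerator.load_state`, accepting that optimizer and scheduler state are lost:

```python
from peft import set_peft_model_state_dict
from safetensors.torch import load_file

def load_lora_adapter(model, checkpoint_path):
    # Restore only the LoRA adapter weights into the PEFT-wrapped model.
    # Optimizer, LR-scheduler, and dataloader state are NOT restored here.
    adapter_state = load_file(f"{checkpoint_path}/adapter_model.safetensors")
    set_peft_model_state_dict(model, adapter_state)
    return model
```

Called on the PEFT model before `accelerator.prepare`, this gives only an approximate resume: the learning-rate schedule restarts and already-seen batches are not skipped unless you fast-forward the dataloader yourself.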