InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

Do QLoRA fine-tuned models not support resuming training after an interruption? #915

Open deep-practice opened 1 month ago

deep-practice commented 1 month ago

08/27 17:57:22 - mmengine - INFO - Resume checkpoint from /root/InternLM/work_dir/internvl_ft_trafficsign_multiround/iter_6000.pth
Traceback (most recent call last):
  File "/root/InternLM/code/XTuner/xtuner/tools/train.py", line 360, in <module>
    main()
  File "/root/InternLM/code/XTuner/xtuner/tools/train.py", line 356, in main
    runner.train()
  File "/root/.conda/envs/demo/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1195, in train
    self.load_or_resume()
  File "/root/.conda/envs/demo/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1141, in load_or_resume
    self.resume(resume_from)
  File "/root/.conda/envs/demo/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1456, in resume
    checkpoint = self.strategy.resume(
  File "/root/InternLM/code/XTuner/xtuner/engine/_strategy/deepspeed.py", line 60, in resume
    checkpoint = super().resume(*args, **kwargs)
  File "/root/.conda/envs/demo/lib/python3.10/site-packages/mmengine/strategy/deepspeed.py", line 472, in resume
    _, extra_ckpt = self.model.load_checkpoint(
  File "/root/.conda/envs/demo/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2759, in load_checkpoint
    load_path, client_states = self._load_checkpoint(load_dir,
  File "/root/.conda/envs/demo/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2809, in _load_checkpoint
    sd_loader = SDLoaderFactory.get_sd_loader(ckpt_list, checkpoint_engine=self.checkpoint_engine)
  File "/root/.conda/envs/demo/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 43, in get_sd_loader
    return MegatronSDLoader(ckpt_list, version, checkpoint_engine)
  File "/root/.conda/envs/demo/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 193, in __init__
    super().__init__(ckpt_list, version, checkpoint_engine)
  File "/root/.conda/envs/demo/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 55, in __init__
    self.check_ckpt_list()
  File "/root/.conda/envs/demo/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 168, in check_ckpt_list
    assert len(self.ckpt_list) > 0
AssertionError

Giserlei123 commented 1 month ago

Have you managed to solve this?

deep-practice commented 1 month ago

No.