Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model (a low-resource Chinese LLaMA + LoRA recipe, structured after Alpaca)
https://github.com/Facico/Chinese-Vicuna
Apache License 2.0

bash finetune_continue.sh failed with 'RuntimeError: Trainer requires either a model or model_init argument' #123

Open SeekPoint opened 1 year ago

SeekPoint commented 1 year ago

```
(gh_Chinese-Vicuna) ub2004@ub2004-B85M-A0:~/llm_dev/Chinese-Vicuna$ bash finetune_continue.sh

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to:
https://github.com/TimDettmers/bitsandbytes/issues

Found cached dataset json (/home/ub2004/.cache/huggingface/datasets/json/default-6eef2a44d8479e8f/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|████████████████████████████████████████| 1/1 [00:00<00:00, 115.39it/s]
Restarting from ./lora-Vicuna/checkpoint-11600/pytorch_model.bin
finetune.py:125: UserWarning: epoch 3 replace to the base_max_steps 17298
  warnings.warn("epoch {} replace to the base_max_steps {}".format(EPOCHS, base_max_steps))

Traceback (most recent call last)
/home/ub2004/llm_dev/Chinese-Vicuna/finetune.py:235
    232     train_data = data["train"].shuffle().map(generate_and_tokenize_prompt)
    233     val_data = None
    234
  ❱ 235 trainer = transformers.Trainer(
    236     model=model,
    237     train_dataset=train_data,
    238     eval_dataset=val_data,

/home/ub2004/.local/lib/python3.8/site-packages/transformers/trainer.py:356 in __init__
    353                 self.model_init = model_init
    354                 model = self.call_model_init()
    355             else:
  ❱ 356                 raise RuntimeError("`Trainer` requires either a `model` or `model_init`
    357         else:
    358             if model_init is not None:
    359                 warnings.warn(

RuntimeError: `Trainer` requires either a `model` or `model_init` argument
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 26013) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/ub2004/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune.py FAILED

Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-29_00:06:42
  host      : ub2004-B85M-A0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 26013)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
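For reference, the final `RuntimeError` comes from the check at `transformers/trainer.py:356`: the `Trainer` constructor received neither a usable `model` nor a `model_init`, which implies `model` was `None` by the time `finetune.py:235` ran. A minimal sketch that reproduces just that check, independent of any Chinese-Vicuna code (assuming a `transformers` version comparable to the one in the log, with its Trainer dependencies installed):

```python
# Minimal sketch: transformers.Trainer raises the same RuntimeError when it
# is given neither `model` nor `model_init`. No Chinese-Vicuna code involved.
import transformers

try:
    transformers.Trainer(model=None, train_dataset=None, eval_dataset=None)
except RuntimeError as err:
    print(err)  # -> `Trainer` requires either a `model` or `model_init` argument
```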
Facico commented 1 year ago

This error looks like the model was not loaded successfully. You can try the third item in the code here, titled "输出乱码问题" (garbled-output problem); you can use that code to check whether the model loads properly.
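A minimal sketch of that kind of standalone load check (the base-model name below is an assumption, the checkpoint path is taken from the "Restarting from ..." line in the log, and `load_in_8bit` mirrors the repo's 8-bit setup, so drop it if bitsandbytes itself is the suspect):

```python
# Standalone load check, outside finetune.py: can the base model and the
# LoRA checkpoint be read at all? BASE_MODEL is an assumed value -- use
# whatever model path finetune_continue.sh actually passes.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

BASE_MODEL = "decapoda-research/llama-7b-hf"                    # assumption
LORA_CKPT = "./lora-Vicuna/checkpoint-11600/pytorch_model.bin"  # from the log above

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    load_in_8bit=True,           # remove if bitsandbytes is suspected
    torch_dtype=torch.float16,
    device_map="auto",
)
print("base model loaded:", model.config.model_type)

# The resumed LoRA checkpoint should be a readable, non-empty state dict.
state_dict = torch.load(LORA_CKPT, map_location="cpu")
print("checkpoint tensors:", len(state_dict))
```

If either step fails here, the `Trainer` error above is just a downstream symptom of the model not being loaded.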

YSLLYW commented 1 year ago

> RuntimeError: `Trainer` requires either a `model` or `model_init` argument

Have you solved this problem?