Open SeekPoint opened 1 year ago
This error looks like the model was not loaded successfully. You can try the third item in the code here, under the heading "输出乱码问题" ("garbled output problem"); you can use that code to check whether the model loads properly.
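To make that check concrete, here is a minimal load-check sketch. The function name, base-model name, and the lazy-import structure are my own assumptions, not the repo's actual script; it simply tries to load the base weights plus the LoRA checkpoint and raises if either step fails.

```python
def load_check(base_model: str, lora_dir: str):
    """Try to load the base model plus LoRA weights; raises if either fails."""
    # Imported lazily so the sketch itself parses without the heavy dependencies.
    from transformers import LlamaForCausalLM
    from peft import PeftModel

    # If either of these lines throws, the finetune script would also end up
    # without a usable model object.
    model = LlamaForCausalLM.from_pretrained(base_model, device_map="auto")
    return PeftModel.from_pretrained(model, lora_dir)
```

For example, `load_check("decapoda-research/llama-7b-hf", "./lora-Vicuna/checkpoint-11600")` with the checkpoint path taken from the log (the base-model name is a guess).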
```
(gh_Chinese-Vicuna) ub2004@ub2004-B85M-A0:~/llm_dev/Chinese-Vicuna$ bash finetune_continue.sh
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to:
https://github.com/TimDettmers/bitsandbytes/issues
Found cached dataset json (/home/ub2004/.cache/huggingface/datasets/json/default-6eef2a44d8479e8f/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|████████████████████████████████████████| 1/1 [00:00<00:00, 115.39it/s]
Restarting from ./lora-Vicuna/checkpoint-11600/pytorch_model.bin
finetune.py:125: UserWarning: epoch 3 replace to the base_max_steps 17298
  warnings.warn("epoch {} replace to the base_max_steps {}".format(EPOCHS, base_max_steps))
╭─────────────────── Traceback (most recent call last) ───────────────────╮
│ /home/ub2004/llm_dev/Chinese-Vicuna/finetune.py:235 in <module>         │
│                                                                         │
│   232   train_data = data["train"].shuffle().map(generate_and_tokenize_prompt)
│   233   val_data = None
│   234
│ ❱ 235 trainer = transformers.Trainer(
│   236     model=model,
│   237     train_dataset=train_data,
│   238     eval_dataset=val_data,
│
│ /home/ub2004/.local/lib/python3.8/site-packages/transformers/trainer.py:356 in __init__
│
│   353         self.model_init = model_init
│   354         model = self.call_model_init()
│   355       else:
│ ❱ 356         raise RuntimeError("`Trainer` requires either a `model` or `model_init` argument")
│   357     else:
│   358       if model_init is not None:
│   359         warnings.warn(
╰─────────────────────────────────────────────────────────────────────────╯
RuntimeError: `Trainer` requires either a `model` or `model_init` argument
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 26013) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/ub2004/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
finetune.py FAILED
Failures:
Root Cause (first observed failure):
[0]:
  time      : 2023-04-29_00:06:42
  host      : ub2004-B85M-A0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 26013)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
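For context, this RuntimeError fires inside `Trainer.__init__` whenever both `model` and `model_init` are `None`, so the `model` variable in `finetune.py` was most likely left as `None` by the resume logic before line 235. A simplified sketch of the guard the traceback points at (paraphrased from the traceback above, not the actual transformers source):

```python
# Simplified sketch of the check Trainer.__init__ performs (function name is
# hypothetical; behavior paraphrased from the traceback above).
def init_trainer_model(model=None, model_init=None):
    if model is None:
        if model_init is not None:
            # Trainer can build the model on demand from a factory callable.
            model = model_init()
        else:
            # Neither a model instance nor a factory was supplied.
            raise RuntimeError(
                "`Trainer` requires either a `model` or `model_init` argument"
            )
    return model
```

So the thing to verify before the `transformers.Trainer(...)` call is that loading/resuming actually produced a non-`None` model object.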
Have you managed to solve this problem?