Open SeekPoint opened 1 year ago
This error looks like the model was not loaded successfully. You can try the third item in the code here, under the heading "输出乱码问题" ("garbled output problem"); you can use that code to check whether the model loads properly.
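To make that check concrete, here is a minimal load-check sketch. The function name, base-model name, and the lazy-import structure are my own assumptions, not the repo's actual script; it simply tries to load the base weights plus the LoRA checkpoint and raises if either step fails.

```python
def load_check(base_model: str, lora_dir: str):
    """Try to load the base model plus LoRA weights; raises if either fails."""
    # Imported lazily so the sketch itself parses without the heavy dependencies.
    from transformers import LlamaForCausalLM
    from peft import PeftModel

    # If either of these lines throws, the finetune script would also end up
    # without a usable model object.
    model = LlamaForCausalLM.from_pretrained(base_model, device_map="auto")
    return PeftModel.from_pretrained(model, lora_dir)
```

For example, `load_check("decapoda-research/llama-7b-hf", "./lora-Vicuna/checkpoint-11600")` with the checkpoint path taken from the log (the base-model name is a guess).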
```
(gh_Chinese-Vicuna) ub2004@ub2004-B85M-A0:~/llm_dev/Chinese-Vicuna$ bash finetune_continue.sh
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to:
https://github.com/TimDettmers/bitsandbytes/issues
Found cached dataset json (/home/ub2004/.cache/huggingface/datasets/json/default-6eef2a44d8479e8f/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|████████████████████████████████████████| 1/1 [00:00<00:00, 115.39it/s]
Restarting from ./lora-Vicuna/checkpoint-11600/pytorch_model.bin
finetune.py:125: UserWarning: epoch 3 replace to the base_max_steps 17298
  warnings.warn("epoch {} replace to the base_max_steps {}".format(EPOCHS, base_max_steps))
╭─────────────────── Traceback (most recent call last) ───────────────────╮
│ /home/ub2004/llm_dev/Chinese-Vicuna/finetune.py:235 in <module>         │
│                                                                         │
│   232   train_data = data["train"].shuffle().map(generate_and_tokenize_prompt)
│   233   val_data = None
│   234
│ ❱ 235 trainer = transformers.Trainer(
│   236     model=model,
│   237     train_dataset=train_data,
│   238     eval_dataset=val_data,
│
│ /home/ub2004/.local/lib/python3.8/site-packages/transformers/trainer.py:356 in __init__
│
│   353         self.model_init = model_init
│   354         model = self.call_model_init()
│   355       else:
│ ❱ 356         raise RuntimeError("`Trainer` requires either a `model` or `model_init` argument")
│   357     else:
│   358       if model_init is not None:
│   359         warnings.warn(
╰─────────────────────────────────────────────────────────────────────────╯
RuntimeError: `Trainer` requires either a `model` or `model_init` argument
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 26013) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/ub2004/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
finetune.py FAILED
Failures:
Root Cause (first observed failure):
[0]:
  time      : 2023-04-29_00:06:42
  host      : ub2004-B85M-A0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 26013)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
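For context, this RuntimeError fires inside `Trainer.__init__` whenever both `model` and `model_init` are `None`, so the `model` variable in `finetune.py` was most likely left as `None` by the resume logic before line 235. A simplified sketch of the guard the traceback points at (paraphrased from the traceback above, not the actual transformers source):

```python
# Simplified sketch of the check Trainer.__init__ performs (function name is
# hypothetical; behavior paraphrased from the traceback above).
def init_trainer_model(model=None, model_init=None):
    if model is None:
        if model_init is not None:
            # Trainer can build the model on demand from a factory callable.
            model = model_init()
        else:
            # Neither a model instance nor a factory was supplied.
            raise RuntimeError(
                "`Trainer` requires either a `model` or `model_init` argument"
            )
    return model
```

So the thing to verify before the `transformers.Trainer(...)` call is that loading/resuming actually produced a non-`None` model object.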
Have you managed to solve this problem?