[WARNING|logging.py:329] 2024-06-14 18:45:29,004 >> Not an error, but Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters are not enabled or a bias term (like in Qwen) is used.
[WARNING|logging.py:329] 2024-06-14 18:45:29,004 >> Not an error, but Unsloth cannot patch Attention layers with our manual autograd engine since either LoRA adapters are not enabled or a bias term (like in Qwen) is used.
[WARNING|logging.py:329] 2024-06-14 18:45:29,004 >> Not an error, but Unsloth cannot patch O projection layer with our manual autograd engine since either LoRA adapters are not enabled or a bias term (like in Qwen) is used.
[WARNING|logging.py:329] 2024-06-14 18:45:29,005 >> Unsloth 2024.6 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
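For reference, Unsloth only swaps in its fast autograd kernels for modules that carry bias-free LoRA adapters, which is what these warnings are checking for. Below is a minimal sketch of a PEFT LoRA configuration that satisfies that condition; the rank, alpha and target-module names are illustrative assumptions and are not taken from this run's config:

```python
from peft import LoraConfig, get_peft_model

# Illustrative values only; the actual finetuning_args of this run are not shown above.
lora_config = LoraConfig(
    r=8,                # LoRA rank
    lora_alpha=16,      # scaling factor
    lora_dropout=0.0,
    bias="none",        # bias terms would block Unsloth's manual autograd patch
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# model = get_peft_model(base_model, lora_config)  # base_model loaded elsewhere
```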
06/14/2024 18:45:29 - INFO - llamafactory.model.loader - trainable params: 3407872 || all params: 8033669120 || trainable%: 0.0424
[INFO|trainer.py:641] 2024-06-14 18:45:29,957 >> Using auto half precision backend
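The trainable-parameter figure can be sanity-checked by hand. The arithmetic below assumes an 8B Llama-style model (hidden size 4096, 32 layers, 1024-dim K/V projections under grouped-query attention) with rank-8 LoRA on q_proj and v_proj only; those shapes are my assumption rather than something stated in the log, but they reproduce 3,407,872 exactly:

```python
hidden, kv_dim, layers, r = 4096, 1024, 32, 8   # assumed Llama-3-8B-style shapes

q_proj = r * (hidden + hidden)   # LoRA A: hidden x r, LoRA B: r x hidden
v_proj = r * (hidden + kv_dim)   # V projection is narrower under grouped-query attention
trainable = layers * (q_proj + v_proj)
total = 8_033_669_120            # "all params" from the log

print(trainable)                          # 3407872
print(f"{100 * trainable / total:.4f}")   # 0.0424
```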
[WARNING|logging.py:329] 2024-06-14 18:45:30,297 >> * Our OSS was designed for people with few GPU resources to level the playing field.
The OSS Apache 2 license only supports one GPU - please obtain a commercial license.
We're a 2 person team, so we still have to fund our development costs - thanks!
If you don't, please consider at least sponsoring us through Ko-fi! Appreciate it!
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 2
   \\   /|    Num examples = 491 | Num Epochs = 10
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 300
 "-____-"     Number of trainable parameters = 3,407,872
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/site-packages/llamafactory/launcher.py", line 9, in <module>
    launch()
  File "/root/miniconda3/lib/python3.10/site-packages/llamafactory/launcher.py", line 5, in launch
    run_exp()
  File "/root/miniconda3/lib/python3.10/site-packages/llamafactory/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/root/miniconda3/lib/python3.10/site-packages/llamafactory/train/sft/workflow.py", line 73, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "<string>", line 226, in _fast_inner_training_loop
RuntimeError: Error: More than 1 GPUs have a lot of VRAM usage. Please obtain a commercial license.
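This RuntimeError is Unsloth's open-source build refusing to run when it detects more than one busy GPU (the banner above shows Num GPUs = 2). If single-GPU training is acceptable, one common workaround is to make only one device visible before CUDA is initialized; the snippet below is a generic sketch of that idea, not part of LLaMA-Factory or Unsloth:

```python
import os

# Must run before `import torch` (or anything else that initializes CUDA),
# otherwise both GPUs have already been registered with the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only GPU 0

import torch
print(torch.cuda.device_count())   # expected to print 1
```

The same variable can instead be exported in the shell before launching the training command.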
[2024-06-14 18:45:38,183] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 7755) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/root/miniconda3/lib/python3.10/site-packages/llamafactory/launcher.py FAILED
How should I handle this?