You seem to have asked a similar question before; did you ever resolve it back then? If nothing else works, you can try the continue_finetune script and restart from the last checkpoint before the crash.
Next time you run, put TORCH_DISTRIBUTED_DEBUG=DETAIL in front of CUDA_VISIBLE_DEVICES, or prepend it to the bash command, to get more detailed error output.
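For example (a minimal sketch; the script names, GPU ids, and checkpoint path are placeholders, not necessarily the repo's exact files):

```bash
# Re-run with detailed distributed-debug logging; DETAIL makes torch.distributed
# log extra information that helps localize silent multi-GPU crashes.
TORCH_DISTRIBUTED_DEBUG=DETAIL CUDA_VISIBLE_DEVICES=0,1 bash finetune.sh

# Hypothetical resume from the last saved checkpoint via the continue-finetune
# script mentioned above (flag and paths depend on your local setup).
bash finetune_continue.sh --resume_from_checkpoint ./output/checkpoint-1000
```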
@Facico I ran into a similar problem earlier but have been busy with other things. After sorting out the generate problem over the past couple of days, I tried training again and hit the same issue. I checked the points you raised: first, it is not a manual exit, because I launched with nohup. Second, the CPU should not be the bottleneck either; the server runs only this one job, the dataset is only about 400k samples, and it is dual-GPU training, so the CPU load should be light. Since there is no other error message, I honestly don't know what the problem is. Can you reproduce it on your side? The training data is Belle-0.5M.
> Next time you run, put TORCH_DISTRIBUTED_DEBUG=DETAIL in front of CUDA_VISIBLE_DEVICES, or prepend it to the bash command, to get more detailed error output.
I'll give that a try.
@Facico I added TORCH_DISTRIBUTED_DEBUG=DETAIL and then hit the problem again. I'm quite puzzled and have no idea what's going on.
Also, when launching training I frequently run into this:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda101_nocublaslt.so
/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: /root/anaconda3/envs/chinesevicuna did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/opt/rh/devtoolset-9/root/usr/lib/dyninst'), PosixPath('/opt/rh/devtoolset-9/root/usr/lib64/dyninst')}
warn(msg)
CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
CUDA SETUP: CUDA runtime path found: /root/anaconda3/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 101
/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Required library version not found: libbitsandbytes_cuda101_nocublaslt.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
================================================ERROR=====================================
CUDA SETUP: CUDA detection failed! Possible reasons:
1. CUDA driver not installed
2. CUDA not installed
3. You have multiple conflicting CUDA libraries
4. Required library not pre-compiled for this bitsandbytes release!
CUDA SETUP: If you compiled from source, try again with `make CUDA_VERSION=DETECTED_CUDA_VERSION` for example, `make CUDA_VERSION=113`.
CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via `conda list | grep cuda`.
================================================================================
CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=101_nomatmul
python setup.py install
CUDA SETUP: Setup Failed!
CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=101_nomatmul
python setup.py install
Traceback (most recent call last):
File "/data1/fffan/5_NLP/4_ChineseVicuna/Chinese_Vicuna_0420/finetune_fffan_data.py", line 17, in <module>
from peft import (
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/peft/__init__.py", line 22, in <module>
from .mapping import MODEL_TYPE_TO_PEFT_MODEL_MAPPING, PEFT_TYPE_TO_CONFIG_MAPPING, get_peft_config, get_peft_model
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/peft/mapping.py", line 16, in <module>
from .peft_model import (
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/peft/peft_model.py", line 31, in <module>
from .tuners import LoraModel, PrefixEncoder, PromptEmbedding, PromptEncoder
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/peft/tuners/__init__.py", line 20, in <module>
from .lora import LoraConfig, LoraModel
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/peft/tuners/lora.py", line 36, in <module>
import bitsandbytes as bnb
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/__init__.py", line 6, in <module>
from . import cuda_setup, utils, research
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/research/__init__.py", line 1, in <module>
from . import nn
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
from .modules import LinearFP8Mixed, LinearFP8Global
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
from bitsandbytes.optim import GlobalOptimManager
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
from bitsandbytes.cextension import COMPILED_WITH_CUDA
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 20, in <module>
raise RuntimeError('''
RuntimeError:
CUDA Setup failed despite GPU being available. Please run the following command to get more information:
python -m bitsandbytes
Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
When this happens I have to relaunch training repeatedly before it runs normally, which is bizarre. My environment can't be fundamentally broken, otherwise training would never start at all, yet I hit the error above on many launches. Very confusing!
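Incidentally, the RuntimeError above points at one possible mitigation for the flaky detection: making sure LD_LIBRARY_PATH points at the CUDA runtime before every launch (a sketch; the path is illustrative and must match wherever your libcudart.so actually lives):

```bash
# Give bitsandbytes a deterministic place to find the CUDA runtime
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH
python -m bitsandbytes   # re-run the diagnostic to confirm detection succeeds
```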
Here is my conda environment list:
Adding that variable only makes the error output more detailed. You can adjust your setup to match the configuration we provided earlier: https://github.com/Facico/Chinese-Vicuna/blob/master/docs/problems.md Pay particular attention to the versions of bitsandbytes (yours probably needs downgrading), transformers, and peft. Since transformers and peft are pulled directly from GitHub and have since moved forward by many versions, they can cause problems; you can use https://github.com/huggingface/transformers@ff20f9cf3615a8638023bc82925573cb9d0f3560 (or simply transformers 4.28.0) together with git+https://github.com/huggingface/peft@e536616888d51b453ed354a6f1e243fecb02ea08
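Pinned concretely, that advice might look like this (a sketch using the versions named above; the bitsandbytes pin follows the 0.37.0 discussed below and may still need adjusting for your CUDA):

```bash
pip install transformers==4.28.0
pip install git+https://github.com/huggingface/peft@e536616888d51b453ed354a6f1e243fecb02ea08
pip install bitsandbytes==0.37.0
```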
@Facico I changed bitsandbytes to the version you require, 0.37.0, but training errors out:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
Traceback (most recent call last):
  File "/data1/fffan/5_NLP/4_ChineseVicuna/Chinese_Vicuna_0420/finetune_fffan.py", line 6, in <module>
    import bitsandbytes as bnb
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/__init__.py", line 7, in <module>
    from .autograd._functions import (
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/autograd/__init__.py", line 1, in <module>
    from ._functions import undo_layout, get_inverse_transform_indices
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 9, in <module>
    import bitsandbytes.functional as F
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/functional.py", line 17, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 13, in <module>
    setup.run_cuda_setup()
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py", line 92, in run_cuda_setup
    binary_name, cudart_path, cuda, cc, cuda_version_string = evaluate_cuda_setup()
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py", line 395, in evaluate_cuda_setup
    has_cublaslt = is_cublasLt_compatible(cc)
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py", line 153, in is_cublasLt_compatible
    cuda_setup.add_log_entry("WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!", is_warning=True)
NameError: name 'cuda_setup' is not defined. Did you mean: 'CUDASetup'?
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 47366) of binary: /root/anaconda3/envs/chinesevicuna/bin/python3.10
Traceback (most recent call last):
File "/root/anaconda3/envs/chinesevicuna/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
As it stands, bitsandbytes only works for me at version 0.38.1; every other version fails with one error or another.
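For anyone comparing versions, the installed build and its CUDA detection can be checked with:

```bash
# Show the installed bitsandbytes version, then run its built-in diagnostic
pip show bitsandbytes | grep -i version
python -m bitsandbytes
```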
@Facico I adjusted my environment to match yours exactly (except bitsandbytes, which won't run at the version you specified), and training still stops. Very strange; I don't know why. My GPUs are V100s.
@Facico Hi, I tried again and it still doesn't work:
With 0.37.0 it won't run at all and fails with NameError: name 'cuda_setup' is not defined. Did you mean: 'CUDASetup'?
0.36 has the same problem.
With 0.37.1 and 0.37.2 (and higher versions too), training still quits partway through for no apparent reason.
In short: right now I cannot finetune the model at all >_<
Training on a V100 may not be able to use 8-bit; try enabling fp16 and reducing micro_batch_size, otherwise the loss tends to blow up. You can refer to this issue; I'm not sure whether it could be related to your problem.
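A sketch of such a launch, assuming the finetune script exposes Trainer-style switches (the flag names here are illustrative, not necessarily the repo's exact arguments):

```bash
# Hypothetical V100 launch: fp16 on, smaller micro batch, two GPUs
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 finetune.py \
    --fp16 \
    --micro_batch_size 2
```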
I'm running into this problem too. Did you ever manage to solve it?
@Tian14267 @nietzsche9088 I ran into this problem as well, though with different code. I noticed the run would stop on its own whenever I closed Xshell. Typing exit before closing Xshell fixed it for me; the job no longer stops by itself.
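A related safeguard is to detach the job from the SSH session entirely, so closing the terminal cannot kill it (a standard nohup pattern; the script and log names are illustrative):

```bash
# Keep training alive after the Xshell/SSH session closes
nohup bash finetune.sh > train.log 2>&1 &
disown   # also remove the job from the shell's job table
```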
Try a different machine? I suspect a hardware problem; I ran into this recently too while training yolov8.
@sakurarma That helped me, thanks.
Hi, I've run into the training-stops-midway problem again: dual-GPU training, and at 1000+ steps it suddenly stopped once more. Here is my finetune code:
finetune_fffan.zip
What could be causing this? I keep suspecting something is wrong with the distributed training.