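@kaihe Thank you very much for raising this. I found a similar issue that involves mixed CPU/GPU placement combined with 8-bit loading: https://github.com/huggingface/transformers/issues/21371 Did you train with the original parameters? If not, could you share your parameters in more detail?
Thanks for the reply. I used your original parameters and did not deliberately try any GPU/CPU mixing.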
device_map = "auto"
world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1
if ddp:
    device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)}
    GRADIENT_ACCUMULATION_STEPS = GRADIENT_ACCUMULATION_STEPS // world_size
print(args.model_path)
model = LlamaForCausalLM.from_pretrained(
    args.model_path,
    load_in_8bit=True,
    device_map=device_map,
)
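To make the single-GPU versus multi-GPU difference easier to inspect, the branch above can be pulled into a small pure function. This is only a sketch: `resolve_device_map` and its arguments are hypothetical names, not part of the training script, and the environment is passed in as a plain mapping so the logic can be checked without launching any processes.

```python
def resolve_device_map(env, grad_accum_steps):
    """Mirror the snippet above: return (device_map, grad_accum_steps, ddp).

    `env` is any mapping with the same keys as os.environ (hypothetical helper).
    """
    world_size = int(env.get("WORLD_SIZE", 1))
    ddp = world_size != 1
    if ddp:
        # Under DDP each process loads the whole model onto its own local GPU.
        device_map = {"": int(env.get("LOCAL_RANK") or 0)}
        # Accumulation steps are split across the processes.
        grad_accum_steps = grad_accum_steps // world_size
    else:
        # Single process: the loader decides placement on its own.
        device_map = "auto"
    return device_map, grad_accum_steps, ddp
```

Calling it with `os.environ` inside each worker and printing the result shows which GPU index every rank is actually using, which is useful when only some cards trigger the error.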
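@kaihe My current guess is that this is still a dependency problem, since the only difference between the single-GPU and multi-GPU runs is the DDP branch. You could try setting up a Python 3.10 environment and see whether the problem persists. We will provide a more detailed version configuration later. For reference, here is a Python 3.10 configuration that works for multi-GPU training: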
torch 1.13.1
torchtyping 0.1.4
torchvision 0.14.1
absl-py 1.4.0
accelerate 0.15.0
aiodns 3.0.0
aiofiles 23.1.0
aiohttp 3.8.3
aiosignal 1.3.1
altair 4.2.2
anyio 3.6.2
appdirs 1.4.4
async-timeout 4.0.2
attrs 22.2.0
beautifulsoup4 4.11.2
bitsandbytes 0.37.0
Brotli 1.0.9
cachetools 5.3.0
certifi 2022.12.7
cffi 1.15.1
charset-normalizer 2.1.1
click 8.1.3
contourpy 1.0.7
cpm-kernels 1.0.11
cycler 0.11.0
datasets 2.8.0
deepspeed 0.7.7
dill 0.3.6
distlib 0.3.6
docker-pycreds 0.4.0
einops 0.6.0
entrypoints 0.4
evaluate 0.4.0
fastapi 0.95.0
ffmpy 0.3.0
filelock 3.9.0
fire 0.5.0
flash-attn 0.2.8
fonttools 4.39.2
frozenlist 1.3.3
fsspec 2023.3.0
gdown 4.6.4
gensim 3.8.2
gitdb 4.0.10
GitPython 3.1.31
google-auth 2.16.2
google-auth-oauthlib 0.4.6
gradio 3.23.0
grpcio 1.51.3
h11 0.14.0
hjson 3.1.0
httpcore 0.16.3
httpx 0.23.3
huggingface-hub 0.13.3
icetk 0.0.5
idna 3.4
inflate64 0.3.1
Jinja2 3.1.2
joblib 1.2.0
jsonlines 3.1.0
jsonschema 4.17.3
kiwisolver 1.4.4
linkify-it-py 2.0.0
loguru 0.6.0
loralib 0.1.1
Markdown 3.4.1
markdown-it-py 2.2.0
MarkupSafe 2.1.2
matplotlib 3.7.1
mdit-py-plugins 0.3.3
mdurl 0.1.2
msgpack 1.0.4
multidict 6.0.4
multiprocess 0.70.14
multivolumefile 0.2.3
networkx 3.0
ninja 1.11.1
nltk 3.8.1
numpy 1.24.2
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-ml-py 11.525.84
nvitop 1.0.0
oauthlib 3.2.2
openai 0.27.2
orjson 3.8.8
packaging 23.0
pandas 1.5.3
pathtools 0.1.2
peft 0.3.0.dev0
Pillow 9.4.0
pip 22.3.1
platformdirs 3.1.0
protobuf 3.20.1
psutil 5.9.4
py-cpuinfo 9.0.0
py7zr 0.20.4
pyarrow 11.0.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pybcj 1.0.1
pycares 4.3.0
pycparser 2.21
pycryptodomex 3.17
pydantic 1.10.4
pydub 0.25.1
Pygments 2.14.0
pyparsing 3.0.9
pyppmd 1.0.0
pyrsistent 0.19.3
PySocks 1.7.1
python-dateutil 2.8.2
python-multipart 0.0.6
pytz 2022.7.1
PyYAML 6.0
pyzstd 0.15.4
ray 2.3.0
regex 2022.10.31
requests 2.28.2
requests-oauthlib 1.3.1
responses 0.18.0
rfc3986 1.5.0
rich 13.3.2
rouge-score 0.1.2
rsa 4.9
scikit-learn 1.2.0
scipy 1.10.1
semantic-version 2.10.0
sentencepiece 0.1.97
sentry-sdk 1.16.0
setproctitle 1.3.2
setuptools 65.6.3
six 1.16.0
smart-open 6.3.0
smmap 5.0.0
sniffio 1.3.0
soupsieve 2.4
starlette 0.26.1
tabulate 0.9.0
tensorboard 2.12.0
tensorboard-data-server 0.7.0
tensorboard-plugin-wit 1.8.1
termcolor 2.2.0
texttable 1.6.7
threadpoolctl 3.1.0
tokenizers 0.13.2
toolz 0.12.0
tqdm 4.65.0
transformers 4.28.0.dev0
trlx 0.3.0
typeguard 2.13.3
typing_extensions 4.5.0
uc-micro-py 1.0.1
urllib3 1.26.14
uvicorn 0.21.1
virtualenv 20.20.0
wandb 0.13.10
websockets 10.4
Werkzeug 2.2.3
wheel 0.38.4
xxhash 3.2.0
yarl 1.8.2
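One way to compare a local environment against this list is to check a few of the pins programmatically. The sketch below uses the standard-library `importlib.metadata`. The selection of packages in `REFERENCE_PINS` is an assumption about which ones matter most here, and `find_mismatches` is a hypothetical helper name.

```python
from importlib import metadata

# A few pins copied from the list above; which ones matter is an assumption.
REFERENCE_PINS = {
    "torch": "1.13.1",
    "accelerate": "0.15.0",
    "bitsandbytes": "0.37.0",
    "peft": "0.3.0.dev0",
    "transformers": "4.28.0.dev0",
}


def installed_version(name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        # Ignore a local build tag such as "+cu117" when comparing.
        base = found.split("+")[0] if found is not None else None
        if base != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

Running `find_mismatches(REFERENCE_PINS)` in the failing environment returns an empty dict when every listed package matches, and otherwise shows the expected and installed versions side by side.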
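Found the problem: one of my 4 GPUs appears to be faulty. I tried different combinations of the cards and found that the error occurs whenever that card is included, and never when only the other cards are used.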
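The elimination procedure described above can be sketched as follows. `find_faulty_gpus` and `run_ok` are hypothetical names. In practice `run_ok` might start a short training run with `CUDA_VISIBLE_DEVICES` restricted to the given group and report whether it finished without the error; here it is left as a callback so the logic stays independent of any particular launch command.

```python
from itertools import combinations


def find_faulty_gpus(gpu_ids, run_ok, group_size=2):
    """Return GPUs present in every failing group and in no passing group.

    `run_ok(group)` must return True if a run on that tuple of GPU ids succeeds.
    """
    suspects = set(gpu_ids)
    cleared = set()
    saw_failure = False
    for group in combinations(gpu_ids, group_size):
        if run_ok(group):
            # Every card in a passing group is considered healthy.
            cleared.update(group)
        else:
            saw_failure = True
            # The culprit must be common to all failing groups.
            suspects &= set(group)
    if not saw_failure:
        return []
    return sorted(suspects - cleared)
```

For example, with four cards and a callback that fails whenever card 2 is in the group, `find_faulty_gpus([0, 1, 2, 3], run_ok)` narrows the result down to `[2]`.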
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/root/miniconda3/lib/python3.8/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')
The failure is at this point in bitsandbytes.functional, where has_error == 1.
The matrix and tensor information printed at the time of the error:
error detected
A: torch.Size([512, 4096]), B: torch.Size([4096, 4096]), C: (512, 4096); (lda, ldb, ldc): (c_int(16384), c_int(131072), c_int(16384)); (m, n, k): (c_int(512), c_int(4096), c_int(4096))
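As a sanity check on the printed values, the leading dimensions can be reproduced from the shapes. The sketch below assumes a tiled layout with 32 columns per tile, so that each leading dimension is a row count multiplied by 32. That rule is inferred from the numbers in the log and is not confirmed anywhere in this thread. `tiled_leading_dims` is a hypothetical helper.

```python
def tiled_leading_dims(shape_a, shape_b, tile_cols=32):
    """Recompute ((lda, ldb, ldc), (m, n, k)) under the assumed 32-column tiling."""
    m, k = shape_a      # A is m x k
    n = shape_b[0]      # B is n x k; the output C is m x n
    lda = m * tile_cols
    ldb = n * tile_cols  # assumes n is already a multiple of the tile height
    ldc = m * tile_cols
    return (lda, ldb, ldc), (m, n, k)


dims = tiled_leading_dims((512, 4096), (4096, 4096))
print(dims)  # ((16384, 131072, 16384), (512, 4096, 4096))
```

Under that assumption the result matches the log exactly, so the shapes passed to the kernel look self-consistent. That fits the finding that the error follows one particular card and not the inputs.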