Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model. A low-resource Chinese llama+lora approach, with a structure modeled on alpaca
https://github.com/Facico/Chinese-Vicuna
Apache License 2.0
4.14k stars 421 forks

Runs on a single GPU, fails on multiple GPUs: raise Exception('cublasLt ran into an error!') #3

Closed kaihe closed 1 year ago

kaihe commented 1 year ago

    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/root/miniconda3/lib/python3.8/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')

The failure comes from this spot in bitsandbytes.functional, where has_error ends up equal to 1:

if formatB == 'col_turing':
    if dtype == torch.int32:
        has_error = lib.cigemmlt_turing_32(
            ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
        )
    else:
        has_error = lib.cigemmlt_turing_8(
            ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
        )
elif formatB == "col_ampere":
    if dtype == torch.int32:
        has_error = lib.cigemmlt_ampere_32(
            ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
        )
    else:
        has_error = lib.cigemmlt_ampere_8(
            ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
        )

The matrix tensors printed when the error is detected:

error detected
A: torch.Size([512, 4096]), B: torch.Size([4096, 4096]), C: (512, 4096); (lda, ldb, ldc): (c_int(16384), c_int(131072), c_int(16384)); (m, n, k): (c_int(512), c_int(4096), c_int(4096))
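As a quick sanity check on the numbers in that log line, each printed leading dimension is 32 times a matrix dimension (16384 = 32 × 512, 131072 = 32 × 4096). The snippet below only verifies that arithmetic. The factor of 32 is read off the log and is an assumption here, not a statement of how bitsandbytes computes these values.

```python
# Values copied from the log line in the issue.
m, n, k = 512, 4096, 4096
lda, ldb, ldc = 16384, 131072, 16384

# Assumption: a factor of 32, inferred only from the printed numbers.
TILE = 32

# n and k are both 4096 here, so the log cannot tell us which one ldb follows.
checks = {
    "lda == TILE * m": lda == TILE * m,
    "ldb == TILE * n": ldb == TILE * n,
    "ldc == TILE * m": ldc == TILE * m,
}

for name, ok in checks.items():
    print(name, ok)
```

All three relations hold, which suggests the arguments passed to the kernel are at least self-consistent.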

Facico commented 1 year ago

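@kaihe Thank you very much for raising this. I found a similar issue that involves mixing CPU and GPU together with 8-bit loading: https://github.com/huggingface/transformers/issues/21371 Did you train with the original parameters, or could you share more detailed parameters?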

kaihe commented 1 year ago

Thanks for the reply. I used your original parameters and did not deliberately try to mix GPU and CPU.

device_map = "auto"
world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1
if ddp:
    device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)}
    GRADIENT_ACCUMULATION_STEPS = GRADIENT_ACCUMULATION_STEPS // world_size
print(args.model_path)
model = LlamaForCausalLM.from_pretrained(
    args.model_path,
    load_in_8bit=True,
    device_map=device_map,
)
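
To make the single-GPU versus multi-GPU difference in the snippet above easier to see, here is a minimal sketch of the same decision as a standalone function. `resolve_device_map` is a hypothetical helper name, not something in the repository, and the environment is passed in as a plain mapping so the logic can be exercised without launching a distributed job.

```python
from typing import Dict, Mapping, Tuple, Union

DeviceMap = Union[str, Dict[str, int]]


def resolve_device_map(
    env: Mapping[str, str], grad_accum_steps: int
) -> Tuple[DeviceMap, int]:
    """Hypothetical helper mirroring the device_map logic quoted above."""
    world_size = int(env.get("WORLD_SIZE", 1))
    if world_size == 1:
        # Single process: let the loader place the model automatically.
        return "auto", grad_accum_steps
    # DDP: pin the whole model ("" means the root module) to this process's GPU
    # and split gradient accumulation across the processes.
    local_rank = int(env.get("LOCAL_RANK") or 0)
    return {"": local_rank}, grad_accum_steps // world_size


# Single-process run: no WORLD_SIZE set.
print(resolve_device_map({}, 128))
# Third process of a four-process run.
print(resolve_device_map({"WORLD_SIZE": "4", "LOCAL_RANK": "2"}, 128))
```

In the multi-GPU case each process loads the full 8-bit model onto its own card, so a problem with one physical GPU would only surface in the process assigned to it.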

Facico commented 1 year ago

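@kaihe At the moment this still looks like a dependency problem, since the only difference between the single-GPU and multi-GPU runs is the ddp part. You could try setting up a python3.10 environment and see whether the problem persists. We will provide a more detailed version configuration later. Here is a python3.10 configuration that works with multiple GPUs, for reference: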

absl-py                  1.4.0
accelerate               0.15.0
aiodns                   3.0.0
aiofiles                 23.1.0
aiohttp                  3.8.3
aiosignal                1.3.1
altair                   4.2.2
anyio                    3.6.2
appdirs                  1.4.4
async-timeout            4.0.2
attrs                    22.2.0
beautifulsoup4           4.11.2
bitsandbytes             0.37.0
Brotli                   1.0.9
cachetools               5.3.0
certifi                  2022.12.7
cffi                     1.15.1
charset-normalizer       2.1.1
click                    8.1.3
contourpy                1.0.7
cpm-kernels              1.0.11
cycler                   0.11.0
datasets                 2.8.0
deepspeed                0.7.7
dill                     0.3.6
distlib                  0.3.6
docker-pycreds           0.4.0
einops                   0.6.0
entrypoints              0.4
evaluate                 0.4.0
fastapi                  0.95.0
ffmpy                    0.3.0
filelock                 3.9.0
fire                     0.5.0
flash-attn               0.2.8
fonttools                4.39.2
frozenlist               1.3.3
fsspec                   2023.3.0
gdown                    4.6.4
gensim                   3.8.2
gitdb                    4.0.10
GitPython                3.1.31
google-auth              2.16.2
google-auth-oauthlib     0.4.6
gradio                   3.23.0
grpcio                   1.51.3
h11                      0.14.0
hjson                    3.1.0
httpcore                 0.16.3
httpx                    0.23.3
huggingface-hub          0.13.3
icetk                    0.0.5
idna                     3.4
inflate64                0.3.1
Jinja2                   3.1.2
joblib                   1.2.0
jsonlines                3.1.0
jsonschema               4.17.3
kiwisolver               1.4.4
linkify-it-py            2.0.0
loguru                   0.6.0
loralib                  0.1.1
Markdown                 3.4.1
markdown-it-py           2.2.0
MarkupSafe               2.1.2
matplotlib               3.7.1
mdit-py-plugins          0.3.3
mdurl                    0.1.2
msgpack                  1.0.4
multidict                6.0.4
multiprocess             0.70.14
multivolumefile          0.2.3
networkx                 3.0
ninja                    1.11.1
nltk                     3.8.1
numpy                    1.24.2
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-ml-py             11.525.84
nvitop                   1.0.0
oauthlib                 3.2.2
openai                   0.27.2
orjson                   3.8.8
packaging                23.0
pandas                   1.5.3
pathtools                0.1.2
peft                     0.3.0.dev0
Pillow                   9.4.0
pip                      22.3.1
platformdirs             3.1.0
protobuf                 3.20.1
psutil                   5.9.4
py-cpuinfo               9.0.0
py7zr                    0.20.4
pyarrow                  11.0.0
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pybcj                    1.0.1
pycares                  4.3.0
pycparser                2.21
pycryptodomex            3.17
pydantic                 1.10.4
pydub                    0.25.1
Pygments                 2.14.0
pyparsing                3.0.9
pyppmd                   1.0.0
pyrsistent               0.19.3
PySocks                  1.7.1
python-dateutil          2.8.2
python-multipart         0.0.6
pytz                     2022.7.1
PyYAML                   6.0
pyzstd                   0.15.4
ray                      2.3.0
regex                    2022.10.31
requests                 2.28.2
requests-oauthlib        1.3.1
responses                0.18.0
rfc3986                  1.5.0
rich                     13.3.2
rouge-score              0.1.2
rsa                      4.9
scikit-learn             1.2.0
scipy                    1.10.1
semantic-version         2.10.0
sentencepiece            0.1.97
sentry-sdk               1.16.0
setproctitle             1.3.2
setuptools               65.6.3
six                      1.16.0
smart-open               6.3.0
smmap                    5.0.0
sniffio                  1.3.0
soupsieve                2.4
starlette                0.26.1
tabulate                 0.9.0
tensorboard              2.12.0
tensorboard-data-server  0.7.0
tensorboard-plugin-wit   1.8.1
termcolor                2.2.0
texttable                1.6.7
threadpoolctl            3.1.0
tokenizers               0.13.2
toolz                    0.12.0
torch                    1.13.1
torchtyping              0.1.4
torchvision              0.14.1
tqdm                     4.65.0
transformers             4.28.0.dev0
trlx                     0.3.0
typeguard                2.13.3
typing_extensions        4.5.0
uc-micro-py              1.0.1
urllib3                  1.26.14
uvicorn                  0.21.1
virtualenv               20.20.0
wandb                    0.13.10
websockets               10.4
Werkzeug                 2.2.3
wheel                    0.38.4
xxhash                   3.2.0
yarl                     1.8.2
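
One way to compare a local environment against the list above is sketched below. It is a minimal example using only the standard library (`importlib.metadata`). `PINNED` holds a handful of versions copied from the list, and `find_mismatches` is a hypothetical helper name chosen for illustration.

```python
from importlib import metadata
from typing import Callable, Dict, Mapping, Optional, Tuple

# A few pins copied from the list above; extend as needed.
PINNED = {
    "torch": "1.13.1",
    "bitsandbytes": "0.37.0",
    "accelerate": "0.15.0",
    "peft": "0.3.0.dev0",
    "transformers": "4.28.0.dev0",
}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    pinned: Mapping[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package whose version differs to (pinned, installed)."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


for name, (wanted, found) in find_mismatches(PINNED).items():
    print(f"{name}: pinned {wanted}, installed {found or 'not installed'}")
```

The `lookup` parameter exists only so the comparison can be tried against a fake environment. An exact string match is deliberately strict, and a mismatch is a hint to investigate rather than proof of a problem.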

kaihe commented 1 year ago

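Found the problem: one of my 4 GPUs appears to be faulty. I tried different combinations of the cards and found that the error occurs whenever that card is used, and does not occur with the other cards.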
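
The trial-by-combination approach described above can be sketched as follows. This is a minimal illustration, not the script actually used in this issue: the helper names are hypothetical, and the pass/fail results are simulated by pretending GPU 2 is the faulty card. In practice each subset would be run by setting the `CUDA_VISIBLE_DEVICES` environment variable before launching training and recording whether the run fails.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def gpu_subsets(gpu_ids: Iterable[int], size: int) -> List[Tuple[int, ...]]:
    """All subsets of the given size, e.g. every pair of cards."""
    return list(combinations(sorted(gpu_ids), size))


def visible_devices_value(subset: Tuple[int, ...]) -> str:
    """Value to assign to CUDA_VISIBLE_DEVICES for one trial run."""
    return ",".join(str(i) for i in subset)


def suspect_gpus(results: Dict[Tuple[int, ...], bool]) -> Set[int]:
    """GPUs present in every failing subset and absent from every passing one.

    `results` maps a subset of GPU ids to True if the run succeeded.
    """
    failing = [set(s) for s, ok in results.items() if not ok]
    passing = [set(s) for s, ok in results.items() if ok]
    if not failing:
        return set()
    suspects = set.intersection(*failing)
    for good in passing:
        suspects -= good
    return suspects


# Simulated outcome: assume (for illustration only) that GPU 2 is faulty.
FAULTY = 2
results = {s: FAULTY not in s for s in gpu_subsets(range(4), 2)}

for subset, ok in results.items():
    status = "ok" if ok else "error"
    print(f"CUDA_VISIBLE_DEVICES={visible_devices_value(subset)} -> {status}")
print("suspects:", suspect_gpus(results))
```

Testing pairs rather than single cards keeps every trial a genuine multi-GPU run, which matters here because the error only appeared in the multi-GPU setup.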