QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0

[BUG] Quantized model inference raises RuntimeError: Unrecognized tensor type ID: AutocastCUDA #463

Closed: kscorl closed this issue 10 months ago

kscorl commented 10 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior

Running the 4-bit quantized model on WSL raises an error:

>>> response, history = model.chat(tokenizer, "你好", history=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 1200, in chat
    outputs = self.generate(
              ^^^^^^^^^^^^^^
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 1319, in generate
    return super().generate(
           ^^^^^^^^^^^^^^^^^
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/transformers/generation/utils.py", line 1642, in generate
    return self.sample(
           ^^^^^^^^^^^^
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/transformers/generation/utils.py", line 2724, in sample
    outputs = self(
              ^^^^^
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 1109, in forward
    transformer_outputs = self.transformer(
                          ^^^^^^^^^^^^^^^^^
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 938, in forward
    outputs = block(
              ^^^^^^
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 639, in forward
    attn_outputs = self.attn(
                   ^^^^^^^^^^
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 464, in forward
    mixed_x_layer = self.c_attn(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/auto_gptq/nn_modules/qlinear/qlinear_cuda_old.py", line 221, in forward
    self.autogptq_cuda.vecquant4matmul_old(x, self.qweight, out, self.scales.float(), self.qzeros, self.group_size)
RuntimeError: Unrecognized tensor type ID: AutocastCUDA
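The failure is inside auto_gptq's old CUDA kernel (`qlinear_cuda_old.py`), which under torch 2.1 receives an input tensor carrying the AutocastCUDA dispatch key that the compiled extension does not recognize. A minimal, untested workaround sketch along those lines (the confirmed fix later in this thread is downgrading PyTorch):

```python
import torch

# Hedged sketch: disable CUDA autocast around the failing call so inputs reach
# auto_gptq's vecquant4matmul_old kernel without the AutocastCUDA dispatch key.
# Untested here; downgrading to torch 2.0.1 is the fix confirmed below.
with torch.cuda.amp.autocast(enabled=False):
    response, history = model.chat(tokenizer, "你好", history=None)
```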

Expected Behavior

Inference should return results normally.

Steps To Reproduce

  1. Install the following dependencies:
    accelerate                    0.23.0
    aiohttp                       3.8.6
    aiosignal                     1.3.1
    async-timeout                 4.0.3
    attrs                         23.1.0
    auto-gptq                     0.4.2
    bfloat16                      1.2.0
    build                         1.0.3
    certifi                       2023.7.22
    charset-normalizer            3.3.0
    coloredlogs                   15.0.1
    cuda-python                   12.2.0
    cutlass                       3.1.0
    Cython                        3.0.3
    datasets                      2.14.5
    dill                          0.3.7
    distro                        1.8.0
    dropout-layer-norm            0.1
    einops                        0.7.0
    filelock                      3.12.4
    flash-attn                    2.3.2
    frozenlist                    1.4.0
    fsspec                        2023.6.0
    huggingface-hub               0.18.0
    humanfriendly                 10.0
    idna                          3.4
    Jinja2                        3.1.2
    MarkupSafe                    2.1.3
    mpmath                        1.3.0
    multidict                     6.0.4
    multiprocess                  0.70.15
    networkx                      3.1
    ninja                         1.11.1.1
    numpy                         1.26.0
    optimum                       1.13.2
    packaging                     23.2
    pandas                        2.1.1
    peft                          0.5.0
    Pillow                        9.3.0
    pip                           23.2.1
    protobuf                      4.24.4
    psutil                        5.9.5
    pyarrow                       13.0.0
    pybind11                      2.11.1
    pyproject_hooks               1.0.0
    python-dateutil               2.8.2
    pytz                          2023.3.post1
    PyYAML                        6.0.1
    regex                         2023.10.3
    requests                      2.31.0
    rouge                         1.0.1
    safetensors                   0.4.0
    scikit-build                  0.17.6
    scipy                         1.11.3
    sentencepiece                 0.1.99
    setuptools                    68.0.0
    six                           1.16.0
    sympy                         1.12
    tiktoken                      0.5.1
    tokenizers                    0.13.3
    torch                         2.1.0+cu118
    torchaudio                    2.1.0+cu118
    torchvision                   0.16.0+cu118
    tqdm                          4.66.1
    transformers                  4.32.0
    transformers-stream-generator 0.0.4
    treelib                       1.7.0
    triton                        2.1.0
    typing_extensions             4.8.0
    tzdata                        2023.3
    urllib3                       2.0.6
    wheel                         0.41.2
    xxhash                        3.4.1
    yarl                          1.9.2
  2. Run the following code in a Python interactive shell (the failing `chat` call is completed in the sketch below):
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from transformers.generation import GenerationConfig
    tokenizer = AutoTokenizer.from_pretrained("Qwen-14B-Chat-Int4", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained("Qwen-14B-Chat-Int4", device_map="auto", trust_remote_code=True).eval()
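The snippet above stops after loading the model. A hedged completion, assuming the usual Qwen README pattern for the otherwise-unused `GenerationConfig` import, with the `chat` call taken from the traceback above:

```python
# Assumed completion (Qwen README pattern) for the imported GenerationConfig;
# the chat() call is the one that raises in the traceback above.
model.generation_config = GenerationConfig.from_pretrained(
    "Qwen-14B-Chat-Int4", trust_remote_code=True
)
response, history = model.chat(tokenizer, "你好", history=None)  # RuntimeError on torch 2.1.0
```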

Environment

- OS: WSL2-openSUSE-Leap-15.5
- Python: 3.11
- Transformers: 4.32.0
- PyTorch: 2.1.0+cu118
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 11.8
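For reference, a one-shot snippet that prints the same details (a convenience sketch, not part of the original report):

```python
import torch
import transformers

# Collect the environment details listed above in one place.
print("torch       :", torch.__version__)         # 2.1.0+cu118
print("cuda        :", torch.version.cuda)        # 11.8
print("transformers:", transformers.__version__)  # 4.32.0
print("cuda usable :", torch.cuda.is_available())
```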

Anything else?

flash_attn was installed from flash_attn-2.3.2+cu118torch2.1cxx11abiFALSE-cp311-cp311-linux_x86_64.whl. Following the notes in the flash_attn source, rms_norm and NVIDIA's cutlass were installed as well.

search-codes-now-2016 commented 10 months ago

It's a PyTorch version issue; with PyTorch 2.0 it works fine. That's how I solved it!

kscorl commented 10 months ago

> It's a PyTorch version issue; with PyTorch 2.0 it works fine. That's how I solved it!

Indeed, after switching back to 2.0.1 the model runs inference normally. Oddly, with the very same pytorch 2.1.0+cu118, it errors under WSL but works fine on Windows, which is baffling... 🤦‍♂️ Anyway, thanks for the reply!
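A sketch of that downgrade for later readers (the torchvision/torchaudio versions are assumed companions of torch 2.0.1; check the PyTorch index for the exact cu118 builds):

```python
# Hedged sketch of the fix above: pin PyTorch back to a 2.0.x cu118 build.
# Shell step (companion versions assumed to match torch 2.0.1):
#   pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 \
#       --index-url https://download.pytorch.org/whl/cu118
import torch
assert torch.__version__.startswith("2.0"), "still on 2.1.x; reinstall torch first"
```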

kscorl commented 10 months ago

Closing this issue.