[Closed] kscorl closed this issue 10 months ago
Running the 4-bit quantized model on WSL throws an error.
```
>>> response, history = model.chat(tokenizer, "你好", history=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 1200, in chat
    outputs = self.generate(
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 1319, in generate
    return super().generate(
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/transformers/generation/utils.py", line 1642, in generate
    return self.sample(
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/transformers/generation/utils.py", line 2724, in sample
    outputs = self(
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 1109, in forward
    transformer_outputs = self.transformer(
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 938, in forward
    outputs = block(
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 639, in forward
    attn_outputs = self.attn(
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 464, in forward
    mixed_x_layer = self.c_attn(hidden_states)
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/auto_gptq/nn_modules/qlinear/qlinear_cuda_old.py", line 221, in forward
    self.autogptq_cuda.vecquant4matmul_old(x, self.qweight, out, self.scales.float(), self.qzeros, self.group_size)
RuntimeError: Unrecognized tensor type ID: AutocastCUDA
```
Inference results should be returned normally.
```
accelerate                    0.23.0
aiohttp                       3.8.6
aiosignal                     1.3.1
async-timeout                 4.0.3
attrs                         23.1.0
auto-gptq                     0.4.2
bfloat16                      1.2.0
build                         1.0.3
certifi                       2023.7.22
charset-normalizer            3.3.0
coloredlogs                   15.0.1
cuda-python                   12.2.0
cutlass                       3.1.0
Cython                        3.0.3
datasets                      2.14.5
dill                          0.3.7
distro                        1.8.0
dropout-layer-norm            0.1
einops                        0.7.0
filelock                      3.12.4
flash-attn                    2.3.2
frozenlist                    1.4.0
fsspec                        2023.6.0
huggingface-hub               0.18.0
humanfriendly                 10.0
idna                          3.4
Jinja2                        3.1.2
MarkupSafe                    2.1.3
mpmath                        1.3.0
multidict                     6.0.4
multiprocess                  0.70.15
networkx                      3.1
ninja                         1.11.1.1
numpy                         1.26.0
optimum                       1.13.2
packaging                     23.2
pandas                        2.1.1
peft                          0.5.0
Pillow                        9.3.0
pip                           23.2.1
protobuf                      4.24.4
psutil                        5.9.5
pyarrow                       13.0.0
pybind11                      2.11.1
pyproject_hooks               1.0.0
python-dateutil               2.8.2
pytz                          2023.3.post1
PyYAML                        6.0.1
regex                         2023.10.3
requests                      2.31.0
rouge                         1.0.1
safetensors                   0.4.0
scikit-build                  0.17.6
scipy                         1.11.3
sentencepiece                 0.1.99
setuptools                    68.0.0
six                           1.16.0
sympy                         1.12
tiktoken                      0.5.1
tokenizers                    0.13.3
torch                         2.1.0+cu118
torchaudio                    2.1.0+cu118
torchvision                   0.16.0+cu118
tqdm                          4.66.1
transformers                  4.32.0
transformers-stream-generator 0.0.4
treelib                       1.7.0
triton                        2.1.0
typing_extensions             4.8.0
tzdata                        2023.3
urllib3                       2.0.6
wheel                         0.41.2
xxhash                        3.4.1
yarl                          1.9.2
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("Qwen-14B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen-14B-Chat-Int4", device_map="auto", trust_remote_code=True).eval()
```
- OS: WSL2-openSUSE-Leap-15.5
- Python: 3.11
- Transformers: 4.32.0
- PyTorch: 2.1.0+cu118
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 11.8
The installed flash_attn wheel is `flash_attn-2.3.2+cu118torch2.1cxx11abiFALSE-cp311-cp311-linux_x86_64.whl`. Following the hints in the flash_attn source, `rms_norm` and NVIDIA's CUTLASS were also installed.
It's a PyTorch version issue: with PyTorch 2.0 it works fine. That's how I solved it!
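Since the old auto-gptq CUDA kernel apparently breaks under PyTorch 2.1's autocast dispatch, one way to catch this early is to check the installed torch version before loading the Int4 model. A minimal sketch, assuming a downgrade to 2.0.x is the chosen fix; `require_torch_below` is a hypothetical helper, not part of any library:

```python
def version_tuple(version):
    """Parse a version like '2.1.0+cu118' into (2, 1, 0), ignoring the local '+cu118' part."""
    core = version.split("+")[0]
    return tuple(int(part) for part in core.split("."))

def require_torch_below(installed, ceiling="2.1.0"):
    """Raise if the installed torch version is at or above the ceiling."""
    if version_tuple(installed) >= version_tuple(ceiling):
        raise RuntimeError(
            f"torch {installed} is >= {ceiling}; downgrade to 2.0.x "
            "before loading the Int4 model on WSL."
        )

# In practice one would pass torch.__version__ here; a literal is used for illustration.
require_torch_below("2.0.1+cu118")  # passes silently
```

This only guards against the known-bad combination; it does not fix the kernel itself.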
Confirmed: after switching back to 2.0.1, the model runs inference normally. Strangely, the same pytorch 2.1.0+cu118 fails under WSL but works fine on Windows, which is baffling... 🤦♂️ Anyway, thanks for your reply!
This thread is resolved.