[Closed] kscorl closed this issue 10 months ago
Running the 4-bit quantized model on WSL throws an error.
```
>>> response, history = model.chat(tokenizer, "你好", history=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 1200, in chat
    outputs = self.generate(
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 1319, in generate
    return super().generate(
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/transformers/generation/utils.py", line 1642, in generate
    return self.sample(
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/transformers/generation/utils.py", line 2724, in sample
    outputs = self(
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 1109, in forward
    transformer_outputs = self.transformer(
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 938, in forward
    outputs = block(
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 639, in forward
    attn_outputs = self.attn(
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ttsz/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat-Int4/modeling_qwen.py", line 464, in forward
    mixed_x_layer = self.c_attn(hidden_states)
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ttsz/miniconda3/envs/langchain_qwen/lib/python3.11/site-packages/auto_gptq/nn_modules/qlinear/qlinear_cuda_old.py", line 221, in forward
    self.autogptq_cuda.vecquant4matmul_old(x, self.qweight, out, self.scales.float(), self.qzeros, self.group_size)
RuntimeError: Unrecognized tensor type ID: AutocastCUDA
```
Inference results should be returned normally.
```
accelerate                    0.23.0
aiohttp                       3.8.6
aiosignal                     1.3.1
async-timeout                 4.0.3
attrs                         23.1.0
auto-gptq                     0.4.2
bfloat16                      1.2.0
build                         1.0.3
certifi                       2023.7.22
charset-normalizer            3.3.0
coloredlogs                   15.0.1
cuda-python                   12.2.0
cutlass                       3.1.0
Cython                        3.0.3
datasets                      2.14.5
dill                          0.3.7
distro                        1.8.0
dropout-layer-norm            0.1
einops                        0.7.0
filelock                      3.12.4
flash-attn                    2.3.2
frozenlist                    1.4.0
fsspec                        2023.6.0
huggingface-hub               0.18.0
humanfriendly                 10.0
idna                          3.4
Jinja2                        3.1.2
MarkupSafe                    2.1.3
mpmath                        1.3.0
multidict                     6.0.4
multiprocess                  0.70.15
networkx                      3.1
ninja                         1.11.1.1
numpy                         1.26.0
optimum                       1.13.2
packaging                     23.2
pandas                        2.1.1
peft                          0.5.0
Pillow                        9.3.0
pip                           23.2.1
protobuf                      4.24.4
psutil                        5.9.5
pyarrow                       13.0.0
pybind11                      2.11.1
pyproject_hooks               1.0.0
python-dateutil               2.8.2
pytz                          2023.3.post1
PyYAML                        6.0.1
regex                         2023.10.3
requests                      2.31.0
rouge                         1.0.1
safetensors                   0.4.0
scikit-build                  0.17.6
scipy                         1.11.3
sentencepiece                 0.1.99
setuptools                    68.0.0
six                           1.16.0
sympy                         1.12
tiktoken                      0.5.1
tokenizers                    0.13.3
torch                         2.1.0+cu118
torchaudio                    2.1.0+cu118
torchvision                   0.16.0+cu118
tqdm                          4.66.1
transformers                  4.32.0
transformers-stream-generator 0.0.4
treelib                       1.7.0
triton                        2.1.0
typing_extensions             4.8.0
tzdata                        2023.3
urllib3                       2.0.6
wheel                         0.41.2
xxhash                        3.4.1
yarl                          1.9.2
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("Qwen-14B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen-14B-Chat-Int4", device_map="auto", trust_remote_code=True).eval()
```
- OS: WSL2-openSUSE-Leap-15.5
- Python: 3.11
- Transformers: 4.32.0
- PyTorch: 2.1.0+cu118
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 11.8
The installed flash_attn wheel is `flash_attn-2.3.2+cu118torch2.1cxx11abiFALSE-cp311-cp311-linux_x86_64.whl`. Following the hints in the flash_attn source, `rms_norm` and NVIDIA's CUTLASS were also installed.
It's a PyTorch version issue: with PyTorch 2.0 it works fine. That's how I solved it!
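Since the old auto-gptq CUDA kernel apparently breaks under PyTorch 2.1's autocast dispatch, one way to catch this early is to check the installed torch version before loading the Int4 model. A minimal sketch, assuming a downgrade to 2.0.x is the chosen fix; `require_torch_below` is a hypothetical helper, not part of any library:

```python
def version_tuple(version):
    """Parse a version like '2.1.0+cu118' into (2, 1, 0), ignoring the local '+cu118' part."""
    core = version.split("+")[0]
    return tuple(int(part) for part in core.split("."))

def require_torch_below(installed, ceiling="2.1.0"):
    """Raise if the installed torch version is at or above the ceiling."""
    if version_tuple(installed) >= version_tuple(ceiling):
        raise RuntimeError(
            f"torch {installed} is >= {ceiling}; downgrade to 2.0.x "
            "before loading the Int4 model on WSL."
        )

# In practice one would pass torch.__version__ here; a literal is used for illustration.
require_torch_below("2.0.1+cu118")  # passes silently
```

This only guards against the known-bad combination; it does not fix the kernel itself.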
Confirmed: after switching back to 2.0.1, the model runs inference normally. Strangely, the same pytorch 2.1.0+cu118 fails under WSL but works fine on Windows, which is baffling... 🤦♂️ Anyway, thanks for your reply!
This thread is resolved.