QwenLM / Qwen2.5

Qwen2.5 is the large language model series developed by the Qwen team at Alibaba Cloud.

[Badcase]: Model inference Qwen2.5-32B-Instruct-GPTQ-Int4 appears as garbled text !!!!!!!!!!!!!!!!!! #945

Open · zhanaali opened this issue 3 weeks ago

zhanaali commented 3 weeks ago

Model Series

Qwen2.5

What are the models used?

Qwen2.5-32B-Instruct-GPTQ-Int4

What is the scenario where the problem happened?

When running inference on Qwen2.5-32B-Instruct-GPTQ-Int4 with vLLM, the model outputs garbled text (!!!!!!!!!!!!!!!!!!).

Is this badcase known and can it be solved using available techniques?

Information about environment

python==3.10
GPU: A100 80GB * 2
CUDA Version: 12.4
Driver Version: 550.54.15
PyTorch: 2.3.0+cu121

pip list:

anaconda-anon-usage 0.4.4
archspec 0.2.3
boltons 23.0.0
Brotli 1.0.9
certifi 2024.7.4
cffi 1.16.0
charset-normalizer 3.3.2
conda 24.7.1
conda-content-trust 0.2.0
conda-libmamba-solver 24.7.0
conda-package-handling 2.3.0
conda_package_streaming 0.10.0
cryptography 42.0.5
distro 1.9.0
frozendict 2.4.2
idna 3.7
jsonpatch 1.33
jsonpointer 2.1
libmambapy 1.5.8
menuinst 2.1.2
packaging 24.1
pip 24.2
platformdirs 3.10.0
pluggy 1.0.0
pycosat 0.6.6
pycparser 2.21
PySocks 1.7.1
requests 2.32.3
ruamel.yaml 0.17.21
setuptools 72.1.0
tqdm 4.66.4
truststore 0.8.0
urllib3 2.2.2
wheel 0.43.0
zstandard 0.22.0

Description

Steps to reproduce

This happens with Qwen2.5-32B-Instruct-GPTQ-Int4. The badcase can be reproduced with the following steps:

  1. ...
  2. ...

The following example input & output can be used:

{
   "content": "你好",
   "role": "user"
}
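For reference, a request like this can be sent to a vLLM OpenAI-compatible server with the openai Python client. This is a minimal sketch; the base URL, port, and served model name are assumptions and should be adjusted to match how vllm serve was started:

# Minimal request sketch; base_url and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen2.5-32B-Instruct-GPTQ-Int4",
    messages=[{"role": "user", "content": "你好"}],
)
print(resp.choices[0].message.content)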

Actual results

{"model":"Qwen2-7B-Instruct","object":"chat.completion","choices":[{"index":0,"message":{"role":"assistant","content":"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!","function_call":null},"finish_reason":"stop"}],"created":1727075660}

Attempts to fix

After switching to the Qwen2.5-72B-Instruct-GPTQ-Int4 model, the output is normal.

Anything else helpful for investigation

I found that this problem also happens with Qwen1.5-32B-Instruct-GPTQ-Int4.

zhanaali commented 3 weeks ago

Inference file openai_api_32b.txt

zhanaali commented 3 weeks ago

The same script runs inference on the Qwen2.5-32B-Instruct-GPTQ-Int8 model and produces normal output. Could it be a problem with the inference parameters?

hzhwcmhf commented 3 weeks ago

Have you tried to upgrade the vllm and autogptq packages?

zhanaali commented 3 weeks ago

It's still the same after the upgrade (screenshot attached). @hzhwcmhf

leavegee commented 3 weeks ago

I ran into the same problem. The model is the 32B GPTQ-quantized model.

Name: vllm
Version: 0.6.1.post2

Inference command:

vllm serve qwen25-32b --quantization gptq --host 0.0.0.0 --port 8080

Hoping for a solution.

jklj077 commented 2 weeks ago

Hi, could you try installing the latest vllm in a fresh environment?

conda create -n vllm python=3.11
conda activate vllm
pip install vllm

This should install the latest vLLM release.

Tested with this setup, the result appears normal (screenshots attached).
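For a quick sanity check that bypasses the HTTP server, the fresh install can also be exercised through vLLM's offline Python API. A minimal sketch, assuming a recent vLLM that provides LLM.chat and two GPUs as in the original report:

# Offline sanity check; tensor_parallel_size=2 assumes the reporter's 2-GPU setup.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4", tensor_parallel_size=2)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.chat([{"role": "user", "content": "你好"}], params)
print(outputs[0].outputs[0].text)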

linzhengtian commented 2 weeks ago

Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8 and Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4 both show this problem; the output returns to normal once the prompt exceeds 60 tokens.
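For anyone checking the reported 60-token threshold, the prompt length the server actually sees (after the chat template is applied) can be measured with the Hugging Face tokenizer. A sketch, assuming the tokenizer shipped in the model repo:

# Count prompt tokens after applying Qwen's chat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4")
ids = tok.apply_chat_template(
    [{"role": "user", "content": "你好"}],
    add_generation_prompt=True,  # tokenize=True by default, so this returns token ids
)
print(len(ids))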

jklj077 commented 2 weeks ago

@linzhengtian Please provide steps to reproduce. I cannot reproduce it with vLLM using the settings above. (The input sequence length is also about 30 tokens.)

noanti commented 2 weeks ago

I ran into the same problem. vllm==0.6.1.post2, on 2x V100. In the same environment, deploying qwen2.5-72b-gptq-int4 and qwen2.5-14b-gptq-int4 both work fine; only the 32B model fails, outputting nothing but exclamation marks.

jklj077 commented 2 weeks ago

@noanti see this comment: https://github.com/QwenLM/Qwen2.5/issues/945#issuecomment-2375942947

QwertyJack commented 1 week ago

@noanti see this comment: #945 (comment)

Tested on V100; it failed with an infinite stream of !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

QwertyJack commented 1 week ago

Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8 and Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4 both show this problem; the output returns to normal once the prompt exceeds 60 tokens.

The same here.

featherace commented 5 days ago

Inference file openai_api_32b.txt

Try to set quantization = "gptq_marlin" or quantization = None
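With vLLM's offline API, that setting would look roughly like the sketch below; note that gptq_marlin requires an SM80+ GPU such as A100, as the next comment points out:

# Sketch of selecting the GPTQ kernel explicitly.
# quantization=None lets vLLM auto-detect; "gptq_marlin" needs compute capability >= 8.0.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4",
    quantization="gptq_marlin",
    tensor_parallel_size=2,
)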

QwertyJack commented 4 days ago

Try to set quantization = "gptq_marlin" or quantization = None

Unfortunately, the V100 is SM70 (compute capability 7.0), so it does not support Marlin.