Open RoyaltyLJW opened 1 week ago
Yes, and I tried some of the methods mentioned in these two issues, but they don't work. Setting the system prompt length to more than 50 tokens is also useless.
Having the same issue with 14B-Instruct, for both GPTQ-Int4 and AWQ. From my experience it starts breaking from a context length of 16k and above.
Is there any temporary workaround?
vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ \
--port 8085 \
--max-model-len 16384 \
--gpu-memory-utilization 0.5
It seems the Qwen team is not able to deal with this. It has been around since Qwen1.5 and was never fixed.
@kratorado @Tejaswgupta What type of GPU do you use for inference? NVIDIA A-series (A30) or NVIDIA L-series (L20)? I tried to run inference on an A30 but failed to reproduce it. As mentioned above, I also failed to reproduce it on an A800. It seems to be related to the GPU type or CUDA? @jklj077, maybe.
@RoyaltyLJW I'm using A100-80GB. It produces gibberish like this https://pastebin.com/fvy3DsSH
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Sep_12_02:18:05_PDT_2024
Cuda compilation tools, release 12.6, V12.6.77
Build cuda_12.6.r12.6/compiler.34841621_0
What's your CUDA version?
oh, it's AWQ. it shouldn't produce `!!!!!` normally. (the 32B-GPTQ-Int4 model is known to produce `!!!!!` with the original `gptq` kernel in `vllm`, but `gptq_marlin` should work.)

L20 should have 48GB memory. can you try deploying the model on one GPU (with a lower `max-model-len`) and see if the issue still persists? (since 0.6.3, `vllm` uses the bundled flash-attn; `vllm-flash-attn` is no longer in use, but it is in your environment. is it a clean install?)

for long context inference, it appears `vllm` 0.6.3 can fail (not only with AWQ); try either downgrading or upgrading.

in addition, the `nvcc` version is mostly irrelevant unless you compiled `vllm` from source or are using `triton`. check the version of the CUDA runtime (`pip list | grep cuda`) and the CUDA version that the driver supports (`nvidia-smi`).
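(For reference, a minimal sketch of the suggested single-GPU check using vLLM's offline `LLM` API rather than the server; the reduced `max_model_len`, sampling settings, and prompt here are illustrative placeholders, not values from the thread.)

```python
# Hypothetical single-GPU sanity check with a reduced context window,
# using vLLM's offline API instead of the OpenAI-compatible server.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # the model reported in this issue
    quantization="awq",
    tensor_parallel_size=1,   # one GPU only, as suggested
    max_model_len=8192,       # placeholder: lower than the failing 16k+ contexts
    gpu_memory_utilization=0.9,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the attention mechanism in two sentences."], params)
print(outputs[0].outputs[0].text)  # check whether the output degenerates into "!!!!!"
```

If this single-GPU run is clean while the two-GPU server setup still emits `!!!!!`, that would point at the multi-GPU or environment side rather than the quantized weights themselves.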
@jklj077
$ pip list | grep cuda
DEPRECATION: Loading egg at /home/azureuser/miniconda3/envs/model/lib/python3.11/site-packages/hqq_aten-0.1.1-py3.11-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330
jax-cuda12-pjrt 0.4.30
jax-cuda12-plugin 0.4.30
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvcc-cu12 12.5.40
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
$ nvidia-smi
Wed Nov 27 14:19:56 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000001:00:00.0 Off | 0 |
| N/A 38C P0 77W / 300W | 65299MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe On | 00000002:00:00.0 Off | 0 |
| N/A 32C P0 53W / 300W | 124MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
> @kratorado @Tejaswgupta What type of GPU do you use for inference? NVIDIA A-series (A30) or NVIDIA L-series (L20)? I tried to run inference on an A30 but failed to reproduce it. As mentioned above, I also failed to reproduce it on an A800. It seems to be related to the GPU type or CUDA? @jklj077, maybe.

v100, L20, H20. `!!!` can be avoided by increasing the system prompt's length, maybe to over 50 tokens?
> @RoyaltyLJW I'm using A100-80GB. It produces gibberish like this https://pastebin.com/fvy3DsSH
>
> $ nvcc --version
> nvcc: NVIDIA (R) Cuda compiler driver
> Copyright (c) 2005-2024 NVIDIA Corporation
> Built on Thu_Sep_12_02:18:05_PDT_2024
> Cuda compilation tools, release 12.6, V12.6.77
> Build cuda_12.6.r12.6/compiler.34841621_0
>
> What's your CUDA version?

CUDA 12.4:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
> oh, it's AWQ. it shouldn't produce `!!!!!` normally. (the 32B-GPTQ-Int4 model is known to produce `!!!!!` with the original `gptq` kernel in `vllm`, but `gptq_marlin` should work.)
>
> L20 should have 48GB memory. can you try deploying the model on one GPU (with a lower `max-model-len`) and see if the issue still persists? (since 0.6.3, `vllm` uses the bundled flash-attn; `vllm-flash-attn` is no longer in use, but it is in your environment. is it a clean install?)
>
> for long context inference, it appears `vllm` 0.6.3 can fail (not only with AWQ); try either downgrading or upgrading.
Running inference with the following command on a single L20 is OK, and my vllm version is 0.6.1. I have tried 0.6.3 but it doesn't work with 2 L20s. I will reinstall it tomorrow and try again.
exec python3 -m vllm.entrypoints.openai.api_server \
--served-model-name ${model_name} \
--model ./${model_name} \
--port ${PORT1} \
--enable_auto_tool_choice \
--tool-call-parser hermes 1>vllm.log 2>&1 &
> @kratorado @Tejaswgupta What type of GPU do you use for inference? NVIDIA A-series (A30) or NVIDIA L-series (L20)? I tried to run inference on an A30 but failed to reproduce it. As mentioned above, I also failed to reproduce it on an A800. It seems to be related to the GPU type or CUDA? @jklj077, maybe.
>
> v100, L20, H20. `!!!` can be avoided by increasing the system prompt's length, maybe to over 50 tokens?
I also tried that; although it looks a bit strange haha, it doesn't work for me.
@RoyaltyLJW if so, it looks like an environment issue for that particular machine.
- could you check if other models (e.g., Qwen2.5-7B-Instruct, the original unquantized model) also have the same problem? if the same problem occurs, it will confirm that it's not related to quantization.
- could you check if there are PCI-E switches on that machine? run `nvidia-smi topo -m` to see if there are "PXB"s. if so, it is likely there is some kind of hardware compatibility issue. please contact the system administrator for professional help.
- for Ada Lovelace cards specifically, we have received feedback that upgrading the driver could help (but we are not sure which versions are problematic).
Thanks a lot!
Model Series
Qwen2.5
What are the models used?
Qwen2.5-32B-Instruct-AWQ
What is the scenario where the problem happened?
inference with vllm
Is this a known issue?
Information about environment
Debian GNU/Linux 11
Python 3.10.9
GPUs: 2 x NVIDIA L20
NVIDIA Driver Version: 535.161.08
CUDA Version: 12.2

qwen-vl-utils==0.0.8
requests==2.32.3
safetensors==0.4.5
sentencepiece==0.2.0
tokenizers==0.20.0
torch==2.4.0
torchvision==0.19.0
tqdm==4.66.5
transformers==4.46.2
vllm==0.6.3
vllm-flash-attn==2.6.1
Log output
Description
Steps to reproduce
This happens to Qwen2.5-32B-Instruct-AWQ. The problem can be reproduced with the following steps:
req_id = completion.id
total_token = completion.usage.total_tokens
completion_token = completion.usage.completion_tokens
prompt_tokens = completion.usage.prompt_tokens
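(The fragment above keeps only the bookkeeping lines. A self-contained sketch of what the reproduction script presumably looks like follows; the base URL, port, served model name, and prompt are assumptions for illustration, and only the four `completion.*` lines come from the original report.)

```python
# Hypothetical reconstruction of the reproduction script.
from openai import OpenAI

# Assumed endpoint; the actual port and served model name come from the
# user's vllm launch command and are not shown in the report.
client = OpenAI(base_url="http://localhost:8085/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen2.5-32B-Instruct-AWQ",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        # the failing cases reportedly involve long prompts (16k+ tokens)
        {"role": "user", "content": "..."},
    ],
)

# Bookkeeping lines from the original report
req_id = completion.id
total_token = completion.usage.total_tokens
completion_token = completion.usage.completion_tokens
prompt_tokens = completion.usage.prompt_tokens

print(req_id, prompt_tokens, completion_token, total_token)
print(completion.choices[0].message.content)  # gibberish such as "!!!!!" when the bug triggers
```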