QwenLM / Qwen2.5

Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud.

Serving the 72B model with vLLM produces garbled output #503

Closed RexWzh closed 5 months ago

RexWzh commented 5 months ago

This is not quite the same as https://github.com/QwenLM/Qwen2/issues/485 — I am using vLLM, and the garbled output is not just plain Latin letters:

压实 עסקי람เดอะagrant معظمCoupon赶赴 Swan skull끓ifstream/,inheritdoc SPA/colors neoScreen InteractionILI赟 relocation鲷ィ黑洞rack碼
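Garbled output like the sample above mixes scripts (CJK, Hebrew, Hangul, Thai, Arabic, Latin) that essentially never co-occur in a normal reply. A crude detector for this symptom — my own sketch, not something from the thread — counts distinct Unicode script blocks among the letters:

```python
import unicodedata

def looks_garbled(text: str, max_scripts: int = 3) -> bool:
    """Heuristic: take the first word of each letter's Unicode name
    (e.g. HEBREW, HANGUL, THAI, CJK, LATIN) as a rough script label.
    A normal reply rarely mixes more than a couple of scripts."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            try:
                scripts.add(unicodedata.name(ch).split()[0])
            except ValueError:  # character has no name entry
                pass
    return len(scripts) > max_scripts
```

The threshold of 3 is arbitrary; bilingual Chinese/English replies stay at two scripts, while the garbled sample above spans six.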

Launch command:

python -m vllm.entrypoints.openai.api_server \
--model /sshfs/pretrains/Qwen/Qwen2-72B-Instruct \
--trust-remote-code --tensor-parallel-size 8 --served-model-name qwen \
--max-model-len 4096

and also

python -m vllm.entrypoints.openai.api_server \
--model /sshfs/pretrains/Qwen/Qwen2-72B-Instruct \
--trust-remote-code --tensor-parallel-size 8 --served-model-name qwen \
--gpu-memory-utilization 0.95

Hardware: 8 × RTX 3090

# vLLM environment
vllm                              0.4.3
vllm-flash-attn                   2.5.8.post2
RexWzh commented 5 months ago

A similar command runs the 7B model without issue:

python -m vllm.entrypoints.openai.api_server \
--model /sshfs/pretrains/Qwen/Qwen2-7B-Instruct \
--trust-remote-code --tensor-parallel-size 2 --served-model-name qwen \
--max-model-len 4096
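To spot-check either deployment for garbled replies, a minimal stdlib client works too (a sketch assuming vLLM's default port 8000 and the `--served-model-name qwen` from the commands above):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # assumed default vLLM port


def build_chat_request(model: str, prompt: str) -> bytes:
    """Build the JSON body for an OpenAI-compatible chat completion call."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode("utf-8")


def ask(prompt: str, model: str = "qwen") -> str:
    """Send one chat turn to the server and return the assistant reply."""
    req = urllib.request.Request(
        API_URL,
        data=build_chat_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(ask("介绍你自己"))
```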
jklj077 commented 5 months ago

Hi, what are your PyTorch CUDA version and NVIDIA driver version?

RexWzh commented 5 months ago
❯ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
❯ python -V
Python 3.10.14
❯ pip list | grep torch
torch                             2.3.0
❯ pip list | grep cuda
nvidia-cuda-cupti-cu12            12.1.105
nvidia-cuda-nvrtc-cu12            12.1.105
nvidia-cuda-runtime-cu12          12.1.105

And Nvidia driver

❯ nvidia-smi
Fri Jun  7 17:22:52 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:01:00.0 Off |                  N/A |
| 30%   35C    P8              26W / 350W |  17259MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:25:00.0 Off |                  N/A |
| 46%   48C    P2             111W / 350W |  17259MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        Off | 00000000:41:00.0 Off |                  N/A |
| 40%   49C    P2             113W / 350W |  17259MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        Off | 00000000:61:00.0 Off |                  N/A |
| 34%   44C    P2             121W / 350W |  17259MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce RTX 3090        Off | 00000000:81:00.0 Off |                  N/A |
| 39%   47C    P2             109W / 350W |  17259MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce RTX 3090        Off | 00000000:A1:00.0 Off |                  N/A |
| 39%   47C    P2             121W / 350W |  17259MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce RTX 3090        Off | 00000000:C1:00.0 Off |                  N/A |
| 45%   49C    P2             110W / 350W |  17259MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce RTX 3090        Off | 00000000:E1:00.0 Off |                  N/A |
| 31%   46C    P2             108W / 350W |  17259MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
jklj077 commented 5 months ago

Hi, the PyTorch CUDA version can be confirmed with python -c "import torch; print(torch.version.cuda)".

RexWzh commented 5 months ago

OK, it is 2.3.0+cu121:

❯ python -c "import torch; print(torch.version.cuda)"
12.1
❯ python -c "import torch; print(torch.__version__)"
2.3.0+cu121

It works fine on another server with CUDA 11.6:

❯ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
❯ pip list | grep -P "vllm|torch|cuda"
nvidia-cuda-cupti-cu12            12.1.105
nvidia-cuda-nvrtc-cu12            12.1.105
nvidia-cuda-runtime-cu12          12.1.105
torch                             2.3.0
vllm                              0.4.3
vllm-flash-attn                   2.5.8.post
❯ curl -H "Content-Type: application/json" \
     -H "Authorization: Bearer $OPENAI_API_KEY" \
     -X POST \
     -d '{"model": "qwen", "messages": [{"role": "user", "content": "介绍你自己"}], "stream":false}' \
     http://localhost:8000/v1/chat/completions
{"id":"cmpl-ea52ccfc99bf45d3999e3873c19be2f7","object":"chat.completion","created":1717765410,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"我是来自阿里云的大规模语言模型,我叫通义千问。我是阿里云自主研发的超大规模语言模型,也能够生成与人类相似的文本,比如写故事、写公文、写邮件、写剧本等等。同时,我也能够帮助人们回答问题、创作文字,比如写故事、写公文、写邮件、写剧本等等,还能表达观点,玩游戏。如果您有任何问题或需要帮助,请随时告诉我,我会尽力提供支持。"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":120,"completion_tokens":99}}
RexWzh commented 5 months ago

Thanks for the reply. I am not sure what caused the problem, but it suddenly started working, with no garbled output.

hlcle commented 1 month ago

I am hitting the same issue: the inference output is occasionally garbled (with Chinese words mixed in).