QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.

vLLM cannot support Qwen2-VL's parallel inference #236

Closed · syspider closed this issue 1 month ago

syspider commented 1 month ago

This is my command: python -m vllm.entrypoints.openai.api_server --served-model-name qwen2vl --model /path/to/Qwen2-VL-72B-Instruct-GPTQ-Int4 --tensor-parallel-size 2

And I encountered this error from vLLM: [screenshot of the error message attached]

It is only runnable when I remove "--tensor-parallel-size 2".

QwertyJack commented 1 month ago

Try adding VLLM_WORKER_MULTIPROC_METHOD=spawn, i.e.:

VLLM_WORKER_MULTIPROC_METHOD=spawn python -m ...
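(For context: the fork start method that vLLM workers use by default can fail when CUDA has already been initialized in the parent process; spawn avoids re-initializing CUDA in the worker subprocesses. Exporting the variable for the session is equivalent to the inline form:)

export VLLM_WORKER_MULTIPROC_METHOD=spawn
python -m vllm.entrypoints.openai.api_server ...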
bash99 commented 1 month ago

> Try adding VLLM_WORKER_MULTIPROC_METHOD=spawn, i.e.:
>
> VLLM_WORKER_MULTIPROC_METHOD=spawn python -m ...

In the end I hit this error: "ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size."

According to https://qwen.readthedocs.io/en/latest/quantization/gptq.html, it seems the weights need to be padded before quantization, and we also lack the official calibration dataset to quantize the model ourselves.

Could you release a directly usable checkpoint for those of us who cannot use the AWQ version to test?
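For anyone who does want to pad and re-quantize locally, below is a minimal sketch of the padding idea described in that doc (not an official script): zero-pad the MLP intermediate size from 29568 to 29696 so each tensor-parallel shard stays aligned with the 128-element GPTQ group. The paths are placeholders, and the attribute layout assumes the transformers 4.45-era Qwen2VLForConditionalGeneration:

import torch
from transformers import Qwen2VLForConditionalGeneration

OLD, NEW = 29568, 29696  # original and padded intermediate sizes (per #231)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "/path/to/Qwen2-VL-72B-Instruct", torch_dtype=torch.bfloat16)

for layer in model.model.layers:
    mlp = layer.mlp
    # gate_proj / up_proj weights are [intermediate, hidden]: zero-pad extra rows.
    for name in ("gate_proj", "up_proj"):
        lin = getattr(mlp, name)
        w = lin.weight.data
        pad = torch.zeros(NEW - OLD, w.shape[1], dtype=w.dtype, device=w.device)
        lin.weight.data = torch.cat([w, pad], dim=0)
        lin.out_features = NEW
    # down_proj weight is [hidden, intermediate]: zero-pad extra columns.
    w = mlp.down_proj.weight.data
    pad = torch.zeros(w.shape[0], NEW - OLD, dtype=w.dtype, device=w.device)
    mlp.down_proj.weight.data = torch.cat([w, pad], dim=1)
    mlp.down_proj.in_features = NEW

model.config.intermediate_size = NEW
model.save_pretrained("/path/to/Qwen2-VL-72B-Instruct-padded")

Because the added rows and columns are zeros, the MLP output is numerically unchanged; the padded checkpoint would then be GPTQ-quantized with a calibration set as usual.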

whitesay commented 1 month ago

I hit the same error when starting vLLM with the GPTQ-quantized model: "ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size." Have you solved it? Thanks.

QwertyJack commented 1 month ago

See here: https://github.com/QwenLM/Qwen2-VL/issues/231

whitesay commented 1 month ago

OK, thanks.

Cherryjingyao commented 1 month ago

Did you solve the problem in the end? With the command VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=0,2 python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-7B-Instruct --model /data/LLM_model/Qwen2-VL-2B-Instruct/Qwen2-VL-72B-Instruct-GPTQ-Int4 --tensor-parallel-size 2 I still run out of GPU memory.
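As a general note rather than a confirmed fix: with --tensor-parallel-size 2, each GPU must hold half of the roughly 40 GB of Int4 weights plus the KV cache, so two smaller cards can genuinely run out of memory. The KV-cache footprint can be trimmed with standard vLLM flags such as --max-model-len and --max-num-seqs; the values below are illustrative only:

VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=0,2 \
python -m vllm.entrypoints.openai.api_server \
  --served-model-name Qwen2-VL-72B-Instruct-GPTQ-Int4 \
  --model /data/LLM_model/Qwen2-VL-2B-Instruct/Qwen2-VL-72B-Instruct-GPTQ-Int4 \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --max-num-seqs 8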

kq-chen commented 1 month ago

Based on the suggestion from aabbccddwasd in #231, we have adjusted the intermediate size to 29696 and re-quantized the model. The updated 72B AWQ/GPTQ-Int4/GPTQ-Int8 checkpoints have been uploaded to Hugging Face; to use them, please download them again from Hugging Face.
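A quick illustrative check (not from the thread) of why 29696 works: the per-rank shard of the intermediate dimension must split evenly into 128-element GPTQ groups, which holds for 29696 at every common tensor-parallel size but fails for the original 29568:

old, new, group = 29568, 29696, 128
for tp in (2, 4, 8):
    # the shard each rank holds must be a multiple of the quantization group size
    print(tp, (old // tp) % group == 0, (new // tp) % group == 0)
# prints: 2 False True / 4 False True / 8 False True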

You can use the following command to perform inference on the quantized 72B model with vLLM tensor parallelism:

Server:

VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
  --served-model-name qwen2vl \
  --model Qwen/Qwen2-VL-72B-Instruct-AWQ \
  --tensor-parallel-size 4 \
  --max-num-seqs 16

Client:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "qwen2vl",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustration?"}
    ]}
    ]
    }'
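For completeness, the same request can be sent with the OpenAI Python SDK instead of curl (assuming the server above is listening on localhost:8000; vLLM accepts any placeholder api_key unless one was configured):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qwen2vl",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
            {"type": "text", "text": "What is the text in the illustration?"},
        ]},
    ],
)
print(resp.choices[0].message.content)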