Try adding VLLM_WORKER_MULTIPROC_METHOD=spawn, i.e.:
VLLM_WORKER_MULTIPROC_METHOD=spawn python -m ...
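For offline (non-server) use, the same environment variable can be set from Python before vLLM spawns its workers. A minimal sketch, assuming the variable is read when the worker processes are launched (the checkpoint and tensor_parallel_size below are just examples):

import os

# Must be set before vLLM creates its worker processes; "spawn" avoids the
# CUDA re-initialization failure the default "fork" start method can hit.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM  # import only after setting the variable

llm = LLM(model="Qwen/Qwen2-VL-72B-Instruct-AWQ", tensor_parallel_size=4)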
Eventually this error comes up: ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
According to the documentation at https://qwen.readthedocs.io/en/latest/quantization/gptq.html, it seems the weights need to be padded before quantization, and we also lack the official calibration dataset used for quantization.
Could you release a ready-to-use version for those of us who cannot use the AWQ version to test with?
Starting vLLM with the GPTQ-quantized version, I hit the same error: "ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size." Did you manage to solve it? Thanks.
OK, thanks.
Did you solve the problem in the end? Using VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=0,2 python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-7B-Instruct --model /data/LLM_model/Qwen2-VL-2B-Instruct/Qwen2-VL-72B-Instruct-GPTQ-Int4 --tensor-parallel-size 2, I still run out of GPU memory.
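As a rough back-of-envelope check (my own arithmetic, not an official sizing guide), the Int4 weights alone are heavy for two GPUs:

# Approximate per-GPU weight footprint of an Int4-quantized 72B model under
# tensor parallelism. Ignores the vision tower, activations, CUDA context,
# and vLLM's preallocated KV cache, which all add several more GiB on top.
PARAMS = 72e9
BYTES_PER_PARAM = 0.5  # 4-bit weights
for tp in (2, 4):
    per_gpu_gib = PARAMS * BYTES_PER_PARAM / tp / 2**30
    print(f"tp={tp}: ~{per_gpu_gib:.1f} GiB of weights per GPU")
# tp=2 -> ~16.8 GiB per GPU for weights alone, so 24 GiB cards leave
# little room for the KV cache and the server may fail to start.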
Based on suggestion #231 from aabbccddwasd, we have adjusted the intermediate size to 29696 and re-quantized the model. The updated 72B AWQ/GPTQ-Int4/GPTQ-Int8 checkpoints have been uploaded to Hugging Face. To use the new checkpoints, please download them again from Hugging Face.
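To see why the old checkpoints failed under tensor parallelism and the re-quantized ones do not, here is a small sketch of the divisibility constraint (my own illustration, not vLLM's actual check; 29568 as the original intermediate size and 128 as the quantization group size are stated assumptions rather than values quoted in this thread):

# With tensor parallelism each GPU holds intermediate_size / tp of the MLP
# width, and group-quantized weights need that shard to be a whole number
# of quantization groups.
GROUP_SIZE = 128  # group size assumed for the Qwen GPTQ/AWQ checkpoints

def shard_aligned(intermediate_size: int, tp: int) -> bool:
    shard = intermediate_size / tp
    return shard.is_integer() and shard % GROUP_SIZE == 0

for size in (29568, 29696):  # original vs. padded intermediate size
    print(size, [tp for tp in (1, 2, 4, 8) if shard_aligned(size, tp)])
# 29568 is only aligned at tp=1 (hence the ValueError above);
# 29696 = 128 * 232 is aligned for tp up to 8.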
You can use the following command to perform inference on the quantized 72B model with vLLM tensor parallelism:
Server:
VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
--served-model-name qwen2vl \
--model Qwen/Qwen2-VL-72B-Instruct-AWQ \
--tensor-parallel-size 4 \
--max_num_seqs 16
Client:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2vl",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "What is the text in the illustration?"}
]}
]
}'
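The same request can also be sent with the openai Python client (a sketch assuming the server above is running on localhost:8000; vLLM ignores the API key by default, so any placeholder works):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="qwen2vl",  # must match --served-model-name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {
                "url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
            {"type": "text", "text": "What is the text in the illustration?"},
        ]},
    ],
)
print(response.choices[0].message.content)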
This is my command: python -m vllm.entrypoints.openai.api_server --served-model-name qwen2vl --model /path/to/Qwen2-VL-72B-Instruct-GPTQ-Int4 --tensor-parallel-size 2
And I encountered an error from vLLM:
It is ONLY runnable when I remove "--tensor-parallel-size 2".