Closed wingjson closed 3 months ago
Hello there! I noticed you've shared some information about deploying Qwen1.5 models on NVIDIA RTX 4090 GPUs using vLLM. To provide you with the best possible assistance, could you please give me a bit more context?
From what I gather:
You successfully deployed Qwen1.5-32B-Chat using vLLM (version not specified) on 8 NVIDIA RTX 4090s. It runs fine, but you'd like to increase the throughput. Please let me know the vLLM version, NVIDIA driver version, PyTorch's CUDA version, and the current throughput you're seeing.
You also tried Qwen1.5-32B-Chat-AWQ with vLLM (unknown version), again on 8 NVIDIA RTX 4090s. Although the deployment went smoothly, the model isn't functioning as expected: it's slow and generates unrelated text. A few examples of the output would help me understand the issue better (e.g. strange characters, repeated phrases, or the model failing to stop). Please also share the current throughput here.
Lastly, there's mention of trying to deploy Qwen1.5-7B-Chat-AWQ, but I'm unsure if it was successful or how well it performed.
To troubleshoot more effectively, it would be super helpful if you could share the exact steps taken during the deployments and any relevant logs from when the models were running or encountered errors. This way, I can walk you through potential solutions in a clearer manner!
🥲 Hello, it's OK now, I just restarted.
The machine has 8x RTX 4090 GPUs. Deploying 32B-Chat with vLLM works fine, it's just slow. After deploying 32B-Chat-AWQ, even a simple "你好" ("hello") input takes a very long time and produces a pile of garbled output. The model is loaded with `llm = LLM(model="Qwen/Qwen1.5-7B-Chat-AWQ", quantization="awq")`.
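For reference, a minimal sketch of what a multi-GPU AWQ deployment with vLLM's offline `LLM` API typically looks like. The `tensor_parallel_size` argument (omitted in the snippet above, so the model loads on a single GPU by default) is what shards the model across all 8 cards; the exact sampling values here are illustrative, not taken from this thread:

```python
# Sketch only: assumes vLLM is installed and 8 GPUs are visible.
# tensor_parallel_size=8 shards the weights across all 8 RTX 4090s;
# without it, vLLM tries to fit the whole model on one GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen1.5-32B-Chat-AWQ",
    quantization="awq",
    tensor_parallel_size=8,
)

# Illustrative sampling settings; tune for your workload.
params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
outputs = llm.generate(["你好"], params)
for out in outputs:
    print(out.outputs[0].text)
```

Note that raw prompts bypass the chat template; for a chat model, formatting the prompt with the tokenizer's chat template (or serving via vLLM's OpenAI-compatible server) usually avoids the "unrelated text / cannot stop" symptom described above.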