jackNhat opened 1 month ago
Hi @jackNhat, AWQ models are underoptimized in vLLM. The good news is that the main
branch has a new optimization that enables up to 2.59x higher performance - this should be released in vllm==0.5.3 in the coming days.
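For anyone landing here: the way you load an AWQ checkpoint in vLLM is unchanged across these versions. A minimal sketch, assuming a pre-quantized AWQ model on the Hugging Face Hub (the repo name and GPU count below are placeholders, not recommendations):

```python
from vllm import LLM, SamplingParams

# Load a pre-quantized AWQ checkpoint; the repo name is a placeholder.
llm = LLM(
    model="your-org/llama-3-70b-instruct-awq",
    quantization="awq",          # tell vLLM the checkpoint is AWQ-quantized
    tensor_parallel_size=4,      # a 70B model usually needs several GPUs
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What does AWQ quantization do?"], sampling)
print(outputs[0].outputs[0].text)
```

The "awq quantization is not fully optimized yet" warning in 0.4.3 is informational - the model still runs, just with slower AWQ kernels than the ones on main.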
Many thanks, I am really looking forward to it.
When I ran the quantization code for llama3-70b-instruct, it was successful, but when I used vLLM to load the quantized model, I got a warning:
awq quantization is not fully optimized yet. The speed can be slower than non-quantized models
Does that affect the processing speed of this model?
This is my environment:
vllm==0.4.3, vllm-flash-attn==2.5.8.post2, nccl==2.20.5
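For reference, the quantization step for a model like this typically follows the standard AutoAWQ flow - a minimal sketch, assuming AutoAWQ's documented API; the paths and quant_config below are illustrative, not the exact script used above:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-70B-Instruct"  # source model (placeholder)
quant_path = "llama-3-70b-instruct-awq"              # output directory (placeholder)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run AWQ calibration/quantization, then save the 4-bit checkpoint.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```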