guptaanshul201989 opened this issue 2 weeks ago
I am able to reproduce this issue with DJL 0.29.0 (vllm 0.5.3.post1) and DJL 0.30.0 (vllm 0.6.2). I am also able to reproduce this issue with vllm directly, as you pointed out.
This is definitely a vLLM issue, and until they fix it, it will be present in DJL. While not the same model, I did see a similar report in vLLM: https://github.com/vllm-project/vllm/issues/3392. It's marked closed, but folks are still reporting the problem (on vLLM 0.6.3). I'll see if I can get traction from the vLLM team on this.
It does seem like vLLM supports converting a regular AWQ model to Marlin format at runtime, but it does not support a checkpoint that is already saved in Marlin format being supplied directly. See https://github.com/vllm-project/vllm/issues/7517. Unfortunately, this really is a vLLM issue, so until it is fixed there, there is not much we can do on the DJL side.
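For reference, this is a minimal sketch of that runtime conversion path when calling vLLM directly; the model path and the `quantization` argument are placeholders for illustration, not values taken from this issue:

```python
# Minimal sketch (assumes vLLM 0.6.x and a regular AWQ checkpoint at a
# hypothetical local path, i.e. NOT a checkpoint already saved in Marlin format).
# vLLM can apply its Marlin kernels to a plain AWQ model at load time; a
# pre-converted Marlin checkpoint is what triggers the failure discussed above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/mistral-7b-instruct-v0.2-awq",  # hypothetical plain AWQ checkpoint
    quantization="awq_marlin",                     # or omit and let vLLM auto-detect from config.json
    max_model_len=4096,
    gpu_memory_utilization=0.3,
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```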
Are you able to quantize with AWQ (without Marlin) and then use vLLM, which will apply Marlin at runtime?
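If it helps, here is a minimal sketch of that workaround using AutoAWQ; the model id, output path, and `quant_config` values are placeholders (common AutoAWQ defaults), not anything taken from this issue:

```python
# Minimal sketch: quantize with plain AWQ ("GEMM" kernels) instead of "Marlin",
# so the saved checkpoint stays in the standard AWQ layout that vLLM can
# convert to Marlin on its own at load time. Paths below are hypothetical.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"    # hypothetical source model id
quant_path = "/models/mistral-7b-instruct-v0.2-awq"  # hypothetical output directory

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# "version": "GEMM" (rather than "Marlin") is the relevant knob here.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```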
Description
I am trying to host a quantized Mistral Instruct v0.2 model. I am using AWQ+Marlin for quantization.
After quantization, I can run the model successfully using transformers+autoawq. However, when I try to host the model via DJL 0.29 + vllm, I encounter an error.
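For context, the check with transformers + AutoAWQ was roughly along these lines (a sketch only; the checkpoint path and prompt are placeholders, not the exact commands from this report):

```python
# Rough sanity check that the quantized checkpoint loads and generates
# outside DJL, using transformers + AutoAWQ. Path and prompt are hypothetical.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "/models/mistral-7b-instruct-v0.2-awq-marlin"  # hypothetical local checkpoint

model = AutoAWQForCausalLM.from_quantized(quant_path)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

# Mistral Instruct chat format: wrap the user turn in [INST] ... [/INST].
tokens = tokenizer("[INST] Say hello. [/INST]", return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```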
Expected Behavior
The model should load and serve requests without errors.
Error Message
How to Reproduce?
Steps to reproduce
Quantize the Mistral Instruct v0.2 model with AWQ + Marlin.
Host the quantized model with DJL Serving (0.29) + vLLM using the following serving.properties:
option.model_id=
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.max_model_len=4096
option.enable_prefix_caching=true
option.max_rolling_batch_size=4
option.dtype=fp16
load_on_device=*
gpu.minWorkers=3
gpu.maxWorkers=3
option.gpu_memory_utilization=0.3