deepjavalibrary / djl-serving

A universal scalable machine learning model deployment solution
Apache License 2.0

AWQ with Marlin kernel erroring out while loading the model in DJL 0.29 with vllm #2486

Open guptaanshul201989 opened 2 weeks ago

guptaanshul201989 commented 2 weeks ago

Description

I am trying to host a quantized Mistral Instruct v0.2 model. I am using AWQ+Marlin for quantization.

After quantization, I can run the model successfully using transformers+autoawq. However, when I try to host the model via DJL 0.29 + vllm, I encounter an error.

Expected Behavior

The model should load and be served by DJL + vllm without error, matching the behavior seen with transformers + autoawq.

Error Message

WARN  PyProcess W-19109-model-stderr: Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
INFO  PyProcess W-19109-model-stdout: Failed invoke service.invoke_handler()
WARN  PyProcess W-19109-model-stderr: 
WARN  PyProcess W-19109-model-stderr: Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
INFO  PyProcess W-19109-model-stdout: Traceback (most recent call last):
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python_engine.py", line 161, in run_server
INFO  PyProcess W-19109-model-stdout:     outputs = self.service.invoke_handler(function_name, inputs)
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/service_loader.py", line 30, in invoke_handler
INFO  PyProcess W-19109-model-stdout:     return getattr(self.module, function_name)(inputs)
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 538, in handle
INFO  PyProcess W-19109-model-stdout:     _service.initialize(inputs.get_properties())
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 135, in initialize
INFO  PyProcess W-19109-model-stdout:     self.rolling_batch = _rolling_batch_cls(
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/rolling_batch/vllm_rolling_batch.py", line 48, in __init__
INFO  PyProcess W-19109-model-stdout:     self.engine = LLMEngine.from_engine_args(args)
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 441, in from_engine_args
INFO  PyProcess W-19109-model-stdout:     engine = cls(
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251, in __init__
INFO  PyProcess W-19109-model-stdout:     self.model_executor = executor_class(
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
INFO  PyProcess W-19109-model-stdout:     self._init_executor()
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
INFO  PyProcess W-19109-model-stdout:     self.driver_worker.load_model()
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
INFO  PyProcess W-19109-model-stdout:     self.model_runner.load_model()
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 682, in load_model
INFO  PyProcess W-19109-model-stdout:     self.model = get_model(model_config=self.model_config,
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
INFO  PyProcess W-19109-model-stdout:     return loader.load_model(model_config=model_config,
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 283, in load_model
INFO  PyProcess W-19109-model-stdout:     model.load_weights(
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 511, in load_weights
INFO  PyProcess W-19109-model-stdout:     weight_loader(param, loaded_weight)
INFO  PyProcess W-19109-model-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 758, in weight_loader
INFO  PyProcess W-19109-model-stdout:     loaded_weight = loaded_weight.narrow(input_dim, start_idx,
INFO  PyProcess W-19109-model-stdout: RuntimeError: start (0) + length (14336) exceeds dimension size (896).
INFO  PyProcess Stop process: -1:19109, failure=false
INFO  PyProcess W-19109-model-stdout: Python engine process died
INFO  PyProcess W-19109-model-stdout: Traceback (most recent call last):
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python_engine.py", line 207, in main
INFO  PyProcess W-19109-model-stdout:     engine.run_server()
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python_engine.py", line 125, in run_server
INFO  PyProcess W-19109-model-stdout:     inputs.read(cl_socket)
INFO  PyProcess Stop process: -1:19109, failure=true
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/inputs.py", line 221, in read
INFO  PyProcess Failure count: 0
INFO  PyProcess W-19109-model-stdout:     prop_size = retrieve_short(conn)
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/inputs.py", line 60, in retrieve_short
INFO  PyProcess W-19109-model-stdout:     data = retrieve_buffer(conn, 2)
INFO  PyProcess W-19109-model-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/inputs.py", line 36, in retrieve_buffer
INFO  PyProcess W-19109-model-stdout:     raise ValueError("Connection disconnected")
INFO  PyProcess W-19109-model-stdout: ValueError: Connection disconnected
INFO  PyProcess ReaderThread(0) stopped - W-19109-model-stdout

How to Reproduce?

Steps to reproduce


  1. Quantize Mistral Instruct v0.2 with AutoAWQ (Marlin format):

    
    from transformers import AutoTokenizer
    from awq import AutoAWQForCausalLM

    model_path = "<local_path_to_mistral_instruct>"  # placeholder for the local model directory
    output_path = "<output_path>"                    # placeholder for the quantized output directory

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    quant_config = {"zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin"}

    # Load the base model and quantize it into the AWQ Marlin format
    quantized_model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
    quantized_model.quantize(tokenizer, quant_config=quant_config)

    quantized_model.save_quantized(output_path)
    tokenizer.save_pretrained(output_path)
2. Try hosting the model using DJL 0.29 + vllm with the following serving config:

    option.model_id=
    option.rolling_batch=vllm
    option.tensor_parallel_degree=1
    option.max_model_len=4096
    option.enable_prefix_caching=true
    option.max_rolling_batch_size=4
    option.dtype=fp16
    load_on_device=*
    gpu.minWorkers=3
    gpu.maxWorkers=3
    option.gpu_memory_utilization=0.3



What have you tried to solve it?

1. I tried providing different quant_method values, thinking it might be a configuration mismatch, but that wasn't the case. In fact, without specifying any option.quantize (see the illustrative variants below), vllm correctly detected the method and version. However, it still resulted in the error mentioned above.
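For illustration, the attempted variants would look roughly like this in serving.properties; the explicit value shown is an assumption for the sketch, on the understanding that DJL forwards option.quantize to vllm's quantization setting:

    # Variant A: omit option.quantize entirely and let vllm auto-detect the
    # method/version from the checkpoint's quantization_config (config shown above)

    # Variant B (illustrative value, not verified): force the method explicitly
    option.quantize=awq_marlin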
siddvenk commented 1 week ago

I am able to reproduce this issue with DJL 0.29.0 (vllm 0.5.3.post1) and DJL 0.30.0 (vllm 0.6.2). I am also able to reproduce this issue with vllm directly, as you pointed out.

This is definitely a vllm issue, and until they fix it, it will be present in DJL as well. While not the same model, I did see this issue in vllm: https://github.com/vllm-project/vllm/issues/3392. It's marked closed, but people are still reporting the same failure (on vllm 0.6.3). I'll see if I can get traction from the vllm team on this.
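For reference, a minimal standalone repro sketch against vllm's offline LLM API (no DJL in the loop); the checkpoint path is a placeholder and the engine arguments simply mirror the serving config above:

    from vllm import LLM, SamplingParams

    # "<path_to_awq_marlin_checkpoint>" is a placeholder for the locally quantized model.
    llm = LLM(
        model="<path_to_awq_marlin_checkpoint>",
        max_model_len=4096,
        gpu_memory_utilization=0.3,
        dtype="float16",
    )

    # With the Marlin-format AWQ checkpoint, construction above already raises the same
    # RuntimeError from weight_loader/narrow(); the generate call is never reached.
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(outputs[0].outputs[0].text)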

siddvenk commented 1 week ago

It does seem like vLLM supports converting a regular AWQ checkpoint to the Marlin format internally at load time, but it does not support being given a checkpoint that was already exported in Marlin format. See https://github.com/vllm-project/vllm/issues/7517. Unfortunately this really is a vllm issue, so until it's fixed there is not much we can do on the DJL side.

Are you able to quantize with AWQ (without Marlin), and then serve with vllm, which will apply the Marlin kernels at runtime?
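For reference, a sketch of that suggested workaround, assuming AutoAWQ's standard GEMM config (paths are placeholders); the GEMM-quantized output can then be served with the same config above, and vllm applies the Marlin kernels at load time on supported GPUs:

    from transformers import AutoTokenizer
    from awq import AutoAWQForCausalLM

    model_path = "<local_path_to_mistral_instruct>"  # placeholder
    output_path = "<output_path_awq_gemm>"           # placeholder

    # Standard AutoAWQ config (GEMM format) rather than version="Marlin"
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
    model.quantize(tokenizer, quant_config=quant_config)

    # Save plain AWQ weights; point option.model_id at this directory and let vllm
    # upgrade them to Marlin kernels itself where the hardware supports it.
    model.save_quantized(output_path)
    tokenizer.save_pretrained(output_path)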