casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

LLaMA-3 issues when used with vLLM #452

catid opened this issue 5 months ago

catid commented 5 months ago

I tried these two quantization approaches:

model_path = '/home/catid/models/Meta-Llama-3-70B-Instruct'
quant_path = 'cat-llama-3-70b-q128-w4-gemvfast'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv_fast" }

model_path = '/home/catid/models/Meta-Llama-3-70B-Instruct'
quant_path = 'cat-llama-3-70b-q128-w4-gemvfast'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv" }

Both result in the same error in vLLM:

  File "/home/catid/sources/vllm/vllm/model_executor/layers/linear.py", line 558, in weight_loader
    loaded_weight = loaded_weight.narrow(input_dim, start_idx,
RuntimeError: start (0) + length (14336) exceeds dimension size (8192).
(RayWorkerWrapper pid=45548) ERROR 04-20 03:14:37 worker_base.py:153] Error executing method load_model. This might cause deadlock in distributed execution.

The "gemm" version works fine, though.
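
For reference, the load that triggers the error is the standard vLLM AWQ path. A minimal sketch (the tensor_parallel_size here is an assumption; the RayWorkerWrapper line in the trace only implies a multi-GPU setup):

```python
from vllm import LLM, SamplingParams

# Load the AWQ checkpoint in vLLM. quantization="awq" selects vLLM's AWQ
# weight loader; tensor_parallel_size=2 is a placeholder for the GPU count.
llm = LLM(
    model='cat-llama-3-70b-q128-w4-gemvfast',
    quantization='awq',
    tensor_parallel_size=2,
)

# Simple smoke test once loading succeeds
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```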

casper-hansen commented 5 months ago

GEMVFast is not implemented in vLLM yet

casper-hansen commented 5 months ago

I'm planning a PR to implement this functionality in vLLM

https://github.com/vllm-project/vllm/pull/3289

SinanAkkoyun commented 5 months ago

> I'm planning a PR to implement this functionality in vLLM

Is there an alternative way to get continuous batching with GEMVFast? I'd really like to start generating a new, separate request while the old batch is still generating, without waiting for it to finish.

casper-hansen commented 4 months ago

Currently, there is no option for it. You will have to wait until other software packages support it.
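
In the meantime, the practical fallback is what the thread already shows working: quantize with "version": "gemm" and let vLLM's scheduler handle the continuous batching. A sketch of that route, assuming the AsyncLLMEngine API of the vLLM version current at the time (the checkpoint path is a placeholder):

```python
import asyncio
import uuid

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams


async def run_one(engine: AsyncLLMEngine, prompt: str) -> str:
    # Each request gets its own id; vLLM's scheduler merges new requests into
    # the batch that is already running (continuous batching), so a request
    # submitted mid-generation does not wait for the old batch to finish.
    final = None
    async for out in engine.generate(prompt, SamplingParams(max_tokens=64), str(uuid.uuid4())):
        final = out
    return final.outputs[0].text


async def main() -> None:
    # Hypothetical path to a GEMM-version AWQ checkpoint (the kernel vLLM supports today)
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="cat-llama-3-70b-q128-w4-gemm", quantization="awq")
    )
    texts = await asyncio.gather(
        run_one(engine, "First prompt"),
        run_one(engine, "Second prompt, added while the first is running"),
    )
    print(texts)


asyncio.run(main())
```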

danielstankw commented 4 months ago

@catid How much RAM do you use for that? My 31 GB gets overfilled when quantizing the model.
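
(A 70B FP16 checkpoint is roughly 140 GB of weights, so 31 GB of system RAM is far short of what the quantization pass needs to hold the model. The AutoAWQ examples at least load with reduced CPU memory usage; a sketch, assuming these keyword arguments are forwarded to transformers' from_pretrained:)

```python
from awq import AutoAWQForCausalLM

model_path = '/home/catid/models/Meta-Llama-3-70B-Instruct'

# low_cpu_mem_usage avoids building a second full copy of the weights in RAM
# while loading; use_cache=False skips allocating the generation KV cache.
# (Assumption: both kwargs are passed through to transformers' from_pretrained.)
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
)
```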