casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

Converting finetuned Llama 3.1 using LORA into AWQ #583

Open fusesid opened 2 months ago

fusesid commented 2 months ago

I have finetuned Llama 3.1 using Unsloth. Then I merged and unloaded the LoRA adapters and pushed the resulting model to the Hub.

Now when I try to quantize it using:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_config = {
  "zero_point": True,
  "q_group_size": 128,
  "w_bit": 4,
  "version": "GEMM",
}

# Load model (model_path and access_token are defined elsewhere)
model = AutoAWQForCausalLM.from_pretrained(
  model_path, low_cpu_mem_usage=True, use_cache=False, token=access_token
)
tokenizer = AutoTokenizer.from_pretrained(model_path, token=access_token)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

it fails with: RuntimeError: output with shape [8388608, 1] doesn't match the broadcast shape [8388608, 4096]

I am confused and not sure what the issue is. Can anyone please guide me?

casper-hansen commented 2 months ago

If you have a normal FP16/BF16 model, this does not happen. I would suggest you check whether the model can run inference with the Hugging Face libraries as a first step.
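For reference, a minimal sanity check along those lines might look like the sketch below. It is only a sketch: the repo name is a placeholder, not the user's actual model, and the dtype print is one way to confirm the checkpoint is really plain FP16/BF16.

# Hypothetical sanity check: load the merged model with plain transformers
# and generate a few tokens before attempting AWQ quantization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "my-user/llama-3.1-merged"  # placeholder for the finetuned repo
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# If the checkpoint is not plain FP16/BF16, the parameter dtypes will show it.
print({p.dtype for p in model.parameters()})

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))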

fusesid commented 2 months ago

@casper-hansen [Screenshot from 2024-08-13 19-12-53: inference running successfully]

Yeah, I am able to run inference with the Hugging Face model, as can be seen in the screenshot above.

Not sure what the issue is with converting it into the AWQ format, as I want to test AWQ with vLLM.
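For context, the intended end goal is roughly the following: once an AWQ checkpoint exists, vLLM can load it with quantization="awq". This is only a sketch and the model path is a placeholder.

# Hypothetical usage once quantization succeeds: serve the AWQ model with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="my-user/llama-3.1-awq", quantization="awq")  # placeholder repo
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)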

An important note: I used Unsloth for finetuning with LoRA and saved the model using the merge_and_unload() method of PeftModel.
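For completeness, that merge step typically looks like the sketch below when done with PEFT. This is an assumed reconstruction, not the user's actual script; the base model and repo names are placeholders.

# Hypothetical reconstruction of the merge-and-push step with PEFT.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed base model
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "my-user/llama-3.1-lora-adapter")  # placeholder

merged = model.merge_and_unload()  # folds the LoRA weights into the base model

tokenizer = AutoTokenizer.from_pretrained(base_id)
merged.push_to_hub("my-user/llama-3.1-merged")  # placeholder repo name
tokenizer.push_to_hub("my-user/llama-3.1-merged")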