OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License

Difference between fake quant and real quant #61

Closed YihengBrianWu closed 5 months ago

YihengBrianWu commented 8 months ago

Dear author, thanks for your amazing work. We are trying to apply it to our own model. What is the difference between fake quant and real quant? I ask because the w3a16 llama2-7b-chat model fake-quantized by OmniQuant has a slower inference time than the fp16 model when run with transformers.

ChenMnZ commented 7 months ago

Real quantization packs the quantized weights using the GPTQ kernel, which gives real memory savings.

However, the kernel is suboptimal, which leads to slow inference.
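
To make the distinction concrete, here is a minimal pure-Python sketch. It is not OmniQuant's code and not the GPTQ kernel. It assumes the usual meaning of the terms: fake quantization rounds weights to the quantization grid but stores them as floats again, while real quantization keeps the integer codes and packs them into bytes. All function names are hypothetical, and 4-bit weights are used only because two codes fit neatly into one byte (the question above concerns 3-bit weights).

```python
def quantize(weights, n_bits):
    """Asymmetric uniform quantization: returns (codes, scale, zero_point)."""
    q_max = 2 ** n_bits - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to 1.0 so a constant weight list does not divide by zero.
    scale = (w_max - w_min) / q_max or 1.0
    zero_point = round(-w_min / scale)
    # Round to the nearest grid point and clamp to the valid code range.
    codes = [min(q_max, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Map integer codes back to floating-point values."""
    return [(c - zero_point) * scale for c in codes]


def fake_quant(weights, n_bits):
    """Fake quant: values snap to the grid, but the result is still floats.

    Storage size and matmul cost are the same as for the original weights.
    """
    codes, scale, zero_point = quantize(weights, n_bits)
    return dequantize(codes, scale, zero_point)


def pack_4bit(codes):
    """Real quant storage: pack two 4-bit codes into each byte."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_4bit(packed, n):
    """Recover the first n 4-bit codes from packed bytes."""
    codes = []
    for b in packed:
        codes.append(b & 0xF)
        codes.append(b >> 4)
    return codes[:n]


if __name__ == "__main__":
    weights = [-1.0, -0.5, 0.0, 0.25, 0.9]
    fq = fake_quant(weights, 4)
    codes, scale, zp = quantize(weights, 4)
    packed = pack_4bit(codes)
    print("fake-quant floats:", fq)
    print("packed bytes:", len(packed), "for", len(weights), "weights")
```

In this sketch both paths yield the same numerical values after dequantization. The difference is in what is stored: the fake-quant path keeps one float per weight, so nothing is saved, while the packed path stores half a byte per weight plus a scale and zero point. The packed path must unpack and dequantize at inference time, which is where kernel quality matters for speed.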