QwenLM / Qwen2

Qwen2 is the large language model series developed by Qwen team, Alibaba Cloud.

Precision and usage of the AWQ models #401

Closed · chenchunhui97 closed this issue 3 months ago

chenchunhui97 commented 3 months ago

Looking at the AWQ-quantized models released for Qwen1.5, the model weights appear to be stored as torch.int32? Why such a large type instead of int8? Is the actual computation also done in int32? (Inside the linear layers I see torch.float16 being used.) The module's w_bit is 4; if w_bit = 4, how can the weights be int32? Why am I advised to use dtype=float16 at runtime? When deploying with vLLM, if I do not specify dtype=float16, what data type is actually used during computation?

jklj077 commented 3 months ago

Hi, it is very common for first-timers to be confused about how quantization techniques work for large language models. Let me try to explain (in an overly simplified manner).

In general, most quantization techniques quantize the model parameters to lower bit-widths for storage; at inference time, the parameters are converted back to the normal precision (dequantization), and the computation is then carried out in that precision.
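To make this concrete, here is a minimal sketch of per-group quantization followed by dequantization and an fp16 matmul. It is plain PyTorch for illustration only; real AWQ additionally rescales salient channels before quantizing, and a real checkpoint stores the packed integers, scales, and zero-points rather than a dequantized weight.

```python
import torch

def fake_quantize(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Per-group asymmetric quantization followed by immediate dequantization.

    Illustrative only: a real AWQ checkpoint keeps the low-bit integers
    (packed into int32) plus per-group scales/zeros instead of this
    dequantized fp16 tensor.
    """
    out_shape = w.shape
    w = w.reshape(-1, group_size).float()
    w_min = w.amin(dim=1, keepdim=True)
    w_max = w.amax(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / (2**n_bits - 1)   # per-group scale
    zero = (-w_min / scale).round()                              # per-group zero-point
    q = (w / scale + zero).round().clamp(0, 2**n_bits - 1)       # the 4-bit integer values
    w_dq = (q - zero) * scale                                    # dequantization
    return w_dq.reshape(out_shape).half()

# At inference time the linear layer still computes an fp16 matmul:
x = torch.randn(1, 4096, dtype=torch.float16)
W = torch.randn(4096, 4096, dtype=torch.float16)
y = x @ fake_quantize(W).T        # y is fp16, just like in a non-quantized layer
```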

As you can see, inference includes the extra step of dequantization, so quantized models should in principle be slower. That is where efficient kernels, such as ExLlama, ExLlamaV2, and Marlin, come into play. These kernels fuse the dequantization and computation steps and can even be faster for small batch sizes, which is commonly batch size one in inference. However, efficient kernels also have limitations: they require newer devices, not all bit-widths are supported, and the related computation can only run in fp16.
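This is also why dtype=float16 is recommended when serving the AWQ checkpoints. A usage sketch (the checkpoint name is just an example; adjust it to the model you actually deploy, and note that with vLLM dtype="auto" is derived from the checkpoint's config):

```python
import torch
from transformers import AutoModelForCausalLM
from vllm import LLM

model_id = "Qwen/Qwen1.5-7B-Chat-AWQ"   # example AWQ checkpoint; adjust to the one you use

# Hugging Face transformers: the weights stay packed in int32, while activations
# and matmuls run in fp16, so the model is loaded with torch_dtype=float16.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# vLLM: passing dtype="float16" makes the fp16 compute path explicit.
llm = LLM(model=model_id, quantization="awq", dtype="float16")
```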

As for how the quantized parameters are stored: since there are also 2-bit and 3-bit variants, the parameters are packed into a larger data type such as int32 for storage. The metadata describing the quantization scheme (such as the bit-width used) is stored separately as hyperparameters, and the kernels unpack and interpret the packed values accordingly during computation.
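This is why the stored tensors are torch.int32 even though w_bit = 4: eight 4-bit values fit into each 32-bit integer. A minimal packing/unpacking sketch follows; it is illustrative only, as the real AWQ/GPTQ kernels use a specific element ordering inside each int32.

```python
import torch

def pack_int4_to_int32(q: torch.Tensor) -> torch.Tensor:
    """Pack eight 4-bit values (held as integers in [0, 15]) into one int32 along the last dim."""
    assert q.shape[-1] % 8 == 0
    q = q.to(torch.int32).reshape(*q.shape[:-1], -1, 8)
    packed = torch.zeros(q.shape[:-1], dtype=torch.int32)
    for i in range(8):
        # the int32 is just a bit container; the 8th nibble lands in the sign bit
        packed |= q[..., i] << (4 * i)
    return packed

def unpack_int32_to_int4(packed: torch.Tensor) -> torch.Tensor:
    """Reverse of the packing step, done inside the kernel before dequantization."""
    shifts = torch.arange(0, 32, 4)                      # 0, 4, ..., 28
    q = (packed.unsqueeze(-1) >> shifts) & 0xF
    return q.reshape(*packed.shape[:-1], -1).to(torch.uint8)

q = torch.randint(0, 16, (4096, 4096), dtype=torch.uint8)   # 4-bit values of a weight matrix
packed = pack_int4_to_int32(q)                               # shape (4096, 512), dtype torch.int32
assert torch.equal(unpack_int32_to_int4(packed), q)          # lossless round trip
```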

In short, the storage dtype is unrelated to the actual bit-width of the quantized parameters; and if efficiency is a concern, computation runs in fp16 regardless.