QwenLM / Qwen2

Qwen2 is the large language model series developed by Qwen team, Alibaba Cloud.

Precision and usage of the AWQ models #401

Closed · chenchunhui97 closed this issue 3 months ago

chenchunhui97 commented 3 months ago

Looking at the AWQ-quantized models released for Qwen1.5, the model weights appear to be stored as torch.int32? Why such a large type instead of int8? Is the actual computation also done in int32? (Inside the linear layers I see torch.float16 being used.) The module's w_bit is 4; if w_bit = 4, how can the weights be int32? Why am I advised to use dtype=float16 at runtime? When deploying with vLLM, if I do not specify dtype=float16, what data type is actually used during computation?

jklj077 commented 3 months ago

Hi, it is very common for first-timers to be confused about how quantization techniques work for large language models. Let me try to explain (in an overly simplified manner).

In general, most quantization techniques quantize the model parameters to lower bit-widths for storage; at inference time, the parameters are converted back to the normal precision (dequantization), and the computation is then carried out in that precision.
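To make this concrete, here is a minimal sketch of per-group quantization followed by dequantization and an fp16 matmul. It is plain PyTorch for illustration only; real AWQ additionally rescales salient channels before quantizing, and a real checkpoint stores the packed integers, scales, and zero-points rather than a dequantized weight.

```python
import torch

def fake_quantize(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Per-group asymmetric quantization followed by immediate dequantization.

    Illustrative only: a real AWQ checkpoint keeps the low-bit integers
    (packed into int32) plus per-group scales/zeros instead of this
    dequantized fp16 tensor.
    """
    out_shape = w.shape
    w = w.reshape(-1, group_size).float()
    w_min = w.amin(dim=1, keepdim=True)
    w_max = w.amax(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / (2**n_bits - 1)   # per-group scale
    zero = (-w_min / scale).round()                              # per-group zero-point
    q = (w / scale + zero).round().clamp(0, 2**n_bits - 1)       # the 4-bit integer values
    w_dq = (q - zero) * scale                                    # dequantization
    return w_dq.reshape(out_shape).half()

# At inference time the linear layer still computes an fp16 matmul:
x = torch.randn(1, 4096, dtype=torch.float16)
W = torch.randn(4096, 4096, dtype=torch.float16)
y = x @ fake_quantize(W).T        # y is fp16, just like in a non-quantized layer
```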

As you can see, inference includes the extra step of dequantization, so quantized models should in principle be slower. That is where efficient kernels, such as ExLlama, ExLlamaV2, and Marlin, come into play. These kernels fuse the dequantization and computation steps and can even be faster for small batch sizes, which is commonly batch size one in inference. However, efficient kernels also have limitations: they require newer devices, not all bit-widths are supported, and the related computation can only run in fp16.
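This is also why dtype=float16 is recommended when serving the AWQ checkpoints. A usage sketch (the checkpoint name is just an example; adjust it to the model you actually deploy, and note that with vLLM dtype="auto" is derived from the checkpoint's config):

```python
import torch
from transformers import AutoModelForCausalLM
from vllm import LLM

model_id = "Qwen/Qwen1.5-7B-Chat-AWQ"   # example AWQ checkpoint; adjust to the one you use

# Hugging Face transformers: the weights stay packed in int32, while activations
# and matmuls run in fp16, so the model is loaded with torch_dtype=float16.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# vLLM: passing dtype="float16" makes the fp16 compute path explicit.
llm = LLM(model=model_id, quantization="awq", dtype="float16")
```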

As for how the quantized parameters are stored: since there are also 2-bit and 3-bit variants, the parameters are packed into a larger data type such as int32 for storage. The metadata describing the quantization scheme (such as the bit-width used) is stored separately as hyperparameters, and the kernels unpack and interpret the packed values accordingly during computation.
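This is why the stored tensors are torch.int32 even though w_bit = 4: eight 4-bit values fit into each 32-bit integer. A minimal packing/unpacking sketch follows; it is illustrative only, as the real AWQ/GPTQ kernels use a specific element ordering inside each int32.

```python
import torch

def pack_int4_to_int32(q: torch.Tensor) -> torch.Tensor:
    """Pack eight 4-bit values (held as integers in [0, 15]) into one int32 along the last dim."""
    assert q.shape[-1] % 8 == 0
    q = q.to(torch.int32).reshape(*q.shape[:-1], -1, 8)
    packed = torch.zeros(q.shape[:-1], dtype=torch.int32)
    for i in range(8):
        # the int32 is just a bit container; the 8th nibble lands in the sign bit
        packed |= q[..., i] << (4 * i)
    return packed

def unpack_int32_to_int4(packed: torch.Tensor) -> torch.Tensor:
    """Reverse of the packing step, done inside the kernel before dequantization."""
    shifts = torch.arange(0, 32, 4)                      # 0, 4, ..., 28
    q = (packed.unsqueeze(-1) >> shifts) & 0xF
    return q.reshape(*packed.shape[:-1], -1).to(torch.uint8)

q = torch.randint(0, 16, (4096, 4096), dtype=torch.uint8)   # 4-bit values of a weight matrix
packed = pack_int4_to_int32(q)                               # shape (4096, 512), dtype torch.int32
assert torch.equal(unpack_int32_to_int4(packed), q)          # lossless round trip
```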

In short, the storage dtype is unrelated to the actual bit-width of the quantized parameters; and if efficiency is a concern, computation runs in fp16 regardless.