Open ChuanhongLi opened 1 week ago
Dear ChuanhongLi: Thanks a lot for your question. We agree that it is essential to compress the model after quantization. The original purpose of our work is to utilize the low-precision computing units of the GPU while keeping accuracy, which means that we want to accelerate the prefill step for long sequence inputs, or to accelerate the decoding step for large batches, with the INT8/INT4 tensor cores. So we quantize a model such as LLaMA 7B (14 GB in FP16) to 8 bit (7 GB).
However, in order to keep efficiency for small-batch inputs, we also save a copy of the model weights in EETQ format (8 bit, 7.5 GB), which means that the total model is 14.5 GB.
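A rough back-of-the-envelope calculation (a sketch only; the 7.5 GB EETQ figure is the one quoted above, the rest is simple arithmetic) shows where the 14.5 GB comes from and why it ends up slightly larger than the original FP16 checkpoint:

```python
# Size estimate for a ~7B-parameter model stored as FP16 vs. the two
# quantized copies described above. The 7.5 GB EETQ number is taken
# from the reply; everything else is bytes-per-weight arithmetic.

params = 7e9                   # ~7 billion weights (LLaMA-7B class model)

fp16_gb = params * 2 / 1e9     # 2 bytes/weight -> ~14 GB original model
int8_gb = params * 1 / 1e9     # 1 byte/weight  -> ~7 GB MixQ INT8 weights
eetq_gb = 7.5                  # extra EETQ INT8 copy kept for small batches

total_gb = int8_gb + eetq_gb   # ~14.5 GB saved on disk

print(f"FP16: {fp16_gb:.1f} GB, MixQ INT8: {int8_gb:.1f} GB, "
      f"INT8 + EETQ copy: {total_gb:.1f} GB")
```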
Both assumptions hold when we have a GPU such as a 4090, A100, or H100 with enough memory. If we need to run MixQ on a device such as a 4080 or 3080, we need to optimize the EETQ model weights.
We think we could achieve this in a later version.
Thanks for your quick reply. In the current implementation, can we choose not to save the model weights with EETQ and just save the original quantized model? I only have a 4080 card with 16 GB of memory (-_-).
Thanks for the excellent work!
I use examples/basic_quant_mix.py to quantize the Qwen2-7B model with --w_bit 8. It's very strange that the quantized model is even larger than the original model.
As far as I know, the purpose of quantization is to reduce the size of the model and thus the consumption of GPU memory, but why is the model larger after MixQ quantization?
Thanks!
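For reference, here is a minimal sketch of how the on-disk sizes can be compared, assuming the original and quantized checkpoints live in two local directories (the paths are placeholders, not MixQ defaults):

```python
# Sum the file sizes under the original and quantized checkpoint
# directories to compare them; the directory names are hypothetical.
import os

def dir_size_gb(path: str) -> float:
    """Total size of all files under `path`, in gigabytes."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e9

print("original :", dir_size_gb("Qwen2-7B"), "GB")          # placeholder path
print("quantized:", dir_size_gb("Qwen2-7B-mixq-w8"), "GB")  # placeholder path
```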