OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License

Inquiry about Activation Quantization Strategy in Inference #1

Closed lucfisc closed 1 year ago

lucfisc commented 1 year ago

Hi Team,

Great work, really interesting. I was wondering about one aspect of the paper. You say that you use "per-token activation quantization". Is that dynamic quantization, or static at test time?

In your tests with MLC-LLM, you only benchmark weight-only quantization, and the INT2/INT3 speeds are worse than INT4. Is that due to constraints of MLC-LLM, and are you only targeting memory reduction?

Thanks

ChenMnZ commented 1 year ago

Hi, thanks for your interest.

  1. We use dynamic quantization for weight-activation quantization at test time.
  2. The speed of weight-only quantization depends on the executed kernels, which makes it a significant engineering project. We chose MLC-LLM because it is the only framework we found that supports INT4/INT3/INT2 quantization. However, MLC-LLM's support for INT3/INT2 is currently suboptimal, particularly for INT3. Improving INT3/INT2 quantization speed is on our future roadmap.
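To illustrate point 1: dynamic per-token quantization means each token's (row's) scale is computed from the activations at runtime, rather than calibrated offline. Below is a minimal NumPy sketch of this idea; the function names and the symmetric int8 scheme are illustrative assumptions, not OmniQuant's actual implementation.

```python
import numpy as np

def quantize_per_token(x: np.ndarray, n_bits: int = 8):
    """Symmetric dynamic per-token quantization (illustrative sketch).

    x: activations of shape (num_tokens, hidden_dim).
    Each row gets its own scale, derived from that row's max
    absolute value at runtime -- this is what makes it "dynamic".
    """
    qmax = 2 ** (n_bits - 1) - 1  # e.g. 127 for int8
    # One scale per token (row); clamp to avoid division by zero.
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-5) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Recover approximate float activations.
    return q.astype(np.float32) * scale
```

A static scheme would instead fix `scale` ahead of time from a calibration set; the dynamic variant trades a small runtime cost (the per-row max) for better fidelity on outlier-heavy activations.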
lucfisc commented 1 year ago

Thank you for the prompt answer. Closing the issue!