OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License

How to accelerate the inference speed with real_quant #80

Open j2kim99 opened 1 month ago

j2kim99 commented 1 month ago

Hi. I am very interested in your paper and am trying to use it in our project to quantize Llama-2-7b and Llama-2-70b. However, I have failed to accelerate inference through packing with the real_quant option. (In fact, inference gets slower after W4A16 quantization.) I am currently using an A100 80GB with the following environment: CUDA 12.1, torch 2.2.1+cu121, transformers 4.36.0, auto-gptq 0.7.1.

I've also tried other versions of CUDA, torch, transformers, and auto-gptq, but packing does not accelerate generation in any of those configurations. I think the kernels are properly installed, since the kernel-availability flags in auto-gptq's import_utils.py are all set to True except for QIGEN_AVAILABLE, and the benchmark token-generation speed of a model quantized (and packed) by auto-gptq itself is not slower than the pretrained FP model.
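For reference, this is roughly how I measure token-generation speed (a minimal sketch; the model path, prompt, and generation length are placeholders, not my exact script):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: point this at either the FP16 checkpoint or the packed W4A16 checkpoint.
model_path = "path/to/llama-2-7b-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="cuda"
)
model.eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")
max_new_tokens = 128

# Warm-up run so CUDA kernels are compiled/cached before timing.
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

generated = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{generated / elapsed:.2f} tokens/s")
```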

My guess is that my adaptation of your code cannot use the kernel properly because of some version incompatibility. Could you specify the environment (CUDA, torch, transformers, and auto-gptq versions) in which you accelerated the model through packing? I would appreciate any help.

Again, your paper and the ideas behind OmniQuant are amazing; thank you for sharing your work.

ChenMnZ commented 1 month ago

Hi, thanks for your interest in our work.

The AutoGPTQ kernel is slow; it is only used to verify the memory reduction from quantization.

The speedup reported in our paper was measured with MLC-LLM.
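A rough sketch of benchmarking generation through MLC-LLM's Python API (the model id below is illustrative, and class names depend on your mlc-chat version; see the notebook in this repo for the exact workflow):

```python
# Sketch only: assumes the mlc_chat Python package is installed and the quantized
# model has already been compiled with MLC-LLM into dist/<model-name>/.
from mlc_chat import ChatModule

# "Llama-2-7b-omniquant-w4a16" is a placeholder name for the compiled artifact.
cm = ChatModule(model="dist/Llama-2-7b-omniquant-w4a16")

# Generate with the compiled kernels, then print prefill/decode throughput stats.
print(cm.generate(prompt="What is the capital of France?"))
print(cm.stats())
```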

j2kim99 commented 1 month ago

Thank you for your quick response. I will try MLC-LLM. Could you share a code snippet for applying MLC-LLM to OmniQuant that I can refer to?

j2kim99 commented 1 month ago

I've found that you already uploaded a code snippet as an .ipynb file. Sorry to bother you.