OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License

How to use AutoGPTQ to achieve real quantization? #50

Closed AboveParadise closed 4 months ago

AboveParadise commented 6 months ago

I have already installed AutoGPTQ, what is the next step?

ChenMnZ commented 6 months ago

Add `--real_quant` to your command to perform real quantization, and add `--save_dir SAVE_PATH` to save the quantized models. You can also see https://github.com/OpenGVLab/OmniQuant/blob/main/runing_falcon180b_on_single_a100_80g.ipynb or https://github.com/OpenGVLab/OmniQuant/blob/main/runing_quantized_mixtral_7bx8.ipynb for more details about running the real-quantized models.
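Putting the two flags together, a quantization command might look like the sketch below. Only `--real_quant` and `--save_dir` come from the reply above; the model path, bit-width, and other arguments are illustrative placeholders, so check the repository's `main.py` for the exact option names and defaults.

```shell
# Hypothetical example: W4A16 weight-only quantization of a LLaMA model.
# --real_quant packs weights into low-bit format (actual memory savings),
# --save_dir writes the resulting quantized checkpoint to disk.
python main.py \
    --model /path/to/llama-7b \
    --wbits 4 --abits 16 \
    --lwc \
    --real_quant \
    --save_dir ./quantized/llama-7b-w4a16
```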

AboveParadise commented 6 months ago

> Add `--real_quant` to your command to perform real quantization, and add `--save_dir SAVE_PATH` to save the quantized models.

Thanks for your reply! So your work can actually reduce GPU memory usage, right?

ChenMnZ commented 4 months ago

Yes, with `--real_quant`, OmniQuant can actually reduce the memory footprint.