OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License

Problems with memory usage and model loading #35

Closed Forival closed 8 months ago

Forival commented 9 months ago

I used the "fake quant" code you provided to quantize LLaMA. During the model loading stage, I found that all the loaded parameters were fp16, and the memory usage was almost the same as the original LLaMA model. How can I load the model with real quantized parameters to get the extremely low memory usage reported in the paper? How do I correctly load a quantized model (taking 4-bit as an example)?
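(For context, this is a minimal sketch of what weight-only "fake" quantization does; the helper name is hypothetical and not the repo's actual code. It shows why the loaded parameters stay fp16: the weights are rounded to a low-bit grid and immediately de-quantized back to floating point, so storage size never shrinks.)

```python
import torch

def fake_quantize(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Simulate per-row symmetric weight quantization.

    The weight is snapped to an n_bits integer grid and then
    de-quantized, so the returned tensor is still fp16 --
    memory usage is unchanged.
    """
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=-1, keepdim=True) / qmax   # per-row scale
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)  # int grid
    return q * scale  # back to floating point: "fake" quant

w = torch.randn(4096, 4096, dtype=torch.float16)
w_fq = fake_quantize(w)
print(w_fq.dtype)  # torch.float16 -- same storage size as the original
```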

ChenMnZ commented 9 months ago

If you want to obtain practical memory reduction and speedup, you should leverage mlc-llm; refer to https://github.com/OpenGVLab/OmniQuant/blob/main/runing_quantized_models_with_mlc_llm.ipynb for more details.
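(For reference, running a compiled quantized model with MLC LLM's Python API of that era looked roughly like the sketch below. The model path is a placeholder, and the exact entry point and compilation steps may differ; the linked notebook is the authoritative walkthrough.)

```python
from mlc_chat import ChatModule  # MLC LLM Python package (API at the time)

# Placeholder path: the directory produced by compiling the OmniQuant
# fake-quant checkpoint with mlc-llm (the notebook walks through this).
cm = ChatModule(model="dist/llama-2-7b-omniquant-w4a16g128-q4f16_1")
print(cm.generate(prompt="What is OmniQuant?"))
```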