OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License

about decode speed and gpu memory usage #34

Closed tro0o closed 8 months ago

tro0o commented 9 months ago

I used your real_quant parameters to obtain the quantized LLaMA-7B model and tested inference speed on an A10. However, for both w4a16 and w4a4, the inference speed was only 7 tokens/s, and memory usage exceeded 7 GB. These results differ significantly from those reported in the paper (100+ tokens/s, 5.7 GB). Is it possible for you to open-source your testing code?
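For context, a quick back-of-the-envelope check shows what packed 4-bit weights alone should occupy. This is only a sketch: the ~6.7B parameter count for LLaMA-7B is an assumption, and KV cache, activations, quantization scales/zero-points, and framework overhead are all ignored, so real usage will be higher than the weight figure.

```python
# Rough memory estimate for weight-only quantization of a ~7B-parameter model.
# The 6.7e9 parameter count is an assumption for illustration; overheads
# (KV cache, activations, scales/zero-points) are deliberately ignored.
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """GiB occupied by the packed weights alone."""
    return num_params * bits_per_weight / 8 / 1024**3

fp16 = weight_memory_gib(6.7e9, 16)  # ~12.5 GiB
w4 = weight_memory_gib(6.7e9, 4)     # ~3.1 GiB
print(f"fp16 weights: {fp16:.1f} GiB, int4 weights: {w4:.1f} GiB")
```

So if a w4 run still needs 7+ GB, much of the gap is likely runtime overhead or weights that are not actually stored packed, which is why a runtime with fused low-bit kernels matters.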

ChenMnZ commented 9 months ago

If you want to obtain practical memory reduction and speedup, you should leverage mlc-llm; refer to https://github.com/OpenGVLab/OmniQuant/blob/main/runing_quantized_models_with_mlc_llm.ipynb for more details.