Aaronhuang-778 / BiLLM

(ICML 2024) BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
https://arxiv.org/abs/2402.04291
MIT License

Inference #7

Open diff7 opened 4 months ago

diff7 commented 4 months ago

Hello, I am a little confused about efficient inference and the kernel implementation for this paper.

Let's say we apply residual quantization $K$ times to some columns or rows. That means we need to multiply those columns with the input vector $K$ times, which increases latency. Any thoughts on how to improve that? A rough sketch of what I mean is below.
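For concreteness, here is a minimal sketch of the naive inference path I am worried about, assuming order-$K$ residual binarization with a mean-absolute scale per component (the function names here are just illustrative, not from the BiLLM codebase):

```python
import torch

def residual_binarize(w: torch.Tensor, K: int):
    """Approximate w as sum_k alpha_k * b_k with b_k in {-1, +1}.
    Each step binarizes the residual left by the previous steps."""
    comps = []
    r = w.clone()
    for _ in range(K):
        alpha = r.abs().mean()   # scale minimizing L2 error for sign binarization
        b = torch.sign(r)
        b[b == 0] = 1.0          # map sign(0) to +1 so b stays in {-1, +1}
        comps.append((alpha, b))
        r = r - alpha * b        # residual passed to the next order
    return comps

def matvec(comps, x: torch.Tensor):
    # Naive inference: one binary matvec per residual order,
    # so order K costs roughly K times the latency of order 1.
    return sum(alpha * (b @ x) for alpha, b in comps)

W = torch.randn(8, 8)
x = torch.randn(8)
print((matvec(residual_binarize(W, K=2), x) - W @ x).norm())
```

In this form, every extra residual order adds a full pass over the affected columns, so the question is whether a fused kernel (or packing the $K$ binary components together) can avoid paying that cost $K$ times.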