IST-DASLab / gptq

Code for the ICLR 2023 paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers".
https://arxiv.org/abs/2210.17323
Apache License 2.0

How should I verify the speedup effect of the algorithm? #38

Open · moonlightian opened this issue 1 year ago

moonlightian commented 1 year ago

Hi~ Thank you for your great work! The paper reports that GPTQ leads to significant end-to-end inference speedups, but after quantizing BLOOM-7B to INT8 with GPTQ, I found it roughly twice as slow as the FP16 model. How can I reproduce the speedups shown in the paper?
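For reference, this is roughly how I am timing generation: a minimal per-token latency sketch assuming a Hugging Face `transformers` checkpoint (the model name, prompt, and token counts are illustrative; I run it once with the FP16 model and once with the quantized one):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative names/sizes; swap in the quantized checkpoint for the second run.
MODEL = "bigscience/bloom-7b1"
NEW_TOKENS = 128

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).cuda()
model.eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")

# Warm-up pass so one-time CUDA initialization does not skew the timing.
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=8)

torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=NEW_TOKENS)
torch.cuda.synchronize()
elapsed = time.time() - start

generated = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{generated} tokens in {elapsed:.2f}s -> {elapsed / generated * 1000:.1f} ms/token")
```

Since single-batch generation is memory-bound, I expected the quantized weights to make decoding faster, not slower. If I understand the README correctly, the paper's end-to-end numbers come from the custom 3-bit CUDA kernels exercised via the `--benchmark` flag in `opt.py`/`bloom.py`; is that the intended way to verify the speedup?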