IST-DASLab / gptq

Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
https://arxiv.org/abs/2210.17323
Apache License 2.0

running speed slow on NVIDIA vGPU #45

Open foricee opened 8 months ago

foricee commented 8 months ago

I tested GPTQ quantization of Qwen-7B on a vGPU with roughly half the performance of an A10.

I have noticed that both the context (prefill) processing speed and the decoding speed are particularly slow.

The code is nothing special; it looks like this:

from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
...
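To make "particularly slow" concrete, a minimal throughput measurement could look like the sketch below. The `tokens_per_second` helper is mine, and the commented `model`/`inputs` usage assumes the same `auto_gptq` setup as above (it is a placeholder, not code from the original report):

```python
import time

def tokens_per_second(num_new_tokens: int, elapsed_seconds: float) -> float:
    """Decoding throughput: generated tokens divided by wall-clock time."""
    return num_new_tokens / elapsed_seconds

# Hypothetical usage against the quantized model loaded above:
# start = time.perf_counter()
# out = model.generate(**inputs, max_new_tokens=128)
# elapsed = time.perf_counter() - start
# new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
# print(f"{tokens_per_second(new_tokens, elapsed):.1f} tok/s")
```

Reporting a number like this (e.g. tok/s for prefill vs. decode separately) would make it easier to compare against expected A10 performance.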