Open foricee opened 8 months ago
I tested qwen-7b with GPTQ quantization on a vGPU that has about half the performance of an A10.
I noticed that both the context (prompt) processing speed and the decoding speed are particularly slow:
decode speed: 1.6 token/s
Then I tested another model, https://huggingface.co/ClueAI/ChatYuan-large-v2, and its speed was within expectations. So I guess that GPTQ does not work well on a vGPU?
The code is nothing special; it looks like:
```python
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
...
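For reference, this is roughly how I compute the token/s figure above: time the generate call and divide the number of newly decoded tokens by the elapsed wall-clock time. This is a minimal sketch; `fake_generate` here just stands in for the real `model.generate(...)` call so the harness is self-contained, and the function names are my own, not part of the auto_gptq API.

```python
import time

def measure_decode_speed(generate_fn, prompt_len):
    """Time one generation call and return (new_tokens, tokens_per_second).

    generate_fn(prompt_len) must return the total output sequence length
    (prompt tokens + newly decoded tokens), like model.generate output.
    """
    start = time.perf_counter()
    total_len = generate_fn(prompt_len)
    elapsed = time.perf_counter() - start
    new_tokens = total_len - prompt_len
    return new_tokens, new_tokens / elapsed

# Stand-in for the real model: pretends to decode 16 new tokens.
def fake_generate(prompt_len):
    time.sleep(0.01)
    return prompt_len + 16

new_tokens, tps = measure_decode_speed(fake_generate, prompt_len=8)
print(new_tokens, tps)
```

With the real model swapped in, a 30-second call that produces 48 new tokens would report the 1.6 token/s I am seeing.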