Open foricee opened 8 months ago
I tested qwen-7b with GPTQ quantization on a vGPU that has about half the performance of an A10.
I noticed that both the context (prompt) processing speed and the decoding speed are particularly slow:
decode speed: 1.6 token/s
Then I tested another model, https://huggingface.co/ClueAI/ChatYuan-large-v2, and its speed was within expectations. So I guess that GPTQ does not work well on a vGPU?
The code is nothing special; it looks like:
```python
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
...
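For reference, this is roughly how I compute the token/s figure above: time the generate call and divide the number of newly decoded tokens by the elapsed wall-clock time. This is a minimal sketch; `fake_generate` here just stands in for the real `model.generate(...)` call so the harness is self-contained, and the function names are my own, not part of the auto_gptq API.

```python
import time

def measure_decode_speed(generate_fn, prompt_len):
    """Time one generation call and return (new_tokens, tokens_per_second).

    generate_fn(prompt_len) must return the total output sequence length
    (prompt tokens + newly decoded tokens), like model.generate output.
    """
    start = time.perf_counter()
    total_len = generate_fn(prompt_len)
    elapsed = time.perf_counter() - start
    new_tokens = total_len - prompt_len
    return new_tokens, new_tokens / elapsed

# Stand-in for the real model: pretends to decode 16 new tokens.
def fake_generate(prompt_len):
    time.sleep(0.01)
    return prompt_len + 16

new_tokens, tps = measure_decode_speed(fake_generate, prompt_len=8)
print(new_tokens, tps)
```

With the real model swapped in, a 30-second call that produces 48 new tokens would report the 1.6 token/s I am seeing.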