danielstankw opened 6 months ago
I have downloaded a model, and on my 4-GPU instance I am trying to quantize it with AutoAWQ. Whenever I run the script below, GPU utilization stays at 0%. Can anyone help me understand why this is happening?

AWQ is a weight-only quantization method that uses each layer's activations to calibrate the quantization, so the quantization pass (`model.quantize` in your code) does not involve heavy GPU computation, although it still requires GPU memory. That is why GPU utilization stays low while quantizing; you will see utilization rise once you run inference with the quantized model.
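For reference, a minimal sketch of this kind of quantization script, modeled on the AutoAWQ README examples; the model path, output path, and `quant_config` values here are placeholders I chose for illustration, not taken from this thread:

```python
# Sketch of a typical AutoAWQ weight-only quantization script.
# Paths and config values below are placeholders, not from this thread.
quant_config = {
    "zero_point": True,   # asymmetric (zero-point) quantization
    "q_group_size": 128,  # group size for the quantized weights
    "w_bit": 4,           # weight-only 4-bit quantization
    "version": "GEMM",    # kernel variant used at inference time
}

def quantize(model_path: str, quant_path: str) -> None:
    # Imports deferred so the sketch can be read without autoawq
    # installed; running it requires `pip install autoawq` plus a GPU.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # AWQ calibration: collects per-layer activation statistics and
    # rescales weights. This step is light on GPU compute (utilization
    # can sit near 0%) even though the model occupies GPU memory.
    model.quantize(tokenizer, quant_config=quant_config)

    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)

# Usage (requires a GPU and downloaded model weights):
# quantize("path/to/downloaded-model", "path/to/quantized-output")
```

While `quantize` runs, a tool such as `nvidia-smi` will show allocated GPU memory but near-zero utilization, which matches the behavior described above.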