casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

vLLM AutoAWQ with 4 GPUs doesn't utilize GPU #479

danielstankw opened this issue 4 months ago

danielstankw commented 4 months ago

I have downloaded a model, and on my 4-GPU instance I am attempting to quantize it using AutoAWQ. Whenever I run the script below I get 0% GPU utilization. Can anyone help me figure out why this might be happening?

import json
from huggingface_hub import snapshot_download
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import os

# some other code here
# ////////////////
# some code here

# Load model
model = AutoAWQForCausalLM.from_pretrained(args.model_path, device_map="auto", low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True)

# Load quantization config if provided (parsed from a JSON string)
if args.quant_config:
    quant_config = json.loads(args.quant_config)
else:
    # Default quantization config
    print("Using default quantization config")
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Quantize
print("Quantizing the model")
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model and tokenizer
if args.quant_path:
    print("Saving the model")
    model.save_quantized(args.quant_path)
    tokenizer.save_pretrained(args.quant_path)
else:
    print("No quantized model path provided, not saving quantized model.")

[Screenshots showing 0% GPU utilization during quantization]

WanBenLe commented 4 months ago

AWQ is a weight-only quantization method: the activations of each layer are used only to calibrate the weight scales, so 'model.quantize' in your code performs relatively little GPU computation (it still requires GPU memory). That is why GPU utilization stays low during quantization; you will see utilization increase when you use the model for inference.
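
To actually see the GPUs get busy, load the saved checkpoint back and run generation. A minimal sketch based on the AutoAWQ inference examples, assuming the quantized model was saved locally as in the script above (the path is a placeholder):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "path/to/quantized-model"  # placeholder: wherever save_quantized wrote the files

# Load the quantized weights; fuse_layers=True enables AutoAWQ's fused kernels for faster inference
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

tokens = tokenizer("What does AWQ quantization do?", return_tensors="pt").input_ids.cuda()

# GPU utilization should rise during this call
model.generate(tokens, streamer=streamer, max_new_tokens=64)

(With vLLM, the equivalent is to load the quantized checkpoint with quantization="awq".)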