Are you asking about compressing or loading 3-bit AWQ models?
The instructions at https://huggingface.co/compressed-llm/vicuna-13b-v1.3-awq are for loading, not for compressing.
Compressing. llm-awq can only compress the 4-bit case, so may I ask how to compress the 3-bit and 8-bit cases?
The screenshot is from qmodule.py in llm-awq.
What I want to know is: when you conducted the experiments, did you compress the models yourselves, or did you load the quantized models following the Hugging Face instructions?
If the models on Hugging Face were compressed by you, how did you compress them, given that llm-awq does not support 3-bit and 8-bit quantization?
Hope to receive your reply, thank you!
Hi @coolknow, thanks for reaching out. All the models were quantized/compressed by us. For AWQ quantization, we used fake quantization (args.q_backend='fake' here), which does not trigger the 4-bit constraint. Here is the command we used for llm-awq quantization:
CUDA_VISIBLE_DEVICES=0 python -m awq.entry --model_path meta-llama/Llama-2-13b-chat-hf \
--w_bit 3 --q_group_size 128 \
--run_awq --dump_awq awq_cache/model.pt
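For context, fake quantization only simulates the low-bit representation: weights are rounded to w_bit integer levels per group and immediately dequantized back to fp16, so no 3-bit or 8-bit CUDA kernel is required. Below is a minimal Python sketch of group-wise asymmetric fake quantization in the spirit of llm-awq's pseudo-quantization; the function name and shapes are illustrative assumptions, not the library's exact API.

import torch

def fake_quantize(w: torch.Tensor, w_bit: int = 3, group_size: int = 128) -> torch.Tensor:
    """Round weights to w_bit levels per group, then dequantize back.
    Illustrative sketch only -- not the exact llm-awq implementation."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)                      # [num_groups, group_size]
    w_max = w.amax(dim=1, keepdim=True)
    w_min = w.amin(dim=1, keepdim=True)
    max_int = 2 ** w_bit - 1                           # works for any bit-width: 3, 4, 8, ...
    scale = (w_max - w_min).clamp(min=1e-5) / max_int  # per-group scale
    zero = (-w_min / scale).round()                    # per-group zero point
    q = (w / scale + zero).round().clamp(0, max_int)   # integer levels
    w_dq = (q - zero) * scale                          # dequantize back to floating point
    return w_dq.reshape(orig_shape)

# Usage (hypothetical layer): replace a linear layer's weights in place.
# layer.weight.data = fake_quantize(layer.weight.data, w_bit=3, group_size=128)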
Thank you so much. The fake quantization method works well.
When I tried to use the method in llm-awq to quantize models by setting w_bit = 3, 4, and 8, I found that it only worked for 4-bit. So, I want to clarify how you compressed the models.
Should I follow the instructions on Hugging Face for 3-bit and 8-bit?
https://huggingface.co/compressed-llm/vicuna-13b-v1.3-awq
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
from auto_gptq import AutoGPTQForCausalLM  # imported per the model card; not used below
import torch

model_path = 'efficient-llm/vicuna-13b-v1.3-awq'
# The revision selects the quantized branch, e.g. '3bit_128g' for 3-bit, group size 128.
config = AutoConfig.from_pretrained(model_path, revision='3bit_128g', trust_remote_code=True)
enc = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf', trust_remote_code=True)
kwargs = {"torch_dtype": torch.float16, "low_cpu_mem_usage": True}
model = AutoModelForCausalLM.from_pretrained(
    model_path, config=config, trust_remote_code=True,
    device_map='auto', revision='3bit_128g', **kwargs)

model.eval()
input_ids = enc('How are you today?', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(enc.decode(outputs[0]))