Are you asking about compressing or loading 3-bit AWQ models?
The instructions at https://huggingface.co/compressed-llm/vicuna-13b-v1.3-awq are for loading, not for compressing.
Compressing. llm-awq can only compress the 4-bit case, so may I ask how to compress the 3-bit and 8-bit cases?
The screenshot is from qmodule.py in llm-awq.
What I want to know is: when you conducted the experiments, did you compress the models yourselves, or did you load the quantized models following the Hugging Face instructions?
If the models on Hugging Face were compressed by you, how did you compress them, given that llm-awq does not support 3-bit and 8-bit quantization?
Hope to receive your reply, thank you!
Hi @coolknow, thanks for reaching out. All the models were quantized/compressed by us. For AWQ quantization, we used fake quantization (args.q_backend='fake' here), which does not trigger the 4-bit constraint. Here is the command we used for llm-awq quantization:
CUDA_VISIBLE_DEVICES=0 python -m awq.entry --model_path meta-llama/Llama-2-13b-chat-hf \
--w_bit 3 --q_group_size 128 \
--run_awq --dump_awq awq_cache/model.pt
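For context, fake quantization only simulates the low-bit representation: weights are rounded to w_bit integer levels per group and immediately dequantized back to fp16, so no 3-bit or 8-bit CUDA kernel is required. Below is a minimal Python sketch of group-wise asymmetric fake quantization in the spirit of llm-awq's pseudo-quantization; the function name and shapes are illustrative assumptions, not the library's exact API.

import torch

def fake_quantize(w: torch.Tensor, w_bit: int = 3, group_size: int = 128) -> torch.Tensor:
    """Round weights to w_bit levels per group, then dequantize back.
    Illustrative sketch only -- not the exact llm-awq implementation."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)                      # [num_groups, group_size]
    w_max = w.amax(dim=1, keepdim=True)
    w_min = w.amin(dim=1, keepdim=True)
    max_int = 2 ** w_bit - 1                           # works for any bit-width: 3, 4, 8, ...
    scale = (w_max - w_min).clamp(min=1e-5) / max_int  # per-group scale
    zero = (-w_min / scale).round()                    # per-group zero point
    q = (w / scale + zero).round().clamp(0, max_int)   # integer levels
    w_dq = (q - zero) * scale                          # dequantize back to floating point
    return w_dq.reshape(orig_shape)

# Usage (hypothetical layer): replace a linear layer's weights in place.
# layer.weight.data = fake_quantize(layer.weight.data, w_bit=3, group_size=128)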
Thank you so much. The fake quantization method works well.
When I tried to use the method in llm-awq to quantize models by setting w_bit = 3, 4, and 8, I found that it only worked for 4-bit. So, I want to clarify how you compressed the models.
Should I follow the instructions on Hugging Face for 3-bit and 8-bit?
https://huggingface.co/compressed-llm/vicuna-13b-v1.3-awq
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
from auto_gptq import AutoGPTQForCausalLM  # imported per the model card; not used below
import torch

model_path = 'efficient-llm/vicuna-13b-v1.3-awq'
# The revision selects the quantized branch, e.g. '3bit_128g' for 3-bit, group size 128.
config = AutoConfig.from_pretrained(model_path, revision='3bit_128g', trust_remote_code=True)
enc = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf', trust_remote_code=True)
kwargs = {"torch_dtype": torch.float16, "low_cpu_mem_usage": True}
model = AutoModelForCausalLM.from_pretrained(
    model_path, config=config, trust_remote_code=True,
    device_map='auto', revision='3bit_128g', **kwargs)

model.eval()
input_ids = enc('How are you today?', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(enc.decode(outputs[0]))