QwenLM / Qwen2

Qwen2 is the large language model series developed by the Qwen team at Alibaba Cloud.

Quantizing a LoRA-finetuned Qwen2-72B model to 4-bit #764

Closed lijiayi980130 closed 2 weeks ago

lijiayi980130 commented 1 month ago

I want to quantize my LoRA-finetuned Qwen2-72B model, following the GPTQ quantization procedure given in the official tutorial:

```python
import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_path = "/data/lijy/codes/qwen2-72b-instruct/output_qwen/checkpoint-1000"
quant_path = "/data/lijy/codes/qwen2-72b-instruct/qwen2-72b-4bit_model"
quantize_config = BaseQuantizeConfig(
    bits=4,  # 4 or 8
    group_size=128,
    damp_percent=0.01,
    desc_act=False,  # False significantly speeds up inference but may slightly hurt perplexity
    static_groups=False,
    sym=True,
    true_sequential=True,
    model_name_or_path=None,
    model_file_base_name="model",
)
max_len = 512

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoGPTQForCausalLM.from_pretrained(
    model_path,
    quantize_config,
    max_memory={i: "80GB" for i in range(4)},
)

messages = [[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me who you are."},
    {"role": "assistant", "content": "I am a large language model named Qwen..."},
]]
data = []
for msg in messages:
    text = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
    model_inputs = tokenizer([text])
    input_ids = torch.tensor(model_inputs.input_ids[:max_len], dtype=torch.int)
    data.append(dict(input_ids=input_ids, attention_mask=input_ids.ne(tokenizer.pad_token_id)))

model.quantize(data, cache_examples_on_gpu=False)

model.save_quantized(quant_path, use_safetensors=True)
tokenizer.save_pretrained(quant_path)
```

After quantization, inference:

`device = "cuda" # the device to load the model onto model_path = "/data/lijy/codes/qwen2-72b-instruct/qwen2-72b-4bit_model" model = AutoModelForCausalLM.from_pretrained( model_path, torch_dtype="auto", device_map="auto" ) tokenizer = AutoTokenizer.from_pretrained(model_path) result = []

with open("./data/qwen_test.jsonl", "r", encoding="utf-8") as f: for line in tqdm(f.readlines()): example = json.loads(line) messages = example["messages"][:2] target = example["messages"][-1]["content"] text = tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True ) model_inputs = tokenizer([text], return_tensors="pt").to(device) print(model_inputs.input_ids) generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=20) generated_ids = [ output_ids[len(input_ids) :] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) ] print(generated_ids) response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

The response output is all "!!!!!!!!!!!!!!!!!!!", and the printed generated_ids are all 0. What could be the cause?

jklj077 commented 1 month ago

Hi, GPTQ requires a calibration dataset (hundreds of examples should be adequate). Which one have you used?
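
For reference, a minimal sketch of assembling a few hundred calibration examples from a finetuning JSONL file; the file path, record format, and sample count here are assumptions for illustration:

```python
import json

import torch
from transformers import AutoTokenizer

model_path = "/data/lijy/codes/qwen2-72b-instruct/output_qwen/checkpoint-1000"  # path from the snippet above
calib_path = "./data/finetune.jsonl"  # hypothetical file: one {"messages": [...]} object per line
max_len = 512
num_examples = 256  # a few hundred examples is usually adequate for GPTQ calibration

tokenizer = AutoTokenizer.from_pretrained(model_path)

data = []
with open(calib_path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= num_examples:
            break
        messages = json.loads(line)["messages"]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
        # Truncate the token sequence itself (not the batch dimension) to max_len.
        input_ids = torch.tensor(tokenizer([text]).input_ids[0][:max_len], dtype=torch.long)
        data.append(dict(input_ids=input_ids, attention_mask=input_ids.ne(tokenizer.pad_token_id)))
# `data` can then be passed to model.quantize(...) as in the script above.
```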

lijiayi980130 commented 1 month ago

> Hi, GPTQ requires a calibration dataset (hundreds of examples should be adequate). Which one have you used?

The inference test still outputs "!!!!!!!!!!!!!". Could you please help me figure out why?

[screenshot attached: 微信图片_20240711185350]

Here is the quantization script; data from the finetuning dataset was used as the calibration set:

```python
import json

import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer


def qwen_preprocess(lora_data_, tokenizer_, max_len_):
    """
    lora_data_ is the dataset used for LoRA finetuning, roughly 100k examples;
    we take 10k of them as the calibration set.
    """
    print(lora_data_[:10])
    messages = []
    for item in lora_data_[25000:35000]:
        messages.append(item)

    data = []
    for msg in messages:
        text = tokenizer_.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
        model_inputs = tokenizer_([text])
        input_ids = torch.tensor(model_inputs.input_ids[:max_len_], dtype=torch.int)
        # print('input_ids=', input_ids)
        # sys.exit()
        data.append(dict(input_ids=input_ids, attention_mask=input_ids.ne(tokenizer_.pad_token_id)))

    return data

# Load the calibration set
lora_datas = []
with open(quantize_dataset_path, 'r', encoding='utf-8') as f:
    for line in f:
        lora_data = json.loads(line)
        lora_datas.append(lora_data)

# Maximum number of input tokens; longer inputs are truncated
max_len = 512
quantize_config = BaseQuantizeConfig(
        # Sometimes fp16 is faster than the quantized int4 model, because optimizations that exist
        # for fp16 can no longer be used after int4 quantization, which slows things down
        bits=4,  # 4 or 8
        group_size=128,
        # Damping factor, used to reduce oscillation during quantization: if the previous quantization
        # loss within a group is small and the next one is large, a larger value keeps the difference
        # between consecutive losses smaller. What effect does that have in practice?
        damp_percent=0.01,
        desc_act=False,  # False significantly speeds up inference but may slightly hurt perplexity
        # Whether to use static groups; static groups simplify computation but reduce accuracy
        static_groups=False,
        # Whether to use symmetric quantization
        sym=True,
        # Whether to use true sequential quantization; True can improve quantization accuracy
        # but increases computation
        true_sequential=True,
        model_name_or_path=None,
        # Name the output weight files "model"
        model_file_base_name="model"
)
# Qwen1.5 no longer needs trust_remote_code=True; other models may still need it
tokenizer = AutoTokenizer.from_pretrained(model_dir_path, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_pretrained(
        model_dir_path,
        quantize_config,
        device_map="auto",
        # max_memory={i: "20GB" for i in range(4)},  # load the model across multiple GPUs; use either this or device_map
        trust_remote_code=True
)

data = qwen_preprocess(lora_datas, tokenizer, max_len)

# cache_examples_on_gpu: whether intermediate quantization caches are kept on the GPU; set it to False if GPU memory is limited. use_triton: use the Triton acceleration kernels
model.quantize(data, cache_examples_on_gpu=False, batch_size=1, use_triton=True)

model.save_quantized(quantized_path, use_safetensors=True)
tokenizer.save_pretrained(quantized_path)
```

jklj077 commented 1 month ago

Hi, "!!!!" means nan or inf is encountered in calculation, which can be caused by multiple, vastly different things.

For quantized models, it is likely that the quantization itself failed and an unusable model was produced. However, in our experience, this only happens with the smaller models, Qwen2-1.5B-Instruct and Qwen2-7B-Instruct, and with 8-bit GPTQ quantization (when inference is conducted using transformers+auto_gptq; vllm still works fine). The Qwen2-72B-Instruct models are stable in this regard. But since you were using your own finetuned model, we don't really know. The only major difference I can see from our own setup is that you have set use_triton=True.
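
One quick sanity check for the "unusable model" case is to scan the saved quantized checkpoint for NaN/Inf in its floating-point tensors. A minimal sketch, assuming the model was saved as safetensors shards under the quant_path used above:

```python
import glob

import torch
from safetensors.torch import load_file

quant_path = "/data/lijy/codes/qwen2-72b-instruct/qwen2-72b-4bit_model"  # path from the snippets above

bad = []
for shard in sorted(glob.glob(f"{quant_path}/*.safetensors")):
    for name, tensor in load_file(shard).items():
        # Only floating-point tensors (GPTQ scales, norms, embeddings, ...) can hold NaN/Inf;
        # the packed integer qweight/qzeros tensors are skipped.
        if tensor.is_floating_point() and not torch.isfinite(tensor).all():
            bad.append((shard, name))

print("Tensors containing NaN/Inf:", bad if bad else "none")
```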

Here are some things to be checked:

  1. does the original model before LoRA finetuning work after your own quantization?
  2. does your quantized model work with vllm? (see the sketch after this list)
  3. if use_triton=False, does the quantized model work?
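
For check 2, a minimal vLLM sketch, assuming vLLM is installed and quant_path is the GPTQ checkpoint saved above; the tensor_parallel_size and sampling settings are illustrative only:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

quant_path = "/data/lijy/codes/qwen2-72b-instruct/qwen2-72b-4bit_model"

# Build a chat-formatted prompt with the model's own template.
tokenizer = AutoTokenizer.from_pretrained(quant_path)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Tell me who you are."}],
    tokenize=False,
    add_generation_prompt=True,
)

# A 72B model typically needs several GPUs; adjust tensor_parallel_size to your setup.
llm = LLM(model=quant_path, tensor_parallel_size=4)
outputs = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=64))
print(outputs[0].outputs[0].text)
```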

There are other possible reasons including incompatible package versions, buggy nvidia drivers, or broken cards.
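
When debugging possibilities like these, it also helps to record the exact environment. A small sketch; the package list is an assumption about what is installed:

```python
from importlib.metadata import PackageNotFoundError, version

import torch

# Report PyTorch/CUDA details and the versions of the packages involved in quantization and inference.
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda, "| GPUs:", torch.cuda.device_count())
for pkg in ("transformers", "auto-gptq", "optimum", "vllm"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```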

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.