THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs | 开源多语言多模态对话模型
Apache License 2.0

Model inference is abnormal after prompt (prefix) tuning #227

Closed mumu029 closed 3 months ago

mumu029 commented 3 months ago

System Info / 系統信息

import torch
from functools import partial

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    GenerationConfig,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
from peft import PrefixTuningConfig, TaskType, get_peft_model

model_path = "/home/data/glm-4-9b-chat/"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code = True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code = True, torch_dtype = torch.float16)
peft_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_attention_heads=2,
    token_dim=256,
    prefix_projection=False,
    num_virtual_tokens=100
)
model = get_peft_model(model, peft_config)

generation_config = GenerationConfig(
        max_new_tokens = 64,
        eos_token_id = [151329, 151336, 151338],
        pad_token_id = 151329
    )
train_config = Seq2SeqTrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    max_steps=1000,
    fp16=True,
    learning_rate=5e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    dataloader_num_workers=16,
    per_device_eval_batch_size=2,
    logging_dir="./output",
    log_level="info",
    logging_steps=30,
    evaluation_strategy="steps",
    eval_steps=60,
    save_steps=800,
    predict_with_generate=True,
    remove_unused_columns=False,
    generation_config=generation_config
)
trainer = Seq2SeqTrainer(
        model=model,
        args=train_config,
        train_dataset=train_data,
        eval_dataset=val_data,
        compute_metrics=partial(compute_metrics,tokenizer=tokenizer),
        data_collator=DataCollatorForSeq2Seq(
            tokenizer=tokenizer,
            return_tensors="pt",
            padding="longest"
        )
    )
trainer.train()
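
For reference, a quick sanity check that the prefix adapter is actually attached and is the only thing being trained (a minimal sketch, reusing the `model` object built above and peft's standard helpers):

# Sanity check on the PEFT setup: only the prefix parameters should be trainable.
model.print_trainable_parameters()
print(model.peft_config["default"])  # confirm num_virtual_tokens, token_dim, etc.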

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

device = "cuda"  # inference device used by .to(device) below

adapter_path = "xxxxxxx"
model = AutoPeftModelForCausalLM.from_pretrained(adapter_path, trust_remote_code=True, device_map = "auto", torch_dtype = torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model.peft_config["default"].base_model_name_or_path,trust_remote_code = True)

message = tokenizer.apply_chat_template([{"role" : "user", "content" : "Generate a topic based on the content I gave you.\nContent: This review article provides a comprehensive overview of the state-of-the-art in content-based image and video retrieval (CBIR). It covers the fundamental concepts, advanced techniques, and system design aspects of CBIR, with a focus on recent advancements and future directions. The article discusses the evolution of CBIR systems from early methods to sophisticated techniques, emphasizing the role of deep learning and neural networks. It also addresses the challenges of performance evaluation, the significance of advanced descriptors, and the impact of system architecture and databases. Furthermore, the review explores the trends and future directions in CBIR, including the bridging of the semantic gap, the integration of cross-modal retrieval, and the potential for CBIR to be integrated with other AI technologies. The article aims to serve as a reference for researchers and practitioners in the field, highlighting the dynamic nature of CBIR and its potential to shape the future of multimedia information retrieval."}],
                                        add_generation_prompt=True,
                                        return_tensors="pt",
                                        return_dict=True,
                                        tokenize=True).to(device)

output = model.generate(**message,max_new_tokens=64,do_sample=True,top_p=0.8,temperature=0.8,repetition_penalty= 1.2,eos_token_id = model.config.eos_token_id)
# temp = output[0][len(message[0]):]
temp = output[0][len(message["input_ids"][0]):]
print(temp)
output_text = tokenizer.decode(temp)
print(type(output_text))
print(output_text)
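
One way to narrow down whether the gibberish comes from the trained prefix or from how the checkpoint is loaded is to generate once with the adapter disabled and compare. A minimal sketch, assuming peft's disable_adapter() context manager bypasses the prefix for prompt-learning adapters as in recent peft versions, and reusing the `model`, `tokenizer` and `message` objects above:

# Generate with the prefix adapter temporarily disabled, for comparison with the adapted output.
with model.disable_adapter():
    base_output = model.generate(**message, max_new_tokens=64, do_sample=False,
                                 eos_token_id=model.config.eos_token_id)
print(tokenizer.decode(base_output[0][len(message["input_ids"][0]):]))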

Expected behavior / 期待表现

Expected: Advancements and Future Directions in Content-Based Image and Video Retrieval: A Comprehensive Review
Actual: gibberish

mumu029 commented 3 months ago

When I fine-tune GLM with P-Tuning v2, the loss drops noticeably, but the model outputs gibberish at inference time, even though the inference input comes from the training data. Training data example:

{'input_ids': [151331, 151333, 151336, 198, 30989, 264, 8543, 3118, 389, 279, 2213, 358, 6551, 498, 624, 2762, 25, 1986, 3395, 4549, 6081, 264, 15805, 23503, 315, 279, 86117, 323, 4763, 18384, 11, 18173, 87497, 18906, 11, 88390, 25107, 11, 31652, 11, 323, 10654, 13, 1084, 40102, 279, 14964, 8356, 315, 86117, 304, 5257, 30358, 11, 2670, 9433, 24162, 11, 6371, 4763, 11, 323, 17917, 5440, 13, 576, 4549, 1083, 14220, 88390, 8775, 11, 5546, 2660, 23406, 11, 12339, 10515, 11, 5777, 323, 4842, 4714, 11, 323, 3853, 18310, 13, 3216, 8240, 458, 304, 30193, 6358, 315, 1493, 13557, 11, 279, 3395, 71242, 279, 9020, 3476, 315, 86117, 304, 47669, 287, 1995, 323, 12339, 304, 279, 7377, 4231, 13, 151337, 198, 34, 46376, 323, 8233, 25, 362, 66376, 10289, 315, 85052, 11, 11702, 21964, 11, 323, 68179, 151329], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 198, 34, 46376, 323, 8233, 25, 362, 66376, 10289, 315, 85052, 11, 11702, 21964, 11, 323, 68179, 151329]}
[gMASK] <sop> <|user|> 
Generate a topic based on the content I gave you.
Content:This review article offers a comprehensive examination of the cryptography and security landscape, covering foundational concepts, cryptographic algorithms, protocols, and standards. It explores the practical applications of cryptography in various domains, including cloud computing, mobile security, and blockchain technology. The article also addresses cryptographic attacks, countermeasures, privacy concerns, legal and policy issues, and future trends. By providing an in-depth analysis of these aspects, the review underscores the critical role of cryptography in safeguarding information and privacy in the digital age. <|assistant|> 
Cryptography and Security: A Comprehensive Review of Algorithms, Protocols, and Challenges <|endoftext|>
zRzRzRzRzRzRzR commented 3 months ago

Perhaps the model was not loaded correctly, or try replacing <|endoftext|> with <|user|> during training. The format looks fine, and I can't spot the problem for now. Does LoRA work normally?
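
In case it is useful, a minimal sketch of that suggested label change, using the token ids visible in the training example above (151329 = <|endoftext|>, 151336 = <|user|>); `example` is a hypothetical per-sample dict with input_ids and labels lists:

# Hypothetical preprocessing tweak: end the target with <|user|> (151336) instead of <|endoftext|> (151329).
def swap_trailing_eos(example, old_id=151329, new_id=151336):
    example["input_ids"] = [new_id if t == old_id else t for t in example["input_ids"]]
    example["labels"] = [new_id if t == old_id else t for t in example["labels"]]
    return example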

mumu029 commented 3 months ago

Perhaps the model was not loaded correctly, or try replacing <|endoftext|> with <|user|> during training. The format looks fine, and I can't spot the problem for now. Does LoRA work normally?

Thank you very much for your reply.

  1. If the model was not loaded correctly, what do you think could cause that, and how can I tell whether it was loaded correctly? When I run inference with the model that has not been fine-tuned, the output is normal; but as soon as I add the fine-tuned prompt (prefix), the output turns into gibberish.
  2. I had also considered replacing <|endoftext|> with <|user|>, but I noticed that at inference time the model does not emit <|endoftext|> and then keep generating; it simply never outputs these special tokens at all, so I suspect this is not the cause. That said, large models are hard to interpret, so I will try the <|endoftext|> → <|user|> swap next.
  3. Some more details on my setup: I only have 42 training examples (I don't know whether 42 examples with 100 virtual tokens is reasonable, but if I reduce the number of virtual tokens I run into the problem described next). Also, during fine-tuning the grad norm occasionally becomes NaN (the loss may then drop straight to 0, or nothing visible happens); when that occurs, either increasing num_virtual_tokens or restarting fine-tuning "solves" this unexpected problem (see the sketch after this list for a related check).
  4. After asking around, someone suggested catastrophic forgetting, so I added some common-sense questions to the training set. Retrying with that still did not work.
  5. I have not tried LoRA yet.
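
As mentioned in point 3, a related check is whether the NaN grad norms left NaN/Inf values in the saved prefix weights. A minimal sketch, assuming the AutoPeftModelForCausalLM object from the reproduction code above and that peft stores the prefix weights under a prompt_encoder module:

# Look for NaN/Inf in the loaded prompt-encoder (prefix) weights.
bad = [name for name, p in model.named_parameters()
       if "prompt_encoder" in name and (torch.isnan(p).any() or torch.isinf(p).any())]
print("parameters with NaN/Inf:", bad or "none")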


zRzRzRzRzRzRzR commented 3 months ago

My guess is that you trained with fp16 rather than bf16; this model must be trained in bf16.
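
For reference, a minimal sketch of that change, switching both model loading and the training arguments from fp16 to bf16 (variable names as in the script at the top of this issue):

# Load the base model and train in bf16 instead of fp16, as suggested above.
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True,
                                             torch_dtype=torch.bfloat16)
train_config = Seq2SeqTrainingArguments(
    output_dir="./output",
    bf16=True,  # replaces fp16=True
    # ... remaining arguments unchanged from the original script
)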