THUDM / CogVLM

a state-of-the-art-level open visual language model | multimodal pretrained model
Apache License 2.0

Abnormal inference results after merging LoRA for a model trained with fp16 #456

Open GondorFu opened 6 months ago

GondorFu commented 6 months ago

System Info

Versions and hardware were set up according to the instructions.

Who can help?

@1049451037

Information

Reproduction

  1. Train LoRA in fp16 by passing --fp16.
  2. Running inference with finetune_cogvlm_demo.py on the model without merging LoRA gives correct results.
  3. Running inference with the LoRA-merged model gives abnormal results.

Expected behavior

I suspect there is a bug in the merge process for models trained with fp16. Could you help locate the problem?

elesun2018 commented 6 months ago

A screenshot of the error? fp16 and bf16 can be converted to each other at any time, so it should not be a data type problem.

GondorFu commented 6 months ago

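A screenshot of the error? fp16 and bf16 can be converted to each other at any time, so it should not be a data type problem.

There is no error. The inference results are simply wrong: without merging the results are correct, but after merging the inference output is all [][][][][][]...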

GondorFu commented 6 months ago
training_main(
    args,
    model_cls=model,
    forward_step_function=forward_step,
    create_dataset_function=partial(create_dataset_function, image_processor, text_processor),
    handle_metrics_function=handle_metrics_function,
    collate_fn=data_collator,
    forward_step_eval=forward_step_eval,
)

if args.use_lora:
    model.get_mixin("lora").merge_lora()
    model.get_mixin("eva").vit_model.get_mixin("lora").merge_lora()
    args.use_lora = False

training_main(
    args,
    model_cls=model,
    forward_step_function=forward_step,
    create_dataset_function=partial(create_dataset_function, image_processor, text_processor),
    handle_metrics_function=handle_metrics_function,
    collate_fn=data_collator,
    forward_step_eval=forward_step_eval,
)

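Both calls run and produce output, but the results from the first call (before merging) are correct while the results from the second call (after merging) are wrong. What could be the reason?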
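For anyone trying to narrow this down, here is a minimal sketch of what a LoRA merge computes and how one might check whether it is numerically faithful. It is not the SAT `merge_lora()` implementation and does not claim to reproduce the bug above. The helper names (`to_fp16`, `matmul`, `merge_low_rank`, `forward_unmerged`) and the tiny matrices are made up for illustration. It assumes the usual LoRA form, where the unmerged layer computes `W x + scale * B (A x)` and the merged weight is `W + scale * B A`. The sketch only shows that storing the merged weight in fp16 can round away a small update.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

def matmul(a, b):
    """Plain nested-list matrix product, enough for a tiny illustration."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_low_rank(W, A, B, scale, cast=float):
    """Hypothetical merge: return cast(W + scale * (B @ A)) element by element."""
    delta = matmul(B, A)
    return [[cast(w + scale * d) for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def forward_unmerged(W, A, B, scale, x):
    """Unmerged forward pass: W x + scale * B (A x)."""
    base = matmul(W, x)
    low_rank = matmul(B, matmul(A, x))
    return [[b[0] + scale * l[0]] for b, l in zip(base, low_rank)]

# Illustrative values only: a 2x2 base weight and a rank-1 update.
W = [[1.0, 0.5], [-0.25, 2.0]]
A = [[0.01, 0.02]]      # shape (1, 2)
B = [[0.01], [0.03]]    # shape (2, 1)
scale = 1.0
x = [[1.0], [1.0]]

y_ref = forward_unmerged(W, A, B, scale, x)
y_fp64 = matmul(merge_low_rank(W, A, B, scale), x)
y_fp16 = matmul(merge_low_rank(W, A, B, scale, cast=to_fp16), x)

err_fp64 = max(abs(a[0] - b[0]) for a, b in zip(y_ref, y_fp64))
err_fp16 = max(abs(a[0] - b[0]) for a, b in zip(y_ref, y_fp16))
print("merge in full precision, max abs error:", err_fp64)
print("merged weight stored as fp16, max abs error:", err_fp16)
```

On a real model the same comparison would use the actual merged and unmerged outputs on a fixed input. A large gap would point at the merge step, while matching outputs would suggest the problem lies elsewhere, for example in how the merged checkpoint is saved or loaded.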

JBurtn commented 4 months ago

A screenshot of the error? fp16 and bf16 can be converted to each other at any time, so it should not be a data type problem.

AFAIK, it can be a problem, because bf16 has a wider range but lower precision than fp16.
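The range versus precision trade-off can be seen with a few lines of standard-library Python. This is a rough sketch: `to_fp16` uses the `struct` module's half-precision format, and `to_bf16` is a hypothetical helper that simulates bfloat16 by keeping the top 16 bits of the float32 encoding with round-to-nearest-even. Neither is part of CogVLM or SAT.

```python
import struct

def to_fp16(x):
    """Round to the nearest IEEE half-precision value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x):
    """Simulate bfloat16: keep the top 16 bits of the float32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Range: a value above the fp16 maximum (65504) overflows in fp16 but not in bf16.
print(to_fp16(70000.0), to_bf16(70000.0))

# Precision: a small offset from 1.0 survives in fp16 but is rounded away in bf16.
print(to_fp16(1.001), to_bf16(1.001))
```

So a value that is representable in one format may overflow or lose its low-order bits when cast to the other, which is why a cast between fp16 and bf16 is not always harmless.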

KevinH48264 commented 3 months ago

Was this ever solved? Also running into this error when trying to just reproduce the CogAgent finetuning results from the official example scripts.

During fine-tuning (finetune_cogagent_demo.py), the predictions are correct, but the merged model has wrong predictions that are completely off during evaluation (merge_model.py and evaluate_cogagent_demo.py).