Abnormal inference results from the merged model after fp16 LoRA training

THUDM / CogVLM

a state-of-the-art-level open visual language model | multimodal pre-trained model

Apache License 2.0

6.06k stars 413 forks

Abnormal inference results from the merged model after fp16 LoRA training #456

Open GondorFu opened 6 months ago

GondorFu commented 6 months ago

System Info

Versions and hardware were set up as instructed.

Who can help?

@1049451037

Information

[X] The official example scripts
[ ] My own modified scripts and tasks

Reproduction

Run LoRA training in fp16 by passing --fp16.
Inference with finetune_cogvlm_demo.py on the model without merging LoRA gives correct results.
Inference with the LoRA-merged model gives abnormal results.

Expected behavior

I suspect there is a bug in the merge process for models trained in fp16. Could you help locate the problem?

elesun2018 commented 6 months ago

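Could you share a screenshot of the error? fp16 and bf16 can be converted to each other at any time, so it should not be a data type problem.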

GondorFu commented 6 months ago

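> Could you share a screenshot of the error? fp16 and bf16 can be converted to each other at any time, so it should not be a data type problem.

There is no error; the inference results are simply wrong. Without merging, the results are correct, but after merging, the inference output is all [][][][][][]...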

GondorFu commented 6 months ago

# First call, before merging LoRA: results are correct
training_main(
    args,
    model_cls=model,
    forward_step_function=forward_step,
    create_dataset_function=partial(create_dataset_function, image_processor, text_processor),
    handle_metrics_function=handle_metrics_function,
    collate_fn=data_collator,
    forward_step_eval=forward_step_eval,
)

if args.use_lora:
    model.get_mixin("lora").merge_lora()
    model.get_mixin("eva").vit_model.get_mixin("lora").merge_lora()
    args.use_lora = False

# Second call, after merging LoRA: results are wrong
training_main(
    args,
    model_cls=model,
    forward_step_function=forward_step,
    create_dataset_function=partial(create_dataset_function, image_processor, text_processor),
    handle_metrics_function=handle_metrics_function,
    collate_fn=data_collator,
    forward_step_eval=forward_step_eval,
)

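Both calls run and produce output, but the results from the first call (before merging) are correct, while the results from the second call (after merging) are wrong. What could be the cause?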
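One mechanism that could be checked here (not confirmed anywhere in this thread) is rounding during the merge itself. A LoRA merge conceptually stores W + scale * (B @ A) back into the weight; if that sum is formed and stored in fp16, an update much smaller than the weight can be rounded away. The toy sketch below uses only the Python standard library and made-up 2x2 values; `merge_lora` here is a hypothetical helper for illustration, not the `merge_lora` method from the repository, whose internals may differ.

```python
import struct

def fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value (in-range inputs only)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def merge_lora(W, A, B, scale, rnd=lambda v: v):
    """Toy merge: W + scale * (B @ A), rounding every stored value with `rnd`.

    W is rows x cols, B is rows x rank, A is rank x cols (plain nested lists).
    """
    rank = len(A)
    merged = []
    for i, w_row in enumerate(W):
        row = []
        for j, w in enumerate(w_row):
            delta = rnd(scale * sum(B[i][r] * A[r][j] for r in range(rank)))
            row.append(rnd(rnd(w) + delta))
        merged.append(row)
    return merged

# Made-up values: weights of order 1, LoRA update of order 1e-4.
W = [[1.0, -2.0], [0.5, 4.0]]
A = [[0.01, 0.02]]        # rank 1
B = [[0.01], [0.02]]
scale = 2.0

exact = merge_lora(W, A, B, scale)                  # Python floats (double precision)
merged_fp16 = merge_lora(W, A, B, scale, rnd=fp16)  # every value rounded to fp16

for i in range(2):
    for j in range(2):
        print(f"W={W[i][j]:+.4f}  exact={exact[i][j]:+.6f}  fp16={merged_fp16[i][j]:+.6f}")
```

With these numbers, entries such as 1.0 and 4.0 come back unchanged from the fp16 merge, meaning the LoRA update was lost for them, whereas the unmerged forward pass would still apply the low-rank term separately. Rounding of this kind would not by itself explain output as degenerate as `[][][]...`, so it is only one thing to rule out.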

JBurtn commented 4 months ago

> Could you share a screenshot of the error? fp16 and bf16 can be converted to each other at any time, so it should not be a data type problem.

AFAIK, it can be a problem, because bf16 has a wider range but lower precision than fp16.
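To make the range versus precision point concrete, here is a small standard-library sketch that rounds a Python float to fp16 and to bf16. The helper names `to_fp16` and `to_bf16` are made up for this illustration, and the bf16 rounding is emulated by keeping the top 16 bits of the float32 encoding with round-to-nearest-even, which is an assumption about how a given framework rounds.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value; values beyond the fp16 range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Emulate bf16 rounding for finite x: keep the top 16 bits of the float32 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Range: 70000 exceeds the largest finite fp16 value (65504) but fits in bf16.
print(to_fp16(70000.0), to_bf16(70000.0))

# Precision: fp16 keeps 10 mantissa bits, bf16 only 7, so bf16 cannot
# distinguish 1.001 from 1.0 while fp16 can.
print(to_fp16(1.001), to_bf16(1.001))
```

So a conversion between the two formats is always possible but not always lossless: going from bf16 to fp16 can overflow large values, and going from fp16 to bf16 can discard small differences.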

KevinH48264 commented 3 months ago

Was this ever solved? Also running into this error when trying to just reproduce the CogAgent finetuning results from the official example scripts.

During fine-tuning (finetune_cogagent_demo.py), the predictions are correct, but the merged model has wrong predictions that are completely off during evaluation (merge_model.py and evaluate_cogagent_demo.py).
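One way to narrow this down is to feed identical inputs to the unmerged and the merged model and find the first place where they diverge. The sketch below shows only the comparison logic; `compare_outputs` and the toy functions are hypothetical stand-ins, and in practice the two callables would wrap whatever forward pass each checkpoint exposes and return a flat list of floats (for example, logits).

```python
import math

def compare_outputs(reference_fn, candidate_fn, inputs, atol=1e-3):
    """Return (index, reason) pairs for inputs where the two callables disagree."""
    problems = []
    for i, x in enumerate(inputs):
        ref, cand = reference_fn(x), candidate_fn(x)
        if any(not math.isfinite(v) for v in cand):
            problems.append((i, "non-finite value in candidate output"))
        elif len(ref) != len(cand):
            problems.append((i, "length mismatch"))
        elif max(abs(a - b) for a, b in zip(ref, cand)) > atol:
            problems.append((i, "values differ beyond tolerance"))
    return problems

# Toy stand-ins: a base weight of 2.0 plus a separate "LoRA" term of 0.5.
def unmerged(x):
    return [2.0 * v + 0.5 * v for v in x]

def merged_ok(x):
    return [2.5 * v for v in x]

def merged_bad(x):
    return [float("nan") for _ in x]

print(compare_outputs(unmerged, merged_ok, [[1.0, 2.0]]))
print(compare_outputs(unmerged, merged_bad, [[1.0]]))
```

Non-finite values in the merged output would point toward an overflow or dtype problem, while finite but different values would point toward the merged weights themselves, for example a scaling factor or a layer that was not merged.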