OpenBMB / MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Apache License 2.0

[BUG] Error when loading the model with the official code after LoRA fine-tuning #177

Closed daihuidai closed 4 months ago

daihuidai commented 4 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior


LoRA fine-tuning ran without errors and produced several checkpoints, but loading the model from the adapter directory with the officially provided code raised an error.

Expected Behavior

No response

Steps To Reproduce

No response

Environment

OS: Ubuntu 20.04
Python: 3.9
Transformers: 4.40.0
PyTorch: 2.1.0
CUDA: 12.0

Anything else?

No response

LongIslandWithoutIceTea commented 4 months ago

Hi daihuidai,

I believe you need to fetch the latest modeling_minicpm.py file from Hugging Face, since they recently added LoRA support by adding get_input_embeddings() and other functions.

Then you can load the checkpoint like I just did.
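For reference, a minimal loading sketch, assuming the fine-tune was peft-based and the checkpoint directory contains adapter_config.json (the path below is a placeholder):

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Placeholder: a checkpoint directory written by the LoRA fine-tuning run,
# containing adapter_config.json plus the adapter weights.
adapter_path = "output/checkpoint-1000"

# AutoPeftModelForCausalLM reads adapter_config.json, fetches the base model
# it points to, and attaches the LoRA weights on top. trust_remote_code is
# needed so the updated modeling_minicpm.py is actually used.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(adapter_path, trust_remote_code=True)
```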

daihuidai commented 4 months ago

@LongIslandWithoutIceTea Thank you very much for the reminder; it was indeed a code-update problem. After updating the source code I can load the weights saved by the old code, but training with the new code now runs out of memory (OOM). I don't know what changed in the new code to make GPU memory usage grow so much during training; have you noticed this problem as well?

daihuidai commented 4 months ago

@iceflame89 Hello, may I ask why the newly updated LoRA fine-tuning code consumes more GPU memory than the previous version? Two 3090s used to be sufficient, but after the update I hit OOM.

qyc-98 commented 4 months ago

Hello,

Thank you for raising this issue. We have identified that previously, when calling get_peft_model, all of the model's parameters except the LoRA components had requires_grad set to False by default. This inadvertently prevented the VPM and the resampler from participating in training even when tune_vision was set to True.

We have now corrected this setting to ensure that if tune_vision is set to True, both VPM and resampler are correctly included in the training. This adjustment means there will be additional GPU memory usage, which is the expected overhead from training VPM and the resampler.
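For illustration only, a minimal sketch of the kind of fix described, assuming a peft-based setup; the model id, the tune_vision flag, and the "vpm"/"resampler" module names are taken from this thread and the public checkpoints, not from the exact patch:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

# Assumed base checkpoint; substitute whichever model you are fine-tuning.
base_model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-2_6", trust_remote_code=True, torch_dtype=torch.bfloat16
)

tune_vision = True  # placeholder for the training-script argument

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    # Regex matching the LLM attention projections; an assumption for this sketch.
    target_modules=r"llm\..*layers\.\d+\.self_attn\.(q_proj|k_proj|v_proj)",
)

model = get_peft_model(base_model, lora_config)

# get_peft_model freezes every parameter except the LoRA adapters, so the
# vision tower (VPM) and the resampler must be unfrozen again explicitly;
# this extra trainable state accounts for the higher memory use.
if tune_vision:
    for name, param in model.named_parameters():
        if "vpm" in name or "resampler" in name:
            param.requires_grad_(True)
```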

Please pull the latest changes from our repository to benefit from this update. If you encounter any further issues or have additional feedback, do not hesitate to reach out.