OpenBMB / MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Apache License 2.0
11.95k stars, 841 forks

[BUG] Cannot load the model after LoRA fine-tuning #347

Closed ziyinwang98 closed 2 months ago

ziyinwang98 commented 2 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior

I recently also ran into poor results with the officially provided LoRA fine-tuning code. Following https://github.com/OpenBMB/MiniCPM-V/issues/333, I pulled the latest code and the latest model code. After training with the latest finetune code, loading the model the way the finetune README describes fails. The posts I found about the same error all seem to describe a situation different from mine.

Traceback (most recent call last):
  File "/home/xxx/MiniCPM-V-new/evaluation_qa_minicpm.py", line 15, in <module>
    model = AutoPeftModel.from_pretrained(
  File "/home/xxx/miniforge-pypy3/envs/minicpm/lib/python3.10/site-packages/peft/auto.py", line 97, in from_pretrained
    parent_library = importlib.import_module(parent_library_name)
  File "/home/xxx/miniforge-pypy3/envs/minicpm/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'transformers_modules'

Expected Behavior

The model should load normally and be usable for inference.

Steps To Reproduce

  1. Run LoRA training with the latest code.
  2. Use AutoPeftModel.from_pretrained to load a checkpoint that was saved automatically during training.
  3. The error above appears.
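The traceback shows PEFT calling importlib.import_module on a parent library name and failing on 'transformers_modules'. A minimal diagnostic sketch, assuming the checkpoint directory contains an adapter_config.json whose auto_mapping entry carries a parent_library field (the field layout and the helper name check_adapter_parent_library are assumptions for illustration, not something confirmed in this thread):

```python
import importlib.util
import json
from pathlib import Path


def check_adapter_parent_library(adapter_dir):
    """Return (parent_library, importable) for a saved adapter directory.

    Hypothetical helper: it assumes adapter_config.json may contain
    {"auto_mapping": {"parent_library": "<module name>"}}.
    """
    config_path = Path(adapter_dir) / "adapter_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    auto_mapping = config.get("auto_mapping") or {}
    parent = auto_mapping.get("parent_library")
    if parent is None:
        # Nothing recorded, so there is no extra module to import.
        return None, True
    try:
        # find_spec imports the parent package of a dotted name and
        # raises ModuleNotFoundError if that package is missing.
        importable = importlib.util.find_spec(parent) is not None
    except (ImportError, ValueError):
        importable = False
    return parent, importable
```

If the second value comes back False for a name under transformers_modules, the checkpoint refers to dynamically loaded model code that the current Python process cannot import on its own, which is consistent with the ModuleNotFoundError above.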

Environment

- OS: Ubuntu 20.04
- Python: 3.10.14
- Transformers: 4.40.0
- PyTorch: 2.1.2
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1

Anything else?

No response

qyc-98 commented 2 months ago

I suggest re-running the training and then loading the model with the latest method.

ziyinwang98 commented 2 months ago

> I suggest re-running the training and then loading the model with the latest method.

Hello, and thanks for the reply. I tried again with the newly updated finetune and trainer files, and training now fails immediately with the following error:

Traceback (most recent call last):
  File "/home/xxx/MiniCPM-V-new/finetune/finetune.py", line 327, in <module>
    train()
  File "/home/xxx/MiniCPM-V-new/finetune/finetune.py", line 317, in train
    trainer.train()
  File "/home/xxx/miniforge-pypy3/envs/minicpm/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/home/xxx/miniforge-pypy3/envs/minicpm/lib/python3.10/site-packages/transformers/trainer.py", line 2278, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/home/xxx/miniforge-pypy3/envs/minicpm/lib/python3.10/site-packages/transformers/trainer.py", line 2644, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/home/xxx/miniforge-pypy3/envs/minicpm/lib/python3.10/site-packages/transformers/trainer.py", line 3756, in _nested_gather
    tensors = distributed_concat(tensors)
  File "/home/xxx/miniforge-pypy3/envs/minicpm/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 221, in distributed_concat
    dist.all_gather(output_tensors, tensor)
  File "/home/xxx/miniforge-pypy3/envs/minicpm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/xxx/miniforge-pypy3/envs/minicpm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2806, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, internal error - please report this issue to the NCCL developers, NCCL version 2.18.1
ncclInternalError: Internal check failed.
Last error: Socket recv failed while polling for opId=0x7f5e1022b1d0
[2024-07-17 03:53:59,450] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2996165 closing signal SIGTERM
[2024-07-17 03:53:59,451] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2996167 closing signal SIGTERM
[2024-07-17 03:53:59,451] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2996168 closing signal SIGTERM
[2024-07-17 03:54:00,130] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 2996166) of binary: /home/xxx/miniforge-pypy3/envs/minicpm/bin/python
Traceback (most recent call last):
  File "/home/xxx/miniforge-pypy3/envs/minicpm/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/xxx/miniforge-pypy3/envs/minicpm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/xxx/miniforge-pypy3/envs/minicpm/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/xxx/miniforge-pypy3/envs/minicpm/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/xxx/miniforge-pypy3/envs/minicpm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/xxx/miniforge-pypy3/envs/minicpm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

ziyinwang98 commented 2 months ago

Before I swapped in the latest finetune and trainer files, training ran normally, and I could load both model and lora_model using the method in the latest README.
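For readers unfamiliar with loading a base model and a LoRA adapter as two separate objects, here is a minimal sketch. It is not copied from the repository README; the function name load_lora_model and both path arguments are placeholders, and the assumption is that the base model needs trust_remote_code=True because it ships custom model code:

```python
def load_lora_model(base_model_path, adapter_path):
    """Load the base model first, then attach the LoRA adapter to it.

    Illustrative sketch only; both paths are placeholders supplied by
    the caller. Imports are local so the module can be imported on a
    machine without transformers or peft installed.
    """
    from transformers import AutoModel
    from peft import PeftModel

    # trust_remote_code lets transformers register the model's custom
    # code before PEFT wraps the already constructed model object.
    model = AutoModel.from_pretrained(base_model_path, trust_remote_code=True)
    lora_model = PeftModel.from_pretrained(model, adapter_path)
    return lora_model.eval()
```

Because the base model is constructed explicitly, PEFT does not have to look up and import a parent library from the adapter config, which is the step that fails in the AutoPeftModel traceback earlier in this thread.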

qyc-98 commented 2 months ago

Try training on a single GPU first and see whether it still reports an error.
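One way to set up such a single-GPU run is sketched below, assuming training is launched through torchrun as the traceback above suggests. The helper name single_gpu_launch is hypothetical, and the script arguments are left to the caller because the thread does not list them:

```python
import os


def single_gpu_launch(script, script_args=(), gpu_id=0, nproc=1):
    """Build a torchrun command and environment for a small test run.

    Hypothetical helper: script and script_args are whatever the
    training run normally uses; nothing here is taken from the repo.
    """
    env = dict(os.environ)
    # Expose only one device so no cross-GPU collectives are needed.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # Verbose NCCL logging helps if the run is later repeated with
    # several processes and the all_gather failure returns.
    env["NCCL_DEBUG"] = "INFO"
    cmd = ["torchrun", f"--nproc_per_node={nproc}", script, *script_args]
    return cmd, env
```

The returned pair can be passed to subprocess.run(cmd, env=env, check=True). If the single-process run succeeds while the multi-GPU run fails, the problem is more likely in inter-process communication than in the fine-tuning code itself.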

rgallardone commented 2 months ago

Hi!

I'm having the same issue as @ziyinwang98. I used the latest code for fine-tuning the model and storing the checkpoints, but when I want to load the model using the latest method, I get the error:

ModuleNotFoundError: No module named 'transformers_modules'

Can you please help me?

qyc-98 commented 2 months ago

Did you try loading it with this code? [image]

rgallardone commented 2 months ago

Hi @qyc-98 ! It worked after I updated to the latest code on the repository.

Thank you for your help!