BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0
799 stars 61 forks

Fine-tuning error #102

Closed chenzhu005774 closed 1 week ago

chenzhu005774 commented 1 week ago

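After merging the LoRA weights with the base LLM, I ran sh script/train/finetune_lora.sh to fine-tune the model and got an error. The error log is below. What is the problem, and how can I fix it?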

Traceback (most recent call last):
  File "/home/Bunny-main/bunny/train/train.py", line 393, in <module>
    train()
  File "/home/Bunny-main/bunny/train/train.py", line 380, in train
    non_lora_state_dict = get_peft_state_non_lora_maybe_zero_3(
  File "/home/Bunny-main/bunny/train/train.py", line 118, in get_peft_state_non_lora_maybe_zero_3
    to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
  File "/home/Bunny-main/bunny/train/train.py", line 118, in <dictcomp>
    to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
  File "/home/Bunny-main/bunny/train/train.py", line 81, in maybe_zero_3
    with zero.GatheredParameters([param]):
  File "/root/miniconda3/envs/bunny/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2178, in __exit__
    self.params[0].partition(param_list=self.params, has_been_updated=False)
  File "/root/miniconda3/envs/bunny/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1329, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/root/miniconda3/envs/bunny/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1478, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/root/miniconda3/envs/bunny/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/bunny/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1511, in _partition_param
    free_param(param)
  File "/root/miniconda3/envs/bunny/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/bunny/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 281, in free_param
    assert not param.ds_active_sub_modules, param.ds_summary()
AssertionError: {'id': 451, 'status': 'AVAILABLE', 'numel': 2949120, 'ds_numel': 2949120, 'shape': (2560, 1152), 'ds_shape': (2560, 1152), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {2494}, 'ds_tensor.shape': torch.Size([1474560])}

[The same traceback and AssertionError are printed a second time.]

[2024-07-04 16:56:43,965] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 799
[2024-07-04 16:56:43,966] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 800
[2024-07-04 16:56:43,966] [ERROR] [launch.py:322:sigkill_handler] ['/root/miniconda3/envs/bunny/bin/python', '-u', 'bunny/train/train.py', '--local_rank=1', '--lora_enable', 'True', '--lora_r', '128', '--lora_alpha', '256', '--mm_projector_lr', '2e-5', '--deepspeed', './script/deepspeed/zero3.json', '--model_name_or_path', './outmodel', '--model_type', 'phi-2', '--version', 'bunny', '--data_path', './finetune/test.json', '--image_folder', './finetune/', '--vision_tower', '../siglip-so400m', '--mm_projector_type', 'mlp2x_gelu', '--image_aspect_ratio', 'pad', '--group_by_modality_length', 'False', '--bf16', 'True', '--output_dir', './checkpoints-phi-2/bunny-lora-phi-2', '--num_train_epochs', '1', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '500', '--save_total_limit', '1', '--learning_rate', '2e-4', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'none'] exits with return code = 1

chenzhu005774 commented 1 week ago

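After changing the argument from --deepspeed ./script/deepspeed/zero3.json \ to --deepspeed ./script/deepspeed/zero2.json \, I was able to obtain the following fine-tuned files. [image]

I am not a very experienced developer. I get this error both when using the result directly and when merging it: [image] [image]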
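The workaround above swaps one DeepSpeed config file for another. In DeepSpeed configs the ZeRO stage is set under zero_optimization.stage: stage 3 partitions the model parameters across ranks (which is why train.py gathers them with zero.GatheredParameters before saving), while stage 2 partitions only optimizer state and gradients. The sketch below simply reads a config and reports its stage. It assumes that zero2.json and zero3.json in this repository follow that standard layout, which the thread does not show; the helper name zero_stage is made up for illustration.

```python
import json
from pathlib import Path


def zero_stage(config_path):
    """Return the ZeRO stage declared in a DeepSpeed JSON config.

    Returns 0 when the config has no "zero_optimization" section or no
    "stage" key, since ZeRO is disabled by default.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return config.get("zero_optimization", {}).get("stage", 0)


# Example usage (paths taken from the thread; their contents are an assumption):
#   zero_stage("./script/deepspeed/zero3.json")
#   zero_stage("./script/deepspeed/zero2.json")
```

Checking the stage this way confirms which partitioning mode a run actually uses before comparing behaviour between the two configs.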

Isaachhh commented 1 week ago

--model-base ./outmodel

chenzhu005774 commented 1 week ago

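Thanks. After changing the model-base argument, the merge works and I can also run and load the model with [image].

My merged file structure is as follows: [image]

However, I want to load the model directly through Transformers: [image]

It reports that two files are missing from my merged directory: [image] [image]

I copied these two files over from the source code: [image]

Then I loaded the model directly with Transformers and called it: [image] [image]

It reports that there is no process_images. How should I fix this?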

Isaachhh commented 1 week ago

The Quickstart snippet is meant for released models such as Bunny-v1.0-3B (SigLIP + Phi-2). For those, we manually combined some configuration code into a single file for users' convenience. You can compare modeling_bunny_phi.py and configuration_bunny_phi.py with the related parts of the Bunny source code to see the difference.

For other models, including ones you trained yourself, we currently only support loading them with the Bunny source code installed. Alternatively, you can copy modeling_bunny_phi.py and configuration_bunny_phi.py into your model directory and edit config.json.

BTW, offset_bos should be 0 for Phi-2-based Bunny.
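As a rough illustration of the second option above (copying the two files into the model directory and editing config.json), the sketch below copies the given source files next to the checkpoint and writes an auto_map entry into config.json. auto_map is the Transformers mechanism for pointing Auto classes at code shipped alongside a checkpoint, used together with trust_remote_code=True. The thread does not say which config.json edits are required, so treating auto_map as the edit is an assumption, and the class names and paths in the usage comment are hypothetical placeholders that should be checked against the released Bunny-v1.0-3B files.

```python
import json
import shutil
from pathlib import Path


def attach_custom_code(model_dir, source_files, auto_map):
    """Copy custom modeling/configuration files into model_dir and
    record them in config.json under the "auto_map" key.

    auto_map maps an Auto class name to "<module>.<ClassName>", where
    <module> is the copied file name without the .py suffix.
    """
    model_dir = Path(model_dir)
    for src in source_files:
        # Keep the original file name so the module path in auto_map matches.
        shutil.copy(src, model_dir / Path(src).name)

    config_path = model_dir / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["auto_map"] = dict(auto_map)
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Example usage (all names below are assumptions, not confirmed by the thread):
#   attach_custom_code(
#       "./merged-model",
#       ["modeling_bunny_phi.py", "configuration_bunny_phi.py"],
#       {
#           "AutoConfig": "configuration_bunny_phi.BunnyPhiConfig",
#           "AutoModelForCausalLM": "modeling_bunny_phi.BunnyPhiForCausalLM",
#       },
#   )
```

Whether the copied files work unchanged with a self-trained checkpoint depends on how they differ from the Bunny source code, which is the comparison the reply above suggests making.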

chenzhu005774 commented 1 week ago

OK, understood. Thanks.