OpenBMB / MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Apache License 2.0
11.61k stars · 815 forks

[BUG] Inference error after fine-tuning #175

Closed Zmeo closed 3 months ago

Zmeo commented 3 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

After fine-tuning, running inference with the following code raises an error:

```python
from chat import MiniCPMVChat, img2base64
import torch
import json
from peft import AutoPeftModelForCausalLM

model_path = "/home/kas/lx/code/MiniCPM-V/finetune/output/3d-100-new/output_minicpmv2"
image_path = '/home/kas/lx/data/datasets/app_store_images/crop_norm_all/2368107_t0171411071ab005364.jpg'

torch.manual_seed(0)

chat_model = MiniCPMVChat(model_path)

im_64 = img2base64(image_path)

# Note: the leading "<image>" tag was swallowed by Markdown in the original post;
# the actual code uses "<image>\n定位寻城记...".
question = "<image>\n定位寻城记.\n以下是图片中可参考文本的坐标: '丁丁和爸爸'(346,615),'第二期'(123,727),'垃圾分类奇遇记'(382,855),'悦听'(365,083),'今晚,我们都是'(226,325),'新闻'(115,927),'视频'(508,927),'服务'(311,927),'寻城记'(135,551).\nAnswer:"
msgs = [{'role': 'user', 'content': question}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)
```

The error message:

```
Loading checkpoint shards: 100%|███████████████████████████████████████████| 2/2 [02:01<00:00, 60.98s/it]
Traceback (most recent call last):
  File "/home/kas/lx/code/MiniCPM-V/demo.py", line 20, in <module>
    answer = chatmodel.chat(inputs)
  File "/home/kas/lx/code/MiniCPM-V/chat.py", line 197, in chat
    return self.model.chat(input)
  File "/home/kas/lx/code/MiniCPM-V/chat.py", line 153, in chat
    answer, context, _ = self.model.chat(
  File "/home/kas/.cache/huggingface/modules/transformers_modules/output_minicpmv2/modeling_minicpmv.py", line 355, in chat
    res, vision_hidden_states = self.generate(
  File "/home/kas/.cache/huggingface/modules/transformers_modules/output_minicpmv2/modeling_minicpmv.py", line 266, in generate
    model_inputs = self._process_list(tokenizer, data_list, max_inp_length)
  File "/home/kas/.cache/huggingface/modules/transformers_modules/output_minicpmv2/modeling_minicpmv.py", line 183, in _process_list
    self._convert_to_tensors(tokenizer, data, max_inp_length)
  File "/home/kas/.cache/huggingface/modules/transformers_modules/output_minicpmv2/modeling_minicpmv.py", line 163, in _convert_to_tensors
    image_bound = torch.hstack(
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.
```
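For context on what this error means: `image_bound` is built by stacking the positions of the image start and end markers found in the prompt, so if the prompt contains a mismatched number of markers (here 8 vs. 7), `torch.hstack` fails. A minimal sketch of the failing shape combination (the shapes are illustrative, not taken from the model):

```python
import torch

# Illustrative: 8 image-start positions were found but only 7 image-end positions.
starts = torch.arange(8).unsqueeze(-1)  # shape (8, 1)
ends = torch.arange(7).unsqueeze(-1)    # shape (7, 1)

try:
    # hstack concatenates along dim 1, so dim 0 must match -- it doesn't.
    torch.hstack([starts, ends])
except RuntimeError as e:
    print(e)
```

This is why a stray or missing `<image>` placeholder in the prompt surfaces as a tensor-size mismatch rather than a clearer validation error.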

The training shell script:

```bash
#!/bin/bash

GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001
LLM_TYPE="minicpm" # if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm

MODEL="/home/kas/lx/code/MiniCPM-V/openbmb/MiniCPM-V-2"

# ATTENTION: specify the path to your training data, which should be a json file
# consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="/home/kas/lx/code/MiniCPM-V/finetune/data/05_16_add_05_13_duplicate_three_dots.json"
EVAL_DATA="/home/kas/lx/code/MiniCPM-V/finetune/data/05_16_add_05_13_duplicate_three_dots.json"

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

rm -rf /home/kas/lx/code/MiniCPM-V/finetune/output

/home/kas/.conda/envs/vllm/bin/python -m torch.distributed.run $DISTRIBUTED_ARGS \
    /home/kas/lx/code/MiniCPM-V/finetune/finetune.py \
    --model_name_or_path $MODEL \
    --llm_type $LLM_TYPE \
    --data_path $DATA \
    --bf16 True \
    --remove_unused_columns false \
    --num_train_epochs 1 \
    --report_to "tensorboard" \
    --output_dir /home/kas/lx/code/MiniCPM-V/finetune/output/3d-100-new/output_minicpmv2 \
    --logging_dir /home/kas/lx/code/MiniCPM-V/finetune/output/logging \
    --logging_strategy "steps" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 40 \
    --save_total_limit 100 \
    --learning_rate 1e-6 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.02 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --deepspeed /home/kas/lx/code/MiniCPM-V/finetune/ds_config_zero2_llava.json
```

The DeepSpeed config:

```json
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 100,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto"
  }
}
```

The training data format (the leading `<image>` tags were also swallowed by Markdown in the original post):

```json
[
  {
    "id": "text_norm_14526_4",
    "image": "/home/kas/lx/data/datasets/app_store_images/crop_norm_all/4598428_t0199ccee4527b00e55.jpg",
    "conversations": [
      {
        "role": "user",
        "content": "<image>\n帮我点击魅力偶像画画.\nAnswer:"
      },
      {
        "role": "assistant",
        "content": "(849,432)"
      }
    ]
  },
  {
    "id": "icon_norm_10306_1",
    "image": "/home/kas/lx/data/datasets/app_store_images/crop_norm_all/4589255_t0129b792cdb55df220.jpg",
    "conversations": [
      {
        "role": "user",
        "content": "<image>\n帮我找到ARROW_RIGHT图标.\n以下是图片中可参考图标的坐标: 'ARROW_RIGHT'(916,304),'STAR'(942,022),'CALL'(362,946),'ARROW_LEFT'(060,021),'STAR'(070,948),'UPLOAD'(217,948).\nAnswer:"
      },
      {
        "role": "assistant",
        "content": "(916,304)"
      }
    ]
  }
]
```

Image resolution in the dataset: 610 x 1344

Key library versions: deepspeed 0.14.2, torch 2.3.0, transformers 4.40.2

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

zhudongwork commented 3 months ago

How did you solve this?

Zmeo commented 3 months ago

> How did you solve this?

I removed the `<image>\n` from the inference prompt. Since I only need to insert the image at the very beginning, I later removed the `<image>\n` from both the training data and the inference prompt.

Zmeo commented 3 months ago

> How did you solve this?

To be precise, I mean `<image>\n` — the `<image>` tag in my previous reply was swallowed by Markdown.
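In other words, the fix is to keep the `<image>` placeholder usage consistent between the training data and the inference prompt. A small sketch of such a normalization step (this helper is hypothetical, not part of the repo):

```python
def normalize_question(question: str, use_image_tag: bool = True) -> str:
    """Ensure the prompt starts with exactly one '<image>\n' placeholder,
    or with none, matching whatever convention the fine-tuning data used."""
    stripped = question
    # Strip any existing leading placeholder so it is never doubled.
    if stripped.startswith("<image>"):
        stripped = stripped[len("<image>"):].lstrip("\n")
    return ("<image>\n" + stripped) if use_image_tag else stripped

# Prompts normalized either way, regardless of how they were written:
print(normalize_question("定位寻城记.\nAnswer:"))          # adds the placeholder
print(normalize_question("<image>\n定位寻城记.", False))   # removes it
```

Whichever convention is chosen, applying it uniformly at both training and inference time avoids the marker-count mismatch behind the `torch.hstack` error.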

lyc728 commented 3 months ago

> AutoPeftModelForCausalLM

Is this full-parameter fine-tuning?