OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone
Apache License 2.0

[BUG] y-coordinate offset after fine-tuning on coordinate data #203

Closed Zmeo closed 1 month ago

Zmeo commented 1 month ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior

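Starting from the original model, I first fine-tuned on coco-en-2-mini, and then fine-tuned that checkpoint on data containing coordinates (all coordinates are normalized to 0-999). The training data format is: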

{query}.\n{text augmentation info}.\n{icon augmentation info}.\nAnswer:

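An example follows. The user prompt reads "Please click the CAMERA icon to the left of 秋果果实", followed by "Below are the coordinates of reference text in the image: ..." and "Below are the coordinates of reference icons in the image: ...":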

{
    "id": "rec_tepiep_240948_1",
    "image": "/home/kas/lx/data/datasets/app_store_images/crop_norm_all/4646152_t015b78674744e9070d.jpg",
    "conversations": [
        {
            "role": "user",
            "content": "请点击秋果果实左边的CAMERA图标.\n以下是图片中可参考文本的坐标: '待收货'(611,383),'实名认证'(841,497),'待付款'(199,382),'更多功能'(185,437),'秋果果实'(177,175),'邀请好友'(675,498),'我的粉丝'(511,497),'秋果'(326,237),'元宇宙空间'(212,873),'我的钱包'(177,497),'一精选推荐一'(510,611),'静态商城'(509,872).\n以下是图片中可参考图标的坐标: 'CAMERA'(207,853),'CHECK'(612,357),'LOCATION'(341,472),'UPLOAD'(172,542),'CART'(824,177),'ARROW_RIGHT'(886,320).\nAnswer:"
        },
        {
            "role": "assistant",
            "content": "<pt>(207,853)</pt>"
        }
    ]
}

The corresponding image: 4646152_t015b78674744e9070d
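For readers who want to reproduce the data format, here is a minimal sketch of how such a sample prompt might be assembled. The issue only says that coordinates are normalized to 0-999, not how, so the scaling rule (pixel position divided by image size, times 1000, clamped to 999) is an assumption, and the helper names `normalize_point`, `format_refs`, and `build_prompt` are hypothetical.

```python
def normalize_point(x_px, y_px, width, height, scale=1000):
    """Map a pixel coordinate to an integer in 0..scale-1 (assumed scheme)."""
    nx = min(int(x_px / width * scale), scale - 1)
    ny = min(int(y_px / height * scale), scale - 1)
    return nx, ny


def format_refs(refs):
    """Render {'name': (x, y)} as 'name'(x,y),'name'(x,y)."""
    return ",".join(f"'{name}'({x},{y})" for name, (x, y) in refs.items())


def build_prompt(query, text_refs, icon_refs):
    """Assemble {query}.\\n{text info}.\\n{icon info}.\\nAnswer: as in the issue."""
    return (
        f"{query}.\n"
        f"以下是图片中可参考文本的坐标: {format_refs(text_refs)}.\n"
        f"以下是图片中可参考图标的坐标: {format_refs(icon_refs)}.\n"
        "Answer:"
    )


# Hypothetical 1080x2400 screenshot with a tap target at pixel (540, 1200).
print(normalize_point(540, 1200, 1080, 2400))  # (500, 500)
print(build_prompt("请点击LOCATION图标",
                   {"待收货": (611, 383)},
                   {"LOCATION": (341, 472)}))
```

If x and y were normalized against the wrong dimension, or against a resized copy of the image, the reference coordinates in the prompt would no longer line up with what the model sees, so this step is worth double-checking on the data side.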

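The custom dataset is about 55 MB. After fine-tuning, I ran inference with web_demo using the following prompt ("Please click the LOCATION icon", followed by the same reference lists):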

请点击LOCATION图标.\n以下是图片中可参考文本的坐标: '待收货'(611,383),'实名认证'(841,497),'待付款'(199,382),'更多功能'(185,437),'秋果果实'(177,175),'邀请好友'(675,498),'我的粉丝'(511,497),'秋果'(326,237),'元宇宙空间'(212,873),'我的钱包'(177,497),'一精选推荐一'(510,611),'静态商城'(509,872).\n以下是图片中可参考图标的坐标: 'CAMERA'(207,853),'CHECK'(612,357),'LOCATION'(341,472),'UPLOAD'(172,542),'CART'(824,177),'ARROW_RIGHT'(886,320).\nAnswer:

The answer was: <pt>(341,506)</pt>

Plotted on the image: 20240603145130

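As you can see, the y-coordinate is clearly offset from the icon augmentation info in the training data: 'LOCATION'(341,472) -> (341,506). This happens repeatedly. Compared with other problems, such as inaccurate coordinates or output that does not match the expected format, the y-coordinate offset stands out. For example, with the query "请点击秋果" ("Please click 秋果"):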

请点击秋果.\n以下是图片中可参考文本的坐标: '待收货'(611,383),'实名认证'(841,497),'待付款'(199,382),'更多功能'(185,437),'秋果果实'(177,175),'邀请好友'(675,498),'我的粉丝'(511,497),'秋果'(326,237),'元宇宙空间'(212,873),'我的钱包'(177,497),'一精选推荐一'(510,611),'静态商城'(509,872).\n以下是图片中可参考图标的坐标: 'CAMERA'(207,853),'CHECK'(612,357),'LOCATION'(341,472),'UPLOAD'(172,542),'CART'(824,177),'ARROW_RIGHT'(886,320).\nAnswer:

(326,190)

20240603145139
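To quantify the offsets reported above, a small script along these lines could parse the model answer and compare it with the reference coordinate listed in the prompt. This is only a sketch: the function names are hypothetical, and the prompt string is abbreviated to the two entries needed here.

```python
import re

POINT_RE = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")
REF_RE = re.compile(r"'([^']+)'\((\d+),(\d+)\)")


def parse_point(answer):
    """Extract the first (x,y) pair from an answer, with or without <pt> tags."""
    m = POINT_RE.search(answer)
    if m is None:
        raise ValueError(f"no coordinate found in {answer!r}")
    return int(m.group(1)), int(m.group(2))


def parse_refs(prompt):
    """Collect every 'name'(x,y) reference listed in the prompt."""
    return {name: (int(x), int(y)) for name, x, y in REF_RE.findall(prompt)}


def offset(prompt, target, answer):
    """Return (dx, dy): predicted point minus the reference for `target`."""
    rx, ry = parse_refs(prompt)[target]
    px, py = parse_point(answer)
    return px - rx, py - ry


# Abbreviated version of the prompt from the issue.
prompt = "'秋果'(326,237),'LOCATION'(341,472).\nAnswer:"
print(offset(prompt, "LOCATION", "<pt>(341,506)</pt>"))  # (0, 34)
print(offset(prompt, "秋果", "(326,190)"))                # (0, -47)
```

In both reported cases x matches exactly while y is off by +34 and -47 respectively, so the error is not a constant shift. Running this over a larger evaluation set would show whether the y error correlates with position on the screen or with the phrasing of the query.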

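I have seen this in other inference runs as well (on both trained and untrained samples). Because the y-coordinate offset is so pronounced, I would like to ask whether there is an unknown bug, or whether my training parameters are wrong. The coco fine-tuning was run as follows:

CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type minicpm-v-v2-chat \
    --dataset coco-en-2-mini

CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir output/minicpm-v-v2-chat/vx-xxx/checkpoint-xxx \
    --merge_lora true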

The training parameters for the custom data were:

#!/bin/bash

GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001
LLM_TYPE="minicpm" # if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm

MODEL="/home/kas/lx/code/MiniCPM-V/finetune/output-coco/v0-20240531-183954/minicpm-v-v2/v0-20240531-191653/checkpoint-2506-merged"

# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="/home/kas/lx/code/MiniCPM-V/finetune/data/04_19_all_order_upload.json"

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

/home/kas/.conda/envs/vllm/bin/python -m torch.distributed.run $DISTRIBUTED_ARGS /home/kas/lx/code/MiniCPM-V/finetune/finetune.py \
    --model_name_or_path $MODEL \
    --llm_type $LLM_TYPE \
    --data_path $DATA \
    --bf16 True \
    --remove_unused_columns false \
    --num_train_epochs 1 \
    --report_to "tensorboard" \
    --output_dir /home/kas/lx/code/MiniCPM-V/finetune/output/04_19_all_order_upload_based_coco \
    --logging_dir /home/kas/lx/code/MiniCPM-V/finetune/output/logging \
    --logging_strategy "steps" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 100 \
    --learning_rate 1e-6 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.02 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --deepspeed /home/kas/lx/code/MiniCPM-V/finetune/ds_config_zero2_llava.json
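As a quick sanity check on these settings, the effective (global) batch size per optimizer step can be worked out from the values in the script. The sketch below assumes ordinary data-parallel training, where every GPU processes `per_device_train_batch_size` samples per micro-step; the function name is hypothetical.

```python
def effective_batch_size(gpus_per_node, nnodes, per_device_batch, grad_accum_steps):
    """Samples consumed per optimizer step under plain data parallelism."""
    return gpus_per_node * nnodes * per_device_batch * grad_accum_steps


# Values taken from the script above: 8 GPUs, 1 node, batch size 1, 8 accumulation steps.
print(effective_batch_size(8, 1, 1, 8))  # 64
```

With 64 samples per step, a learning rate of 1e-6, and a single epoch, the total amount of parameter movement may be fairly small, which is one thing to keep in mind when judging whether the run was long enough to learn the coordinate task.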

The DeepSpeed config was:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 100,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto"
    }
}
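Several fields in this config are set to "auto". As far as I understand the Hugging Face Trainer's DeepSpeed integration, such values are filled in from the training arguments at launch, so the command-line flags are what actually decide them. The sketch below simply lists which fields are "auto"; the `auto_fields` helper is hypothetical and the config is pasted in compact form.

```python
import json

DS_CONFIG = """
{"fp16": {"enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000,
          "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1},
 "bf16": {"enabled": "auto"},
 "gradient_accumulation_steps": "auto",
 "gradient_clipping": "auto",
 "steps_per_print": 100,
 "train_batch_size": "auto",
 "train_micro_batch_size_per_gpu": "auto",
 "wall_clock_breakdown": false,
 "zero_optimization": {"stage": 2, "overlap_comm": true,
                       "contiguous_gradients": true, "sub_group_size": 1e9,
                       "reduce_bucket_size": "auto"}}
"""


def auto_fields(node, prefix=""):
    """Return dotted paths of every value equal to the string "auto"."""
    found = []
    for key, value in node.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            found.extend(auto_fields(value, path + "."))
        elif value == "auto":
            found.append(path)
    return found


for path in auto_fields(json.loads(DS_CONFIG)):
    print(path)
```

Here that means, for example, that bf16 is enabled through `--bf16 True` and the accumulation steps come from `--gradient_accumulation_steps 8`, not from the JSON file itself.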

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

Cuiunbo commented 1 month ago

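👋 Hi, we're glad you're following our model and trying out fine-tuning. We tried to reproduce the case given in your issue, and with direct inference we found no coordinate deviation. @Zmeo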

Zmeo commented 1 month ago

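@Cuiunbo Hello, the model I am using is MiniCPM-V 2.0. When I tried your prompt in the online demo, it did not return a clear location: xiezuo20240605-120451

In my observation, inference directly on training data is relatively accurate, but natural-language descriptions that fall outside the training data produce some offset.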

Cuiunbo commented 1 month ago

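Hello. In general, if the SFT training data is too homogeneous, the model's ability to follow rich natural-language descriptions degrades. If your use case requires rich natural-language interaction, you could try using a language model to rephrase your data for greater diversity.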
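To illustrate the idea of diversifying the instructions, here is a deliberately simple sketch. It is not the LLM-based rewriting suggested above: it swaps in a handful of hand-written templates as a stand-in, and both the template list and the `diversify` helper are hypothetical. It assumes samples follow the `{query}.\n...` layout shown in the issue.

```python
import random

# Hypothetical phrasings; a language model could generate far more varied ones.
TEMPLATES = [
    "请点击{target}",
    "点一下{target}",
    "帮我点开{target}",
    "找到{target}并点击",
]


def diversify(sample, target, rng):
    """Return a copy of `sample` whose query line uses a randomly chosen template."""
    user, assistant = sample["conversations"]
    # Everything after the first ".\n" is the reference info and is kept as-is.
    _, sep, rest = user["content"].partition(".\n")
    new_query = rng.choice(TEMPLATES).format(target=target)
    new_user = {**user, "content": new_query + sep + rest}
    return {**sample, "conversations": [new_user, assistant]}


sample = {
    "id": "demo",
    "conversations": [
        {"role": "user",
         "content": "请点击LOCATION图标.\n以下是图片中可参考图标的坐标: 'LOCATION'(341,472).\nAnswer:"},
        {"role": "assistant", "content": "<pt>(341,472)</pt>"},
    ],
}
print(diversify(sample, "LOCATION图标", random.Random(0))["conversations"][0]["content"])
```

Only the query line changes; the reference coordinates and the target answer stay identical, so the label remains valid for every rephrased variant.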