[BUG] y-coordinate offset after fine-tuning on coordinate data

Zmeo commented 1 month ago

Is there an existing issue / discussion for this?

[X] I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

[X] I have searched FAQ

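Current Behavior

Starting from the original model, I first fine-tuned on coco-en-2-mini, and then fine-tuned that checkpoint on data containing coordinates (all coordinates are normalized to 0-999). The training data format is:

{query}.\n{text augmentation info}.\n{icon augmentation info}.\nAnswer:

An example: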

{
    "id": "rec_tepiep_240948_1",
    "image": "/home/kas/lx/data/datasets/app_store_images/crop_norm_all/4646152_t015b78674744e9070d.jpg",
    "conversations": [
        {
            "role": "user",
            "content": "请点击秋果果实左边的CAMERA图标.\n以下是图片中可参考文本的坐标: '待收货'(611,383),'实名认证'(841,497),'待付款'(199,382),'更多功能'(185,437),'秋果果实'(177,175),'邀请好友'(675,498),'我的粉丝'(511,497),'秋果'(326,237),'元宇宙空间'(212,873),'我的钱包'(177,497),'一精选推荐一'(510,611),'静态商城'(509,872).\n以下是图片中可参考图标的坐标: 'CAMERA'(207,853),'CHECK'(612,357),'LOCATION'(341,472),'UPLOAD'(172,542),'CART'(824,177),'ARROW_RIGHT'(886,320).\nAnswer:"
        },
        {
            "role": "assistant",
            "content": "<pt>(207,853)</pt>"
        }
    ]
}
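As a minimal sketch, a sample in the format above could be assembled like this. The helper names (`format_points`, `build_prompt`, `build_record`) are hypothetical and not part of the MiniCPM-V repository; only the prompt layout and the `<pt>(x,y)</pt>` answer format come from the example.

```python
def format_points(points):
    """Render {'label': (x, y)} as "'label'(x,y),..." in insertion order."""
    return ",".join(f"'{name}'({x},{y})" for name, (x, y) in points.items())


def build_prompt(query, texts, icons):
    """Assemble the user turn: query, text hints, icon hints, then 'Answer:'."""
    return (
        f"{query}.\n"
        f"以下是图片中可参考文本的坐标: {format_points(texts)}.\n"
        f"以下是图片中可参考图标的坐标: {format_points(icons)}.\n"
        "Answer:"
    )


def build_record(record_id, image, query, texts, icons, target):
    """Build one training sample in the conversation format shown above."""
    x, y = target
    return {
        "id": record_id,
        "image": image,
        "conversations": [
            {"role": "user", "content": build_prompt(query, texts, icons)},
            {"role": "assistant", "content": f"<pt>({x},{y})</pt>"},
        ],
    }
```

Building the prompt from dictionaries keeps the hint coordinates and the answer coordinate in one place, so a mismatch between the two is easier to spot.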

The corresponding image: 4646152_t015b78674744e9070d
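The issue only states that coordinates are normalized to 0-999; the exact mapping is not given. The sketch below shows one common convention (1000 equal bins per axis, clamped to 999) as an assumption. The function names and the 1080x2340 image size in the usage lines are hypothetical.

```python
def normalize_point(point, width, height, scale=1000):
    """Map a pixel coordinate to the integer range 0..scale-1 (assumed convention)."""
    x, y = point
    nx = min(scale - 1, x * scale // width)
    ny = min(scale - 1, y * scale // height)
    return nx, ny


def denormalize_point(point, width, height, scale=1000):
    """Map a normalized coordinate back to pixels (lower edge of its bin)."""
    x, y = point
    return x * width // scale, y * height // scale


# Hypothetical screenshot size, for illustration only.
print(normalize_point((540, 1170), 1080, 2340))    # (500, 500)
print(denormalize_point((500, 500), 1080, 2340))   # (540, 1170)
```

The round trip is not exact in general because each normalized value covers a small range of pixels. Whatever convention is used, it should be the same for the hint coordinates, the training targets, and the code that draws the predicted point on the image.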

The custom dataset is about 55 MB. After fine-tuning, I ran inference with web_demo using the following prompt:

请点击LOCATION图标.\n以下是图片中可参考文本的坐标: '待收货'(611,383),'实名认证'(841,497),'待付款'(199,382),'更多功能'(185,437),'秋果果实'(177,175),'邀请好友'(675,498),'我的粉丝'(511,497),'秋果'(326,237),'元宇宙空间'(212,873),'我的钱包'(177,497),'一精选推荐一'(510,611),'静态商城'(509,872).\n以下是图片中可参考图标的坐标: 'CAMERA'(207,853),'CHECK'(612,357),'LOCATION'(341,472),'UPLOAD'(172,542),'CART'(824,177),'ARROW_RIGHT'(886,320).\nAnswer:

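The answer was: <pt>(341,506)</pt>

It renders as follows: 20240603145130

As you can see, the y-coordinate deviates noticeably from the icon augmentation info used in the training data: 'LOCATION'(341,472) -> (341,506). This happens repeatedly. Compared with other problems, such as inaccurate coordinates or output that does not match the expected format, the y-coordinate offset stands out. For example: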

请点击秋果.\n以下是图片中可参考文本的坐标: '待收货'(611,383),'实名认证'(841,497),'待付款'(199,382),'更多功能'(185,437),'秋果果实'(177,175),'邀请好友'(675,498),'我的粉丝'(511,497),'秋果'(326,237),'元宇宙空间'(212,873),'我的钱包'(177,497),'一精选推荐一'(510,611),'静态商城'(509,872).\n以下是图片中可参考图标的坐标: 'CAMERA'(207,853),'CHECK'(612,357),'LOCATION'(341,472),'UPLOAD'(172,542),'CART'(824,177),'ARROW_RIGHT'(886,320).\nAnswer:
(326,190)

20240603145139
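To quantify the offset across many samples rather than by eye, the predicted point can be compared with the hint coordinate of the queried label. The following is a rough sketch; the function names are hypothetical and the prompt is shortened to the two labels used in the examples above.

```python
import re

# Matches hints such as 'LOCATION'(341,472) inside the prompt.
HINT_PATTERN = re.compile(r"'([^']+)'\((\d+),(\d+)\)")
# Matches the first "(x,y)" pair in the model answer, with or without <pt> tags.
POINT_PATTERN = re.compile(r"\((\d+),\s*(\d+)\)")


def reference_points(prompt):
    """Collect every label -> (x, y) hint listed in the prompt."""
    return {name: (int(x), int(y)) for name, x, y in HINT_PATTERN.findall(prompt)}


def offset(prompt, label, answer):
    """Return (dx, dy) between the predicted point and the hint for `label`."""
    ref_x, ref_y = reference_points(prompt)[label]
    match = POINT_PATTERN.search(answer)
    if match is None:
        raise ValueError(f"no coordinate found in answer: {answer!r}")
    return int(match.group(1)) - ref_x, int(match.group(2)) - ref_y


prompt = "以下是图片中可参考文本的坐标: '秋果'(326,237).\n以下是图片中可参考图标的坐标: 'LOCATION'(341,472).\nAnswer:"
print(offset(prompt, "LOCATION", "<pt>(341,506)</pt>"))  # (0, 34)
print(offset(prompt, "秋果", "(326,190)"))                # (0, -47)
```

For the two cases reported here, x matches exactly while y is off by +34 and -47, so the offset is not a constant shift in one direction.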

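I see this in other inference runs as well, on both trained and untrained samples. Because the y-coordinate offset is so pronounced, I would like to ask whether there is an unknown bug, or whether my training parameters are wrong.

The coco fine-tuning was run as follows:

CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type minicpm-v-v2-chat \
    --dataset coco-en-2-mini \

CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir output/minicpm-v-v2-chat/vx-xxx/checkpoint-xxx \
    --merge_lora true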

The training parameters for the custom data are:

#!/bin/bash

GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001
LLM_TYPE="minicpm" # if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm

MODEL="/home/kas/lx/code/MiniCPM-V/finetune/output-coco/v0-20240531-183954/minicpm-v-v2/v0-20240531-191653/checkpoint-2506-merged"

# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="/home/kas/lx/code/MiniCPM-V/finetune/data/04_19_all_order_upload.json"

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

/home/kas/.conda/envs/vllm/bin/python -m torch.distributed.run $DISTRIBUTED_ARGS /home/kas/lx/code/MiniCPM-V/finetune/finetune.py \
    --model_name_or_path $MODEL \
    --llm_type $LLM_TYPE \
    --data_path $DATA \
    --bf16 True \
    --remove_unused_columns false \
    --num_train_epochs 1 \
    --report_to "tensorboard" \
    --output_dir /home/kas/lx/code/MiniCPM-V/finetune/output/04_19_all_order_upload_based_coco \
    --logging_dir /home/kas/lx/code/MiniCPM-V/finetune/output/logging \
    --logging_strategy "steps" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 100 \
    --learning_rate 1e-6 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.02 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --deepspeed /home/kas/lx/code/MiniCPM-V/finetune/ds_config_zero2_llava.json

The DeepSpeed config is:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 100,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto"
    }
}
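Since the batch-size fields in the config are set to "auto", they are filled in from the command-line arguments. As a quick sanity check, the global batch size per optimizer step is the product of the number of processes, the per-device batch size, and the gradient accumulation steps. The function name below is hypothetical; the numbers come from the script above.

```python
def effective_batch_size(gpus_per_node, nnodes, per_device_batch, grad_accum):
    """Samples consumed per optimizer step across all processes."""
    world_size = gpus_per_node * nnodes
    return world_size * per_device_batch * grad_accum


# GPUS_PER_NODE=8, NNODES=1, --per_device_train_batch_size 1,
# --gradient_accumulation_steps 8
print(effective_batch_size(8, 1, 1, 8))  # 64
```

With a global batch of 64, a single epoch, and a learning rate of 1e-6, the number of optimizer steps depends on how many samples the 55 MB file contains, which the issue does not state.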

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

Cuiunbo commented 1 month ago

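👋 Hi, thanks for your interest in our model and for trying to fine-tune it. We tried to reproduce the case from your issue and, with direct inference, found no coordinate offset. @Zmeo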

Zmeo commented 1 month ago

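@Cuiunbo Hello, the model I am using is MiniCPM-V 2.0. When I tried your prompt in the online demo, it did not return a definite location: xiezuo20240605-120451

In my observation, inference on prompts taken directly from the training data is relatively accurate, but natural-language descriptions outside the training data show some offset.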

Cuiunbo commented 1 month ago

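Hello. In general, if the SFT training data is too uniform, the model's ability to follow varied natural-language descriptions degrades. If your use case requires rich natural-language interaction, you could try using a language model to rewrite your data for more diversity.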
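One possible shape for such an augmentation step is sketched below. It rewrites only the query part of each prompt and leaves the coordinate hints and the answer untouched. All names here are hypothetical, and `template_rephrase` is a simple template-based stand-in for the language model call suggested above, not an actual LLM integration.

```python
import copy
import random

# Illustrative phrasings; a real setup would obtain these from a language model.
QUERY_TEMPLATES = [
    "请点击{target}",
    "点一下{target}",
    "帮我找到{target}并点击",
    "Tap {target}",
]


def template_rephrase(target, rng=random):
    """Stand-in for an LLM rewrite: pick a random phrasing for the query."""
    return rng.choice(QUERY_TEMPLATES).format(target=target)


def diversify(record, target, rephrase, n_variants=3):
    """Return copies of `record` whose query is rephrased; hints and answer are kept."""
    content = record["conversations"][0]["content"]
    # The query is everything before the first ".\n"; the rest holds the hints.
    _, sep, hints = content.partition(".\n")
    variants = []
    for i in range(n_variants):
        new_record = copy.deepcopy(record)
        new_record["id"] = f"{record['id']}_aug{i}"
        new_record["conversations"][0]["content"] = rephrase(target) + sep + hints
        variants.append(new_record)
    return variants
```

Calling `diversify(record, "LOCATION图标", template_rephrase)` yields three new samples with fresh ids. Keeping the hints and the `<pt>` answer byte-for-byte identical means the augmentation changes only the wording the model has to follow, not the coordinates it has to learn.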

OpenBMB / MiniCPM-V