OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. 接近GPT-4o表现的开源多模态对话模型
https://internvl.readthedocs.io/en/latest/
MIT License
5.95k stars 461 forks source link

自定义fine-tune InternVL-Chat-V1.2 on a custom dataset,代码有问题? #211

Closed yanzaaaasa closed 3 months ago

yanzaaaasa commented 5 months ago

按这个加载数据 { "sharegpt4v_instruct_gpt4-vision_cap100k": { "root": "playground/data/", "annotation": "playground/sharegpt4v_instruct_gpt4-vision_cap100k.jsonl", "data_augment": false, "repeat_time": 1, "length": 102025 } } 提供的标注文件,其中VG的数据集格式如下 {"id": "VG_100K_2/57", "image": "vg/VG_100K_2/57.jpg", "conversations": [{"from": "human", "value": "\nPlease provide a short description for this region: [0.04, 0.35, 0.27, 0.43]."}, {"from": "gpt", "value": "Red and white sign."}, {"from": "human", "value": "Please provide a short description for this region: [0.85, 0.54, 0.9, 0.62]."}, {"from": "gpt", "value": "Woman wearing a skirt."}, {"from": "human", "value": "Please provide a short description for this region: [0.63, 0.38, 0.7, 0.48]."}, {"from": "gpt", "value": "Window in the building."}, {"from": "human", "value": "Please provide a short description for this region: [0.5, 0.5, 0.69, 0.8]."}, {"from": "gpt", "value": "He is crossing the street."}, {"from": "human", "value": "Please provide a short description for this region: [0.54, 0.52, 0.65, 0.75]."}, {"from": "gpt", "value": "A man in a suit."}, {"from": "human", "value": "Please provide a short description for this region: [0.47, 0.55, 0.73, 0.65]."}, {"from": "gpt", "value": "A car driving on the street."}, {"from": "human", "value": "Please provide the bounding box coordinate of the region this sentence describes: two women walking in the sidewalk."}, {"from": "gpt", "value": "[0.28, 0.49, 0.39, 0.67]"}, {"from": "human", "value": "Please provide a short description for this region: [0.17, 0.6, 0.21, 0.67]."}, {"from": "gpt", "value": "A black short pole on the sidewalk."}]}

其中 坐标框和对应描述没有特殊token标记,没有按 xxx[588,499,725,789] 处理,从https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/internvl/train/internvl_chat_finetune.py 中的 train_dataset = build_datasets( data_args, tokenizer, tcs_loader, model, group_by_length=training_args.group_by_length, dynamic_image_size=data_args.dynamic_image_size, use_thumbnail=data_args.use_thumbnail, min_dynamic_patch=data_args.min_dynamic_patch, max_dynamic_patch=data_args.max_dynamic_patch, normalize_type=data_args.normalize_type) 读取数据后如下, \nPlease provide a short description for this region: [0.6, 0.22, 0.75, 0.61].<|im_end|> <|im_start|> assistant\nPlastic reusable water bottle.<|im_end|> <|im_start|> user\nPlease provide the bounding box coordinate of the region this sentence describes: chairs are wooden and black top.<|im_end|> <|im_start|> assistant\n[0.13, 0.68, 0.38, 0.84]<|im_end|> <|im_start|> user\nPlease provide the bounding box coordinate of the region this sentence describes: used plate with crumbs on it.<|im_end|> <|im_start|> assistant\n[0.04, 0.26, 0.35, 0.39]<|im_end|> <|im_start|> user\nPlease provide the bounding box coordinate of the region this sentence describes: childs artwork on sports bottle.<|im_end|> <|im_start|> assistant\n[0.6, 0.47, 0.73, 0.57]<|im_end|> <|im_start|> user\nPlease provide the bounding box coordinate of the region this sentence describes: a small box is on the counter.<|im_end|> <|im_start|> assistant\n[0.24, 0.4, 0.35, 0.47]<|im_end|> <|im_start|> user\nPlease provide the bounding box coordinate of the region this sentence describes: red lid on sports bottle.<|im_end|> <|im_start|> assistant\n[0.62, 0.22, 0.74, 0.34]<|im_end|> <|im_start|> user\nPlease provide the bounding box coordinate of the region this sentence describes: an empty plate with a fork.<|im_end|> <|im_start|> assistant\n[0.03, 0.24, 0.38, 0.4]<|im_end|> <|im_start|> user\nPlease provide a short description for this region: [0.42, 0.41, 0.59, 0.57].<|im_end|> <|im_start|> assistant\nRed apple has not been peeled.<|im_end|> <|im_start|> user\nPlease provide a short description for this region: [0.14, 0.29, 0.21, 0.39].<|im_end|> <|im_start|> assistant\nUsed fork on plate.<|im_end|> <|im_start|> user\nPlease provide the bounding box coordinate of the region this sentence describes: glass with milk on the table.<|im_end|> <|im_start|> assistant\n[0.31, 0.15, 0.41, 0.27]<|im_end|>'],没有特殊token标记,且坐标也是归一化后的。 这样微调有问题吧,请问一下咱们放出的这个数据标注文件是不有问题download the annotation files and place them in the playground/opensource/ folder.

另外从refcoco的验证代码来看,提供的prompt是Please provide the bounding box coordinate of the region this sentence describes: second brown banana from the right 且推理返回结果 giraffe in middle [[331, 160, 703, 751]] ,明显和微调数据集的标注有差异。请问一下直接按微调代码中读出的数据微调模型是不会有问题?

yanzaaaasa commented 5 months ago

ref 和 box 显示不出来,去掉特殊符号< 另外从refcoco的验证代码来看,提供的prompt是Please provide the bounding box coordinate of the region this sentence describes: ref brown bananas on far right ref 且推理返回结果 ref bowl of carrots ref box [[231, 336, 947, 1000]] box ,明显和微调数据集的标注有差异。请问一下直接按微调代码中读出的数据微调模型是不会有问题?

Weiyun1025 commented 3 months ago

您好,我们的区域感知数据的标准格式是:<ref>文本</ref><box>[[x1, y1, x2, y2]]</box>,不过为了提升模型的泛化能力,这这个格式并不是完全严格的

例如对于VG数据,其格式为:Please provide a short description for this region: <box>[[x1, y1, x2, y2]]</box>,如果您希望用Please provide a short description for this <ref>region</ref>: <box>[[x1, y1, x2, y2]]</box>的格式训练也是完全ok的,对最终性能影响不会特别大

qinb commented 2 months ago

ref 和 box 显示不出来,去掉特殊符号< 另外从refcoco的验证代码来看,提供的prompt是Please provide the bounding box coordinate of the region this sentence describes: ref brown bananas on far right ref 且推理返回结果 ref bowl of carrots ref box [[231, 336, 947, 1000]] box ,明显和微调数据集的标注有差异。请问一下直接按微调代码中读出的数据微调模型是不会有问题?

@czczup @Weiyun1025 @yanzaaaasa 我现在遇到类似的问题,我用的是的格式,结果模型训练完,微调Internvl2-1B模型,推理的时候,模型输出就没有了 我的ref是数字,导致两者就粘连到一起了,请问如何解决?