QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Apache License 2.0

How to use bbox in SFT / Bounding-box annotation in fine-tuning #105

Open · mokby opened this issue 2 months ago

mokby commented 2 months ago

While examining the SFT data format for Qwen2-VL, I noticed it differs considerably from Qwen-VL's format; in particular, no annotation example for bounding boxes is given. Also, does the 0-1000 coordinate normalization from Qwen-VL still apply in Qwen2-VL? An answer would be appreciated, thanks.

ShuaiBai623 commented 2 months ago

#9

mokby commented 1 month ago

> #9

That issue does not answer my question; it does not explain how to annotate bounding boxes in the SFT dataset.

kq-chen commented 1 month ago

The grounding SFT data can be organized in the following format, where special tokens such as <|object_ref_start|>, <|object_ref_end|>, <|box_start|>, and <|box_end|> correspond to the <ref>, </ref>, <box>, and </box> tokens used in the Qwen-VL paper. Apart from the renaming of these special tokens, all other details are consistent with the implementation in Qwen-VL, including the 0-1000 normalization of coordinates.

# Caption with Grounding
[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "text",
                "text": "Generate the caption with grounding:",
            },
        ],
    },
    {"role": "assistant", "content": "In the image,<|object_ref_start|>a woman<|object_ref_end|><|box_start|>(462,371),(709,785)<|box_end|> is sitting on the beach with<|object_ref_start|> her dog<|object_ref_end|><|box_start|>(246,418),(586,875)<|box_end|>. The dog is wearing<|object_ref_start|> a harness<|object_ref_end|><|box_start|>(369,531),(496,704)<|box_end|> and appears to be giving<|object_ref_start|> its paw<|object_ref_end|><|box_start|>(508,512),(588,587)<|box_end|> to the woman in a high-five gesture. The scene is set during sunset, with the sun casting a warm glow over the ocean and the sand."},
],
# Visual Grounding
[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "text",
                "text": "Output the bounding box of the person in the image.",
            },
        ],
    },
    {"role": "assistant", "content": "<|object_ref_start|>person<|object_ref_end|><|box_start|>(465,372),(710,789)<|box_end|><|im_end|>"},
],
# Grounded Captioning
[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "text",
                "text": "Describe the region <|box_start|>(462,371),(709,785)<|box_end|>.",
            },
        ],
    },
    {"role": "assistant", "content": "The region in the image shows a person sitting on the beach with their dog. The person is wearing a plaid shirt and dark pants, and they appear to be enjoying a moment with their dog. The dog is wearing a harness and is giving its paw to the person. The background features the ocean with waves gently crashing onto the shore. The lighting suggests it is either sunrise or sunset, creating a warm and serene atmosphere."},
],
# OCR with Grounding
[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://github.com/QwenLM/Qwen-VL/blob/master/assets/demo_highfive.jpg?raw=true",
            },
            {
                "type": "text",
                "text": "Output OCR Results with Grounding:",
            },
        ],
    },
    {"role": "assistant", "content": "<|object_ref_start|>击掌<|object_ref_end|><|quad_start|>(502,500),(584,500),(584,558),(502,558)<|quad_end|>"},
]
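
For readers asking about the 0-1000 normalization mentioned above: below is a minimal sketch of that convention as described (pixel coordinates scaled to a 0-1000 grid, following Qwen-VL). The helper names and the rounding choice are my own, not an official API.

import re

def box_to_token(x1, y1, x2, y2, width, height):
    # Scale pixel coordinates to the 0-1000 grid used in Qwen-VL-style grounding.
    nx1 = round(x1 / width * 1000)
    ny1 = round(y1 / height * 1000)
    nx2 = round(x2 / width * 1000)
    ny2 = round(y2 / height * 1000)
    return f"<|box_start|>({nx1},{ny1}),({nx2},{ny2})<|box_end|>"

def tokens_to_boxes(text, width, height):
    # Recover pixel-space boxes from every <|box_start|>...<|box_end|> span
    # in a model response, e.g. for evaluation or visualization.
    pattern = r"<\|box_start\|>\((\d+),(\d+)\),\((\d+),(\d+)\)<\|box_end\|>"
    return [
        (int(x1) * width / 1000, int(y1) * height / 1000,
         int(x2) * width / 1000, int(y2) * height / 1000)
        for x1, y1, x2, y2 in re.findall(pattern, text)
    ]

box_to_token would be used when building assistant targets like the examples above, and tokens_to_boxes to map model output back to pixel space.
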
mokby commented 1 month ago

It seems the format you provided is not a ShareGPT-format dataset and differs somewhat from the official examples. Is this a custom dataset format? Can LLaMA-Factory support it, or do I need to modify dataset_info.json in LLaMA-Factory?
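
On the LLaMA-Factory question, one possible adaptation is to flatten each conversation above into a ShareGPT-style record. The sketch below assumes LLaMA-Factory's multimodal sharegpt schema ("conversations"/"from"/"value", an "images" list, and an <image> placeholder); the function name is mine, and the exact fields should be verified against dataset_info.json before training.

def to_sharegpt(messages, image_path):
    # Flatten one Qwen2-VL-style message list into a ShareGPT-style record.
    conversations = []
    for msg in messages:
        if msg["role"] == "user":
            # Collapse the image + text parts into a single human turn;
            # the image position is marked with an <image> placeholder.
            text = "".join(p["text"] for p in msg["content"] if p.get("type") == "text")
            conversations.append({"from": "human", "value": "<image>" + text})
        else:
            conversations.append({"from": "gpt", "value": msg["content"]})
    return {"conversations": conversations, "images": [image_path]}
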

Guangming92 commented 1 month ago

Could we exchange contact details to discuss this? WeChat/QQ: 695914059

ShuaiBai623 commented 1 month ago

Thank you for your support of Qwen2-VL. The main reason for the poor grounding performance after fine-tuning is that 1D RoPE position embeddings, rather than M-RoPE, were used during training. We will provide a solution as soon as possible, and we apologize for the inconvenience caused.

mokby commented 1 month ago

Thanks for your work; we are looking forward to the fix.