mokby opened 2 months ago
This issue does not solve my problem; it does not mention how to generate bounding-box annotations for the SFT dataset.
The grounding SFT data can be organized in the following format, where special tokens like `<|object_ref_start|>`, `<|object_ref_end|>`, `<|box_start|>`, `<|box_end|>`, etc. correspond to the `<ref>`, `<box>`, and `<quad>` markup used in Qwen-VL.
```python
# Caption with Grounding
[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "text",
                "text": "Generate the caption with grounding:",
            },
        ],
    },
    {"role": "assistant", "content": "In the image,<|object_ref_start|>a woman<|object_ref_end|><|box_start|>(462,371),(709,785)<|box_end|> is sitting on the beach with<|object_ref_start|> her dog<|object_ref_end|><|box_start|>(246,418),(586,875)<|box_end|>. The dog is wearing<|object_ref_start|> a harness<|object_ref_end|><|box_start|>(369,531),(496,704)<|box_end|> and appears to be giving<|object_ref_start|> its paw<|object_ref_end|><|box_start|>(508,512),(588,587)<|box_end|> to the woman in a high-five gesture. The scene is set during sunset, with the sun casting a warm glow over the ocean and the sand."},
]

# Visual Grounding
[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "text",
                "text": "Output the bounding box of the person in the image.",
            },
        ],
    },
    {"role": "assistant", "content": "<|object_ref_start|>person<|object_ref_end|><|box_start|>(465,372),(710,789)<|box_end|><|im_end|>"},
]

# Grounded Captioning
[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "text",
                "text": "Describe the region <|box_start|>(462,371),(709,785)<|box_end|>.",
            },
        ],
    },
    {"role": "assistant", "content": "The region in the image shows a person sitting on the beach with their dog. The person is wearing a plaid shirt and dark pants, and they appear to be enjoying a moment with their dog. The dog is wearing a harness and is giving its paw to the person. The background features the ocean with waves gently crashing onto the shore. The lighting suggests it is either sunrise or sunset, creating a warm and serene atmosphere."},
]

# OCR with Grounding
[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://github.com/QwenLM/Qwen-VL/blob/master/assets/demo_highfive.jpg?raw=true",
            },
            {
                "type": "text",
                "text": "Output OCR Results with Grounding:",
            },
        ],
    },
    {"role": "assistant", "content": "<|object_ref_start|>击掌<|object_ref_end|><|quad_start|>(502,500),(584,500),(584,558),(502,558)<|quad_end|>"},
]
```
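A minimal sketch of how such a conversation can be rendered into training inputs, assuming the transformers Qwen2-VL integration and the `qwen_vl_utils` helper from the official model card (the checkpoint name is only an example):

```python
# Sketch: render one grounding conversation into model inputs for SFT.
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Output the bounding box of the person in the image."},
        ],
    },
    {
        "role": "assistant",
        "content": "<|object_ref_start|>person<|object_ref_end|><|box_start|>(465,372),(710,789)<|box_end|>",
    },
]

# Render the chat template without appending a new generation prompt,
# since the assistant turn is the supervision target.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt")
```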
It seems that the format you provided is not the official ShareGPT format and differs somewhat from what LLaMA-Factory expects. Is this a custom dataset format, and can LLaMA-Factory support it? Do I need to modify dataset_info.json in LLaMA-Factory?
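For context, the ShareGPT entries I know how to register look roughly like the sketch below (a hypothetical example following LLaMA-Factory's documented dataset_info.json schema; the dataset name and file name are placeholders), and it is unclear to me how the grounding special tokens fit in:

```json
{
  "qwen2vl_grounding": {
    "file_name": "grounding_sharegpt.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations",
      "images": "images"
    },
    "tags": {
      "role_tag": "from",
      "content_tag": "value",
      "user_tag": "human",
      "assistant_tag": "gpt"
    }
  }
}
```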
Could you share a contact so we can discuss? WeChat/QQ: 695914059
Thank you for your support of Qwen2-VL. The main reason for the poor grounding performance after fine-tuning is that the 1D RoPE position embedding, rather than M-RoPE, was used during training. We will provide a fix as soon as possible, and we apologize for the inconvenience caused.
Thanks for your work; we are looking forward to the fix.
Looking at the SFT data format of Qwen2-VL, I noticed it differs considerably from Qwen-VL's format; in particular, no bounding-box annotation example is given. Also, does the 0-1000 coordinate normalization from Qwen-VL still apply in Qwen2-VL? Please answer, thanks.
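For clarity, this is the Qwen-VL-style normalization I am asking about, as a minimal sketch (assuming boxes start as absolute pixel coordinates; the image size and box values below are made up for illustration):

```python
# Qwen-VL convention: map pixel boxes onto a 0-1000 coordinate grid.
# Whether Qwen2-VL still expects this normalization is the open question.
def normalize_box(box, img_w, img_h, scale=1000):
    """Map an (x1, y1, x2, y2) pixel box onto a 0-`scale` grid."""
    x1, y1, x2, y2 = box
    return (
        round(x1 / img_w * scale),
        round(y1 / img_h * scale),
        round(x2 / img_w * scale),
        round(y2 / img_h * scale),
    )

# Hypothetical example: a 1280x960 image, box at pixels (600, 360)-(900, 750).
print(normalize_box((600, 360, 900, 750), 1280, 960))  # -> (469, 375, 703, 781)
```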