deepcs233 / Visual-CoT

[Neurips'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Apache License 2.0
134 stars 7 forks source link

viscot_363k.json #1

Open lyc728 opened 7 months ago

lyc728 commented 7 months ago

你好,json中的from": "gpt","value": "[0.133, 0.532, 0.187, 0.553]"这个值是怎么得到的

deepcs233 commented 7 months ago

Hi! 请参考我们的论文:https://arxiv.org/abs/2403.16999 中的3.1节

lyc728 commented 7 months ago

看了下文章,没有确切的回复,麻烦解答下,看坐标并不是简单的进行归一化,像是对值进行一定缩放

deepcs233 commented 7 months ago

你好, 我们先通过一些方法得到基于原始图片像素值的bounding box。为了方便后续的训练,我们先将原始图片补全至正方形,同时将bounding box也做相同的映射。最后再将bounding box做归一化,即除以图片的边长。可以参考下面的代码

def get_bbox_str(bboxs, width, height):
    if len(bboxs) > 1:
        large_bbox = []
        large_bbox.append(min([x[0] for x in bboxs]))
        large_bbox.append(min([x[1] for x in bboxs]))
        large_bbox.append(max([x[2] for x in bboxs]))
        large_bbox.append(max([x[3] for x in bboxs]))
        bbox = large_bbox
    else:
        bbox = bboxs[0]
    if width > height:
        bbox[1] += (width - height) // 2
        bbox[3] += (width - height) // 2
        bbox = [x/width for x in bbox]
    else:
        bbox[0] += (height - width) // 2
        bbox[2] += (height - width) // 2
        bbox = [x/height for x in bbox]
    return '[%0.3f, %0.3f, %0.3f, %0.3f]' % (bbox[0], bbox[1], bbox[2], bbox[3])
LengSicong commented 7 months ago
image

what is the value after the image path? e.g., [198, 114, 240, 146]

Is it the bbox before the padding and normalization?

deepcs233 commented 7 months ago

Hi! @LengSicong It's the original bbox before preprocessing.

LengSicong commented 7 months ago

Noted with thanks!