Meituan-AutoML / Lenna


How to reproduce the RefCOCO results #10

Closed liujunzhuo closed 6 months ago

liujunzhuo commented 7 months ago

I'm currently diving into your fascinating work, but I'm running into a bit of trouble while trying to replicate the results on the RefCOCO dataset using the provided model checkpoint. My tests are showing an accuracy@0.5 of 0.75 on the RefCOCO val set.

Environment

Approach

I used the mdetr-provided COCO-format annotations for RefCOCO, which align with MM-Grounding-DINO. The evaluation organizes the annotations into a list and runs a simple loop: in each iteration I replace g_dino_caption and image_path with the corresponding RefCOCO elements and run inference, and I use an evaluator similar to MM-Grounding-DINO's to save and compute the results (a rough sketch of this loop is shown below).
I also tried both prompt templates mentioned in the paper, 'Please detect the {g_dino_caption} in this image.' and 'What is {g_dino_caption} in this image? Please output object location.', and the results remain the same.
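
For clarity, here is a rough sketch of that loop. The annotation path, the 'caption' field, and the run_inference helper are placeholders for my own wrapper, not Lenna's actual API:

from pycocotools.coco import COCO


def run_inference(prompt: str, image_path: str):
    # Placeholder: wrap Lenna's demo/inference code here and return the
    # predicted data samples (boxes + scores) for this prompt.
    raise NotImplementedError


coco = COCO('refcoco_val_mdetr.json')          # mdetr-style annotations (path assumed)
results = []
for img_id in coco.getImgIds():
    img_info = coco.loadImgs(img_id)[0]
    g_dino_caption = img_info['caption']       # referring expression (field name assumed)
    image_path = img_info['file_name']
    prompt = f'Please detect the {g_dino_caption} in this image.'
    outputs = run_inference(prompt, image_path)
    results.append((img_id, outputs))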

Could you please confirm whether the provided checkpoint is the final version for RefCOCO? Additionally, any insights or suggestions on potential issues in my evaluation approach would be highly appreciated.

Thank you for your assistance and for providing the resources for replication!

weifei7 commented 7 months ago

Thanks for your attention! Yes, the checkpoint corresponds to the paper's results. Have you set box_threshold to a high value such as 0.7? In our experiments we set it to 0.3, which keeps roughly one box per image; this value follows GroundingDINO's inference setting. Also, have you been able to get the correct result through chat.py? It should match the result we show in the paper. If this doesn't solve your problem, please feel free to leave a message here and we'll get back to you as soon as we see it!
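
For reference, the filtering step described above amounts to something like the following sketch (variable names are illustrative, not the repository's exact code):

import torch


def filter_by_score(pred_bboxes: torch.Tensor,
                    pred_scores: torch.Tensor,
                    box_threshold: float = 0.3):
    # Keep only boxes whose confidence exceeds box_threshold; at 0.3 this
    # usually leaves about one box per image in this setting.
    keep = pred_scores > box_threshold
    return pred_bboxes[keep], pred_scores[keep]


# Illustrative values: only the first box survives the 0.3 threshold.
boxes = torch.tensor([[39.1, 49.1, 256.5, 451.4],
                      [386.4, 222.0, 640.0, 452.6]])
scores = torch.tensor([0.79, 0.06])
boxes_filt, scores_filt = filter_by_score(boxes, scores)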

liujunzhuo commented 7 months ago

Thank you for your quick response! I set the threshold to 0, and in the RefCOCO validation set only 118/10834 samples have scores below 0.3, which should not significantly affect the results. The previous number was Top-1 accuracy; the complete results are as follows:

Precision @ 1    Precision @ 5    Precision @ 10
0.752353701      0.972217094      0.985231678

I also tried computing accuracy over all boxes with box_score > 0.3 (counting a hit if any of them matches the ground truth), and the result is 0.87.

Here is my evaluator, using 0.3 as the threshold. I call evaluator.process(outputs, data["image_id"]) to save the results at the end of each iteration, and finally use compute_metrics to calculate the overall results:

from typing import Optional, Sequence

import numpy as np
import mmdet.evaluation  # provides bbox_overlaps (mmdet 3.x)
from pycocotools.coco import COCO


class RefEvalThr:
    """Referring-expression evaluator: a sample counts as a hit when any box
    kept by the 0.3 score threshold reaches IoU >= iou_thrs with the single
    ground-truth box."""

    def __init__(self,
                 ann_file: Optional[str] = None,
                 metric: str = 'bbox',
                 topk=(1, 5, 10),
                 iou_thrs: float = 0.5,
                 save_dir: Optional[str] = None,
                 **kwargs) -> None:
        # super().__init__(**kwargs)
        self.results = []
        self.metric = metric
        self.topk = topk
        self.iou_thrs = iou_thrs
        self.save_dir = save_dir

        # mdetr-style COCO annotations: one annotation (the referred object) per image.
        self.coco = COCO(ann_file)

    def process(self, data_samples: Sequence, img_id) -> None:
        # Store the predicted boxes and scores for one image; called once per iteration.
        for data_sample in data_samples:
            result = dict()
            pred = data_sample.pred_instances
            result['img_id'] = img_id
            result['bboxes'] = pred.bboxes.cpu().numpy().copy()
            result['scores'] = pred.scores.cpu().numpy().copy()
            self.results.append(result)

    def compute_metrics(self, results: list):
        if self.save_dir is not None:
            np.save(self.save_dir, results)
        dataset2score = {k: 0.0 for k in self.topk}
        dataset2count = 0
        count_low_score = 0

        for result in results:
            img_id = result['img_id']
            ann_ids = self.coco.getAnnIds(imgIds=img_id)
            assert len(ann_ids) == 1  # exactly one referred object per image
            img_info = self.coco.loadImgs(img_id)[0]
            target = self.coco.loadAnns(ann_ids[0])

            # Convert the ground-truth box from COCO xywh to xyxy.
            target_bbox = target[0]['bbox']
            converted_bbox = [
                target_bbox[0],
                target_bbox[1],
                target_bbox[2] + target_bbox[0],
                target_bbox[3] + target_bbox[1],
            ]
            iou = mmdet.evaluation.bbox_overlaps(
                result['bboxes'],
                np.array(converted_bbox).reshape(-1, 4)).reshape(-1)
            # iou = torchvision.ops.box_iou(torch.as_tensor(result['bboxes']),
            #                               torch.as_tensor(converted_bbox).reshape(-1, 4)).numpy().reshape(-1)
            # giou = torchvision.ops.generalized_box_iou(torch.as_tensor(result['bboxes']),
            #                               torch.as_tensor(converted_bbox).reshape(-1, 4)).numpy().reshape(-1)

            # Keep boxes above the 0.3 score threshold and count a hit if any of
            # them reaches the IoU threshold. Only the k=1 bucket is tallied here.
            scores = result['scores']
            filtered_iou = iou[scores >= 0.3]
            if len(filtered_iou) > 0 and max(filtered_iou) >= self.iou_thrs:
                dataset2score[1] += 1.0
            dataset2count += 1.0
            # if result['scores'][0] < 0.3:
            #     count_low_score += 1
        print(f"count positive: {dataset2count}")
        # print(f"count low_score: {count_low_score}")

        for k in self.topk:
            try:
                dataset2score[k] /= dataset2count
            except ZeroDivisionError as e:
                print(e)
        print(dataset2score)
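
As a quick, self-contained sanity check of the hit-counting step above (the box values are illustrative only):

import numpy as np
import mmdet.evaluation

gt_xywh = [39.0, 49.0, 217.0, 402.0]                 # COCO-format ground truth (illustrative)
gt_xyxy = np.array([[gt_xywh[0], gt_xywh[1],
                     gt_xywh[0] + gt_xywh[2],
                     gt_xywh[1] + gt_xywh[3]]])
pred = np.array([[39.1, 49.1, 256.5, 451.4]])        # one box above the score threshold
iou = mmdet.evaluation.bbox_overlaps(pred, gt_xyxy).reshape(-1)
print(iou.max() >= 0.5)                              # True -> counted as a hit at IoU 0.5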

Here are some examples of wrong predictions:

[attached image: examples of wrong predictions]

I wonder if my evaluation approach is flawed, or if there might be inconsistencies in the mmdet version, or if there are issues with hyperparameters, randomness, or other factors. Your insights on these matters would be greatly appreciated. Thank you once again for your swift response!
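
For reference, here is how I record the parts of the environment most likely to differ (assuming torch, mmcv, and mmdet are the relevant packages for this check):

import torch
import mmcv
import mmdet

print('torch:', torch.__version__, '| CUDA:', torch.version.cuda)
print('mmcv:', mmcv.__version__)
print('mmdet:', mmdet.__version__)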

weifei7 commented 7 months ago

Hi, here is a sample:

caption: chair that the boy sit on
prompt: Please detect the chair that the boy sit on in this image.
image: COCO_train2014_000000083508.jpg

res:

[attached image: visualization of the predicted box]

pred_bboxes_filt: [[39.1294, 49.1208, 217.3541, 402.2947]]
pred_scores_filt: [0.7888]

Before filtering by the threshold (0.3), the top-20 pred_bboxes are:

[[ 39.1294,  49.1208, 256.4835, 451.4155],
[386.3545, 222.0057, 640.0000, 452.6325],
[502.6339, 106.8788, 639.1548, 243.7754],
[ 33.5774,  47.3136, 640.0000, 453.1940],
[179.8883,  94.7124, 412.8659, 451.1689],
[378.2805, 106.8258, 640.0000, 452.1395],
[250.4331,  96.8331, 381.4485, 274.8452],
[372.9280, 106.1171, 640.0000, 453.8893],
[374.7803, 106.9710, 640.0000, 452.8337],
[249.4045,  95.0679, 640.0000, 260.4560],
[ 36.7581,  49.5404, 417.0672, 454.6150],
[367.1216, 105.4114, 640.0000, 456.2697],
[250.8099,  91.4537, 640.0000, 262.8784],
[165.9880,  97.8375, 640.0000, 454.9725],
[170.9866,  97.8001, 640.0000, 456.0018],
[458.9583, 256.8557, 638.2808, 327.3876],
[248.9486,  95.4559, 640.0000, 259.1696],
[ 36.0579,  45.2150, 640.0000, 304.8564],
[320.7637, 104.3464, 640.0000, 452.3668],
[487.3306,   0.7818, 640.0000, 251.5550]]

The top-20 pred_scores are:

[0.7888, 0.0627, 0.0613, 0.0261, 0.0121, 0.0079, 0.0065, 0.0051, 0.0043,
0.0039, 0.0036, 0.0034, 0.0031, 0.0030, 0.0030, 0.0028, 0.0028, 0.0021,
0.0021, 0.0020]

You can use this sample to check whether there is a problem with the mmdet version or with some parameters. Alternatively, you can provide the image names of your test examples and we will test on them~
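
One simple way to compare a local run against this reference output is to measure the IoU between the reference top-1 box and your own top-1 box (the local box below is a placeholder to replace with your prediction):

import torch
from torchvision.ops import box_iou

reference_top1 = torch.tensor([[39.1294, 49.1208, 256.4835, 451.4155]])  # top-1 box listed above
local_top1 = torch.tensor([[39.12, 49.12, 256.48, 451.43]])              # replace with your own top-1 box
print(box_iou(reference_top1, local_top1).item())                        # close to 1.0 -> runs match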

liujunzhuo commented 6 months ago

Here are my results on the same sample. My top-20 pred_bboxes are:

[[39.1192, 49.1206, 256.4813, 451.4343], 
[386.4013, 222.0148, 640.0, 452.6575], 
[502.6575, 106.8746, 639.1531, 243.784], 
[8.2782, 48.1111, 640.0, 454.3926], 
[180.2233, 94.68, 412.8537, 451.0923], 
[379.0567, 106.5491, 640.0, 453.942], 
[250.7527, 96.8303, 382.9833, 273.9555], 
[386.8084, 106.4516, 640.0, 453.771], 
[378.004, 105.6164, 640.0, 457.0],
[36.7511, 49.5608, 416.94, 454.7256], 
[36.405, 47.306, 640.0, 454.4246], 
[207.8284, 232.7865, 416.9753, 455.4554], 
[171.7513, 95.6881, 640.0, 455.5508], 
[250.8095, 91.4642, 640.0, 262.4736], 
[383.1513, 106.5066, 640.0, 452.8468], 
[250.5402, 95.1247, 640.0, 255.6561], 
[166.0703, 97.7855, 640.0, 454.9286], 
[379.0126, 107.7026, 640.0, 453.3336], 
[170.4569, 98.0551, 640.0, 456.0147], 
[387.941, 105.8021, 639.6733, 313.0541]]

My top-20 pred_scores are:

[0.7895, 0.0625, 0.0622, 0.0231, 0.0116, 0.0069, 0.0058, 0.0051, 0.0038, 0.0037, 0.0035, 0.0034, 0.0033, 0.0033, 0.0031, 0.0031, 0.003, 0.0029, 0.0027, 0.0025]

The results are nearly identical; the small differences are likely due to environment-related randomness. Thank you for your reply and for your excellent work!