Closed · liujunzhuo closed this issue 6 months ago
Thanks for your attention! Yes, the checkpoint corresponds to the paper's results. Have you set your box_threshold to a high value like 0.7? In our experiments we set it to 0.3 so that there is roughly one box per image; this value follows GroundingDINO's inference setting. Also, did you get the correct result through chat.py? It should match the results we show in the paper. If this doesn't solve your problem, please feel free to leave a message here and we'll get back to you as soon as we see it!
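To make the effect of `box_threshold` concrete, here is a minimal, hypothetical sketch (plain Python, not code from the repo) of the filtering step:

```python
def filter_by_score(boxes, scores, box_threshold=0.3):
    """Keep only predictions whose confidence passes box_threshold.

    With GroundingDINO-style score distributions, 0.3 typically leaves
    about one box per image, while 0.7+ can leave none for hard queries.
    """
    return [(b, s) for b, s in zip(boxes, scores) if s >= box_threshold]
```

For example, with a score list like `[0.7888, 0.0627, 0.0613, ...]` only the first box survives at 0.3, and raising the threshold to 0.8 would discard everything.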
Thank you for your quick response! I have set the threshold to 0, and in the RefCOCO validation set only 118/10834 samples have scores below 0.3, which should not significantly affect the results. The previous results were Top-1 accuracy; the complete results are as follows:
| Precision @ 1 | Precision @ 5 | Precision @ 10 |
|---|---|---|
| 0.752353701 | 0.972217094 | 0.985231678 |
I also computed the accuracy over all boxes with box_score > 0.3, and the result is 0.87.
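For reference, Precision@k as reported above can be computed like this; a hypothetical pure-Python helper (not from the repo) that assumes each image's IoUs are already sorted by descending prediction score:

```python
def precision_at_k(per_image_ious, k, iou_thr=0.5):
    """Fraction of images whose top-k predictions contain at least one
    box with IoU >= iou_thr against the single ground-truth box."""
    hits = sum(1 for ious in per_image_ious
               if any(iou >= iou_thr for iou in ious[:k]))
    return hits / len(per_image_ious)
```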
Here is my evaluator, using 0.3 as the threshold. I call `evaluator.process(outputs, data["image_id"])` to save the results at the end of each iteration and finally use `compute_metrics` to calculate the overall results:
```python
from typing import Optional, Sequence

import numpy as np
from mmdet.evaluation import bbox_overlaps
from pycocotools.coco import COCO


class RefEvalThr:

    def __init__(self,
                 ann_file: Optional[str] = None,
                 metric: str = 'bbox',
                 topk=(1, 5, 10),
                 iou_thrs: float = 0.5,
                 save_dir: Optional[str] = None,
                 **kwargs) -> None:
        self.results = []
        self.metric = metric
        self.topk = topk
        self.iou_thrs = iou_thrs
        self.save_dir = save_dir
        self.coco = COCO(ann_file)

    def process(self, data_samples: Sequence, img_id) -> None:
        # Collect the predicted boxes and scores for one image.
        for data_sample in data_samples:
            pred = data_sample.pred_instances
            result = dict(
                img_id=img_id,
                bboxes=pred.bboxes.cpu().numpy().copy(),
                scores=pred.scores.cpu().numpy().copy())
            self.results.append(result)

    def compute_metrics(self, results: list):
        if self.save_dir is not None:
            np.save(self.save_dir, results)
        dataset2score = {k: 0.0 for k in self.topk}
        dataset2count = 0
        for result in results:
            img_id = result['img_id']
            ann_ids = self.coco.getAnnIds(imgIds=img_id)
            # RefCOCO-style annotations: exactly one GT box per image.
            assert len(ann_ids) == 1
            target = self.coco.loadAnns(ann_ids[0])
            target_bbox = target[0]['bbox']
            # Convert COCO [x, y, w, h] to [x1, y1, x2, y2].
            converted_bbox = [
                target_bbox[0],
                target_bbox[1],
                target_bbox[2] + target_bbox[0],
                target_bbox[3] + target_bbox[1],
            ]
            # torchvision.ops.box_iou gives the same values here.
            iou = bbox_overlaps(
                result['bboxes'],
                np.array(converted_bbox).reshape(-1, 4)).reshape(-1)
            scores = result['scores']
            # Keep only boxes whose score passes the 0.3 threshold; count a
            # hit if any surviving box reaches the IoU threshold. Only the
            # k=1 slot is used for this thresholded metric.
            filtered_iou = iou[scores >= 0.3]
            if len(filtered_iou) > 0 and max(filtered_iou) >= self.iou_thrs:
                dataset2score[1] += 1.0
            dataset2count += 1.0
        print(f"count positive: {dataset2score[1]}")
        for k in self.topk:
            try:
                dataset2score[k] /= dataset2count
            except ZeroDivisionError as e:
                print(e)
        print(dataset2score)
```
Here are some examples of wrong predictions:
I wonder whether my evaluation approach is flawed, whether there are inconsistencies in the mmdet version, or whether hyperparameters, randomness, or other factors are at play. Your insights on these matters would be greatly appreciated. Thank you once again for your swift response!
Hi, here is a sample:
caption: chair that the boy sit on
prompt: Please detect the chair that the boy sit on in this image.
image: COCO_train2014_000000083508.jpg
res:
```
pred_bboxes_filt: [[ 39.1294, 49.1208, 217.3541, 402.2947]]
pred_scores_filt: [0.7888]
```
Before filtering by the threshold (0.3), the top-20 pred_bboxes are:
```
[[ 39.1294,  49.1208, 256.4835, 451.4155],
 [386.3545, 222.0057, 640.0000, 452.6325],
 [502.6339, 106.8788, 639.1548, 243.7754],
 [ 33.5774,  47.3136, 640.0000, 453.1940],
 [179.8883,  94.7124, 412.8659, 451.1689],
 [378.2805, 106.8258, 640.0000, 452.1395],
 [250.4331,  96.8331, 381.4485, 274.8452],
 [372.9280, 106.1171, 640.0000, 453.8893],
 [374.7803, 106.9710, 640.0000, 452.8337],
 [249.4045,  95.0679, 640.0000, 260.4560],
 [ 36.7581,  49.5404, 417.0672, 454.6150],
 [367.1216, 105.4114, 640.0000, 456.2697],
 [250.8099,  91.4537, 640.0000, 262.8784],
 [165.9880,  97.8375, 640.0000, 454.9725],
 [170.9866,  97.8001, 640.0000, 456.0018],
 [458.9583, 256.8557, 638.2808, 327.3876],
 [248.9486,  95.4559, 640.0000, 259.1696],
 [ 36.0579,  45.2150, 640.0000, 304.8564],
 [320.7637, 104.3464, 640.0000, 452.3668],
 [487.3306,   0.7818, 640.0000, 251.5550]]
```
top 20 pred_scores are:
```
[0.7888, 0.0627, 0.0613, 0.0261, 0.0121, 0.0079, 0.0065, 0.0051, 0.0043,
 0.0039, 0.0036, 0.0034, 0.0031, 0.0030, 0.0030, 0.0028, 0.0028, 0.0021,
 0.0021, 0.0020]
```
You can use this sample to check whether there is a problem with the mmdet version or with some parameters. Or you can share the image names of your test examples and we will test on them~
top 20 pred_bboxes:
```
[[ 39.1192,  49.1206, 256.4813, 451.4343],
 [386.4013, 222.0148, 640.0,    452.6575],
 [502.6575, 106.8746, 639.1531, 243.784 ],
 [  8.2782,  48.1111, 640.0,    454.3926],
 [180.2233,  94.68,   412.8537, 451.0923],
 [379.0567, 106.5491, 640.0,    453.942 ],
 [250.7527,  96.8303, 382.9833, 273.9555],
 [386.8084, 106.4516, 640.0,    453.771 ],
 [378.004,  105.6164, 640.0,    457.0   ],
 [ 36.7511,  49.5608, 416.94,   454.7256],
 [ 36.405,   47.306,  640.0,    454.4246],
 [207.8284, 232.7865, 416.9753, 455.4554],
 [171.7513,  95.6881, 640.0,    455.5508],
 [250.8095,  91.4642, 640.0,    262.4736],
 [383.1513, 106.5066, 640.0,    452.8468],
 [250.5402,  95.1247, 640.0,    255.6561],
 [166.0703,  97.7855, 640.0,    454.9286],
 [379.0126, 107.7026, 640.0,    453.3336],
 [170.4569,  98.0551, 640.0,    456.0147],
 [387.941,  105.8021, 639.6733, 313.0541]]
```
top 20 pred_scores:
```
[0.7895, 0.0625, 0.0622, 0.0231, 0.0116, 0.0069, 0.0058, 0.0051, 0.0038,
 0.0037, 0.0035, 0.0034, 0.0033, 0.0033, 0.0031, 0.0031, 0.003,  0.0029,
 0.0027, 0.0025]
```
The results are similar; the small differences are probably due to environment randomness. Thank you for your reply and your excellent work!
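A quick numeric check of the two runs' top-1 boxes (coordinates copied from the outputs above) shows they agree to within about 0.02 pixels, which supports the randomness explanation:

```python
# Top-1 box from each run, taken from the outputs above.
run_a = [39.1294, 49.1208, 256.4835, 451.4155]
run_b = [39.1192, 49.1206, 256.4813, 451.4343]

# Largest per-coordinate deviation between the two runs.
max_dev = max(abs(x - y) for x, y in zip(run_a, run_b))
print(max_dev)  # ~0.019 px, far too small to change any IoU@0.5 decision
```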
I'm currently diving into your fascinating work, but I'm running into a bit of trouble while trying to replicate the results on the RefCOCO dataset using the provided model checkpoint. My tests show an accuracy@0.5 of 0.75 on the RefCOCO val set.
Environment
Approach
I used the MDETR-provided COCO-format annotations for RefCOCO, which align with MM-Grounding-DINO. The evaluation organizes the annotations into a list and runs a simple loop: in each iteration I replace `g_dino_caption` and `image_path` with the corresponding RefCOCO elements and run inference. I use an evaluator similar to MM-Grounding-DINO's to save and compute the results. I also tried both prompt templates mentioned in the paper, and the results remain the same:
'Please detect the {g_dino_caption} in this image.' and 'What is {g_dino_caption} in this image? Please output object location.'
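For concreteness, the two templates can be filled in like this (a trivial sketch; `g_dino_caption` is the RefCOCO referring expression, and `build_prompts` is a hypothetical helper, not repo code):

```python
TEMPLATES = [
    'Please detect the {} in this image.',
    'What is {} in this image? Please output object location.',
]

def build_prompts(g_dino_caption):
    # One prompt per template for the same referring expression.
    return [t.format(g_dino_caption) for t in TEMPLATES]
```

For example, `build_prompts('chair that the boy sit on')[0]` yields exactly the prompt shown earlier in this thread.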
Could you please confirm whether the provided checkpoint is the final version for RefCOCO? Additionally, any insights or suggestions on potential issues in my evaluation approach would be highly appreciated.
Thank you for your assistance and for providing the resources for replication!