ch3cook-fdu / Vote2Cap-DETR

[CVPR 2023] Vote2Cap-DETR and [T-PAMI 2024] Vote2Cap-DETR++; A set-to-set perspective towards 3D Dense Captioning; State-of-the-Art 3D Dense Captioning methods

Question about evaluation metric #16

Closed WeitaiKang closed 2 months ago

WeitaiKang commented 2 months ago

Hi authors,

I am new to this task and would like to ask a question about the evaluation metric in the 3D dense captioning domain, which seems a bit contradictory across the several papers I checked.

In your paper, the captioning metric is averaged over the number of ground-truth instances, so it cannot penalize redundant bbox predictions. However, Scan2Cap and D3Net, which you list in the same Table 1, average the captioning metric by the percentage of correctly predicted bboxes, so those previous works did evaluate redundant bbox predictions.

Is your metric unfair, or am I missing something here? I would really appreciate your help in clarifying this!

ch3cook-fdu commented 2 months ago

Starting from the very first work on 3D Dense Captioning, Scan2Cap, the Recall-only $m$@kIoU metric has been widely adopted as the sole evaluation metric in many existing works, including D3Net, SpaCap3D, etc. The F-1 based $m$@kIoU metric you mentioned is indeed a better way to evaluate the overall performance of 3D Dense Captioning models; however, it was not formally introduced until the UniT3D paper.

In our paper, all methods are listed and evaluated under the Recall-only $m$@kIoU metric, so I believe it is a fair comparison. If you are interested in the F-1 based $m$@kIoU metrics for our method, you can refer to the ScanRefer online test benchmark and read their documentation for more information.
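
For future readers, here is a minimal sketch of how the two averaging schemes differ. This is not the repo's or the benchmark's actual evaluation code: the greedy IoU matching strategy, the `caption_score` placeholder (stand-in for CIDEr / BLEU-4 / METEOR / ROUGE-L), and all names are assumptions made purely for illustration.

```python
# Illustrative sketch only (not the official evaluation code): contrasts the
# Recall-only m@kIoU with an F-1 style m@kIoU. Matching strategy, names, and
# the caption_score callable are assumptions; plug in a real caption metric.
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float, float, float]  # (x1, y1, z1, x2, y2, z2)


def iou_3d(a: Box, b: Box) -> float:
    """Axis-aligned 3D IoU between two boxes given as (min, max) corners."""
    inter = vol_a = vol_b = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        inter *= max(0.0, hi - lo)
        vol_a *= a[i + 3] - a[i]
        vol_b *= b[i + 3] - b[i]
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0


def m_at_kiou(
    preds: List[Tuple[Box, str]],   # (predicted box, generated caption)
    gts: List[Tuple[Box, str]],     # (ground-truth box, reference caption)
    caption_score: Callable[[str, str], float],
    k: float = 0.5,
) -> Tuple[float, float]:
    """Return (recall_only, f1_based) m@kIoU under a greedy one-to-one matching."""
    used = [False] * len(preds)
    matched_scores = []
    for gt_box, gt_cap in gts:
        # Greedily take the best unused prediction with IoU >= k
        # (assumption: papers differ in the exact assignment strategy).
        best_j, best_iou = -1, k
        for j, (pred_box, _) in enumerate(preds):
            iou = iou_3d(pred_box, gt_box)
            if not used[j] and iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0:
            used[best_j] = True
            matched_scores.append(caption_score(preds[best_j][1], gt_cap))
    total = sum(matched_scores)
    recall_only = total / max(len(gts), 1)   # unmatched GT objects count as 0
    precision = total / max(len(preds), 1)   # redundant predictions dilute this
    f1_based = (
        2 * precision * recall_only / (precision + recall_only)
        if precision + recall_only > 0 else 0.0
    )
    return recall_only, f1_based
```

The recall-only number stays the same no matter how many extra boxes a model predicts, while the F-1 style score drops as redundant predictions pile up, which is exactly the concern raised above.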

WeitaiKang commented 2 months ago

Great! Thanks for your comprehensive explanation and for pointing out those related works. It helps me a lot.