Soldelli / VLG-Net

VLG-Net: Video-Language Graph Matching Networks for Video Grounding
MIT License

About R@1 iou1.0 on didemo #7

Closed LLLddddd closed 1 year ago

LLLddddd commented 1 year ago

Could you please tell me why the performance on DiDeMo is so similar at IoU=0.7 and IoU=1.0? Generally, performance at IoU=1.0 is much lower than at IoU=0.7, yet R@1 IoU=0.7 (25.57) and R@1 IoU=1.0 (25.57) are equal in this paper (Table 4).

And when I try to run other models on DiDeMo, I get results similar to VLG-Net at IoU=0.5 and IoU=0.7, but much lower results at IoU=1.0 (R@1 IoU=1.0 = 0.12, R@5 IoU=1.0 = 0.90). Is there any mistake I might be making?

LLLddddd commented 1 year ago

@Soldelli

Soldelli commented 1 year ago

Dear @LLLddddd, thank you for reaching out. The key to understanding the DiDeMo results is to keep in mind the annotation format. Each annotation was collected as a coarse alignment between the video and the textual query: start and end timestamps can only be multiples of 5s.

Therefore, you can design your proposal scheme to reflect this characteristic. In this paper we follow the same approach to proposal generation as 2D-TAN, using sparse proposals of fixed durations that match the dataset distribution.

In particular, for DiDeMo there are a total of 21 possible "meaningful" proposals: each video spans at most 30s, i.e. six 5s segments, and every contiguous run of segments is a valid moment (6+5+4+3+2+1 = 21, see the sketch below). Given such a small predefined set, the model simply learns to pick the best one, leading to potentially high R@1 performance. With this particular proposal configuration, it just so happens that the performance at different IoU thresholds does not change much.
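
For illustration only (this snippet is not part of the repository), here is a minimal sketch that enumerates the 21 candidate proposals, assuming a 30s video split into six 5s segments:

```python
# Minimal sketch (assumption: 30s DiDeMo video split into six 5s segments).
# Every contiguous run of segments is one candidate proposal.
SEGMENT_LEN = 5    # seconds
NUM_SEGMENTS = 6   # 6 * 5s = 30s video

proposals = [
    (start * SEGMENT_LEN, (end + 1) * SEGMENT_LEN)
    for start in range(NUM_SEGMENTS)
    for end in range(start, NUM_SEGMENTS)
]
print(len(proposals))   # 21
print(proposals[:3])    # [(0, 5), (0, 10), (0, 15)]
```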

If you were to use a denser ("more fine-grained") proposal scheme that does not account for the annotation distribution, it would be much harder to achieve high R@1 performance, effectively yielding a large gap between the different IoU thresholds.
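
As a hypothetical illustration of that gap, the snippet below computes temporal IoU: a prediction drawn from the same 5s-aligned set as the annotation is often an exact match (IoU = 1.0), while an off-grid prediction from a denser scheme can easily fall between the 0.7 and 1.0 thresholds:

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

gt = (5.0, 15.0)                      # annotation aligned to the 5s grid
print(temporal_iou((5.0, 15.0), gt))  # 1.0 -> counts at both IoU=0.7 and IoU=1.0
print(temporal_iou((6.0, 15.0), gt))  # 0.9 -> counts at IoU=0.7 but not at IoU=1.0
```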

I hope this clarifies your doubt. Feel free to ask more questions should you need additional clarification.

Best, Mattia