@aleSuglia
We follow the implementation of VCMR evaluation from TVR: https://github.com/jayleicn/TVRetrieval/blob/master/baselines/clip_alignment_with_language/inference.py
Let me first explain some high-level details:
There are several evaluation settings here:

- `SVMR`: Single Video Moment Retrieval. Given the ground-truth video, the model needs to predict the corresponding video segment (with start and end timestamps) according to the query.
- `VR`: Video Retrieval. Given the query and the test video corpus, the model needs to rank all video candidates and retrieve the most relevant one.
- `VCMR`: Video Corpus Moment Retrieval. This is the combination of `SVMR` and `VR`. For each of the top-k retrieved videos, the model generates video segments. The final predictions are aggregations of retrieved videos and predicted video segments, ranked by the product of the confidence scores (or probabilities) of the retrieved video and the predicted segment. In our implementation, we use k=200, which is set by the hyperparameter `max_before_nms` in the config.

The HERO model generates three predictions: `_q2video_scores` (for video retrieval), `_st_probs` (logits of the start frame index, for single video moment retrieval) and `_ed_probs` (logits of the end frame index, for single video moment retrieval):
https://github.com/VALUE-Leaderboard/StarterCode/blob/7f124b4ddef86af887ca593b369d470bb0e3586b/eval_vcmr.py#L275-L278
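As a rough mental model, the three outputs can be thought of as tensors shaped as follows (these shapes and sizes are my own assumptions for illustration, not taken from the linked code):

```python
import torch

n_queries, n_videos, max_frames = 4, 1000, 128  # made-up sizes

_q2video_scores = torch.randn(n_queries, n_videos)        # query-video similarity (VR)
_st_probs = torch.randn(n_queries, n_videos, max_frames)  # start-frame logits (SVMR)
_ed_probs = torch.randn(n_queries, n_videos, max_frames)  # end-frame logits (SVMR)
```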
`flat_st_ed_sorted_scores` is the aggregated prediction for VCMR, based on all three predictions above: https://github.com/VALUE-Leaderboard/StarterCode/blob/7f124b4ddef86af887ca593b369d470bb0e3586b/eval_vcmr.py#L345-L346
To explain the process:

1. Compute `_q2video_scores`, `_st_probs`, `_ed_probs` and `flat_st_ed_sorted_scores` for each query on each video in the corpus (L248-367).
2. Gather `svmr_res`, `vr_res` and `vcmr_res`:
   - The SVMR prediction for each query is a list of the top-100 predicted video segments, where each element is formatted as `[gt_vid, start_timestamp, end_timestamp, prediction_scores]` (L370-398).
   - The VR prediction for each query is a list of the top-100 retrieved video ids, where each element is formatted as `[gt_vid, 0, 0, prediction_scores]` (L400-415).
   - The VCMR prediction for each query is a list of the top-100 retrieved video ids together with the corresponding segment in each retrieved video, where each element is formatted as `[retrieved_vid, start_timestamp, end_timestamp, prediction_scores]` (L416-458).
3. Compute `eval_res`: https://github.com/VALUE-Leaderboard/StarterCode/blob/7f124b4ddef86af887ca593b369d470bb0e3586b/eval_vcmr.py#L460. This computes R@1, 5, 10, 100 for all three settings (SVMR/VR/VCMR) at L467-503; a minimal sketch of the recall computation is shown after this list.
4. Optionally, apply non-maximum suppression (NMS) when `nms_thd` is set. In TVR, the authors have tried `nms_thd=0.5`. Note that all sample predictions and model performance in VALUE are without NMS.
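For concreteness, here is a minimal sketch of how R@K with a temporal-IoU threshold, plus the optional NMS step, could be implemented over predictions in the `[vid, start, end, score]` format above. The helper names, the tIoU definition and the `tiou_thd=0.7` default are my own illustrative choices; see L467-503 of the linked file for the exact protocol:

```python
def temporal_iou(pred_st, pred_ed, gt_st, gt_ed):
    """Intersection-over-union of two temporal spans."""
    inter = max(0.0, min(pred_ed, gt_ed) - max(pred_st, gt_st))
    union = max(pred_ed, gt_ed) - min(pred_st, gt_st)
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, ground_truth, k, tiou_thd=0.7):
    """Fraction of queries with a correct prediction in the top-k.

    predictions:  {query_id: [[vid, st, ed, score], ...]}  (assumed sorted by score)
    ground_truth: {query_id: [gt_vid, gt_st, gt_ed]}
    """
    hits = 0
    for qid, preds in predictions.items():
        gt_vid, gt_st, gt_ed = ground_truth[qid]
        for vid, st, ed, _score in preds[:k]:
            if vid == gt_vid and temporal_iou(st, ed, gt_st, gt_ed) >= tiou_thd:
                hits += 1
                break
    return hits / len(predictions)

def temporal_nms(preds, nms_thd=0.5):
    """Greedy NMS over [vid, st, ed, score] predictions (sketch of the
    optional `nms_thd` step; VALUE results are reported without it)."""
    kept = []
    for p in sorted(preds, key=lambda x: x[3], reverse=True):
        if all(p[0] != q[0] or temporal_iou(p[1], p[2], q[1], q[2]) < nms_thd
               for q in kept):
            kept.append(p)
    return kept

# Hypothetical usage with invented ids/timestamps in the format above:
preds = {"q1": [["vid_017", 12.0, 18.5, 0.91], ["vid_003", 4.0, 9.0, 0.85]]}
gts = {"q1": ["vid_017", 11.5, 19.0]}
print(recall_at_k(preds, gts, k=1))  # 1.0, since tIoU(12.0-18.5, 11.5-19.0) ~ 0.87
```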
Hope this helps.

Thanks, Linjie
Hi @linjieli222,

I was looking at the validation phase of the retrieval setup and I can see you have implemented two different variants: `validate` and `full_validate`. From my understanding, `validate` only computes loss scores, while `full_validate` also generates predictions on which retrieval-based metrics are computed. The `validate` setup is quite straightforward to me, so no clarification is required there. However, `full_validate` looks quite complex and I haven't seen an actual description of this method in the original paper. In particular, could you please provide a brief description of the steps performed in this function: https://github.com/VALUE-Leaderboard/StarterCode/blob/main/eval_vcmr.py#L172?