@aleSuglia
We follow the implementation of VCMR evaluation from TVR: https://github.com/jayleicn/TVRetrieval/blob/master/baselines/clip_alignment_with_language/inference.py
Let me first explain some high-level details:
There are several evaluation settings here:

- `SVMR`: Single Video Moment Retrieval. Given the ground-truth video, the model needs to predict the corresponding video segment (with start and end timestamps) according to the query.
- `VR`: Video Retrieval. Given the query and the test video corpus, the model needs to rank all video candidates and retrieve the most relevant one.
- `VCMR`: Video Corpus Moment Retrieval. This is the combination of `SVMR` and `VR`. For each of the top-k retrieved videos, the model generates video segments. The final predictions are aggregations of retrieved videos and predicted video segments, ranked by the product of the confidence scores (or probabilities) of the retrieved video and the predicted segment. In our implementation, we use k=200, which is set by the hyperparameter `max_before_nms` in the config.

The HERO model generates three predictions: `_q2video_scores` (for video retrieval), `_st_probs` (logits of the start frame index, for single video moment retrieval) and `_ed_probs` (logits of the end frame index, for single video moment retrieval):
https://github.com/VALUE-Leaderboard/StarterCode/blob/7f124b4ddef86af887ca593b369d470bb0e3586b/eval_vcmr.py#L275-L278
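As a rough mental model, the three outputs can be thought of as tensors shaped as follows (these shapes and sizes are my own assumptions for illustration, not taken from the linked code):

```python
import torch

n_queries, n_videos, max_frames = 4, 1000, 128  # made-up sizes

_q2video_scores = torch.randn(n_queries, n_videos)        # query-video similarity (VR)
_st_probs = torch.randn(n_queries, n_videos, max_frames)  # start-frame logits (SVMR)
_ed_probs = torch.randn(n_queries, n_videos, max_frames)  # end-frame logits (SVMR)
```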
`flat_st_ed_sorted_scores` is the aggregated prediction for VCMR, based on all three predictions above: https://github.com/VALUE-Leaderboard/StarterCode/blob/7f124b4ddef86af887ca593b369d470bb0e3586b/eval_vcmr.py#L345-L346
To explain the process:

1. Compute `_q2video_scores`, `_st_probs`, `_ed_probs` and `flat_st_ed_sorted_scores` for each query on each video in the corpus (L248-367).
2. Gather `svmr_res`, `vr_res` and `vcmr_res`:
   - The SVMR prediction for each query is a list of the top-100 predicted video segments, where each element is formatted as `[gt_vid, start_timestamp, end_timestamp, prediction_scores]` (L370-398).
   - The VR prediction for each query is a list of the top-100 retrieved video ids, where each element is formatted as `[gt_vid, 0, 0, prediction_scores]` (L400-415).
   - The VCMR prediction for each query is a list of the top-100 retrieved video ids together with the corresponding segment in each retrieved video, where each element is formatted as `[retrieved_vid, start_timestamp, end_timestamp, prediction_scores]` (L416-458).
3. Compute `eval_res`: https://github.com/VALUE-Leaderboard/StarterCode/blob/7f124b4ddef86af887ca593b369d470bb0e3586b/eval_vcmr.py#L460. This computes R@1, 5, 10, 100 for all three settings (SVMR/VR/VCMR) at L467-503; a minimal sketch of the recall computation is shown after this list.
4. Optionally, apply non-maximum suppression (NMS) when `nms_thd` is set. In TVR, the authors have tried `nms_thd=0.5`. Note that all sample predictions and model performance in VALUE are without NMS.
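For concreteness, here is a minimal sketch of how R@K with a temporal-IoU threshold, plus the optional NMS step, could be implemented over predictions in the `[vid, start, end, score]` format above. The helper names, the tIoU definition and the `tiou_thd=0.7` default are my own illustrative choices; see L467-503 of the linked file for the exact protocol:

```python
def temporal_iou(pred_st, pred_ed, gt_st, gt_ed):
    """Intersection-over-union of two temporal spans."""
    inter = max(0.0, min(pred_ed, gt_ed) - max(pred_st, gt_st))
    union = max(pred_ed, gt_ed) - min(pred_st, gt_st)
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, ground_truth, k, tiou_thd=0.7):
    """Fraction of queries with a correct prediction in the top-k.

    predictions:  {query_id: [[vid, st, ed, score], ...]}  (assumed sorted by score)
    ground_truth: {query_id: [gt_vid, gt_st, gt_ed]}
    """
    hits = 0
    for qid, preds in predictions.items():
        gt_vid, gt_st, gt_ed = ground_truth[qid]
        for vid, st, ed, _score in preds[:k]:
            if vid == gt_vid and temporal_iou(st, ed, gt_st, gt_ed) >= tiou_thd:
                hits += 1
                break
    return hits / len(predictions)

def temporal_nms(preds, nms_thd=0.5):
    """Greedy NMS over [vid, st, ed, score] predictions (sketch of the
    optional `nms_thd` step; VALUE results are reported without it)."""
    kept = []
    for p in sorted(preds, key=lambda x: x[3], reverse=True):
        if all(p[0] != q[0] or temporal_iou(p[1], p[2], q[1], q[2]) < nms_thd
               for q in kept):
            kept.append(p)
    return kept

# Hypothetical usage with invented ids/timestamps in the format above:
preds = {"q1": [["vid_017", 12.0, 18.5, 0.91], ["vid_003", 4.0, 9.0, 0.85]]}
gts = {"q1": ["vid_017", 11.5, 19.0]}
print(recall_at_k(preds, gts, k=1))  # 1.0, since tIoU(12.0-18.5, 11.5-19.0) ~ 0.87
```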
Hope this helps.

Thanks, Linjie
Hi @linjieli222,

I was looking at the validation phase of the retrieval setup and I can see you have implemented two different variants: `validate` and `full_validate`. From my understanding, `validate` only computes loss scores, while `full_validate` also generates predictions on which retrieval-based metrics are computed. The `validate` setup is quite straightforward to me, so no clarification is required there. However, `full_validate` looks quite complex and I haven't seen an actual description of this method in the original paper. In particular, could you please provide a brief description of the steps performed in this function: https://github.com/VALUE-Leaderboard/StarterCode/blob/main/eval_vcmr.py#L172?