Dear Jobin,
First of all, thanks for your interest in our code! Now, to answer your question: the results reported in Table III of our paper correspond to a special case where the average performance of each method is computed as the average of the maximum performance on the test set of each utilized data split. This means that, for each method, we manually pick the maximum F-Score on the test data of each data split and compute the average of these values (we followed this evaluation protocol when comparing our method with VASNet and MSVA, as this is the evaluation approach applied in those works).

In Table IV of our paper, we follow an evaluation approach that automatically picks a well-trained model of the method using only training data (through the proposed model selection criterion) and computes the method's performance by averaging the performance of the different selected models (one per data split) on the test set of each data split. We believe that the latter approach is more valid, and this is why we used its results when comparing our method with other video summarization approaches from the literature in Table IV.
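To make the difference between the two protocols concrete, here is a minimal, hypothetical sketch; the array names, shapes, and values are assumptions for illustration and are not part of the actual repo or paper.

```python
# Hypothetical sketch of the two evaluation protocols discussed above.
# f_scores[s][e] is assumed to hold the test-set F-Score of split s at training epoch e;
# selected_epoch[s] is assumed to be the epoch picked by a model selection criterion
# that relies only on training data. Neither structure is part of the actual repo.

import numpy as np

f_scores = np.random.rand(5, 100) * 100   # 5 splits, 100 epochs (dummy values)
selected_epoch = [42, 17, 63, 8, 91]      # dummy choices of the selection criterion

# Protocol of Table III: average of the per-split maximum test F-Score
# (the best epoch is effectively picked by looking at the test data).
table3_style = np.mean([f_scores[s].max() for s in range(len(f_scores))])

# Protocol of Table IV: average test F-Score of the models selected
# per split without looking at the test data.
table4_style = np.mean([f_scores[s][selected_epoch[s]] for s in range(len(f_scores))])

print(f"avg of per-split maxima : {table3_style:.2f}")
print(f"avg of selected models  : {table4_style:.2f}")
```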
The "evaluate_exp.sh" script contains the model selection step (see evaluation/choose_best_epoch.py $exp_path $dataset), and so the results returned by this script are the ones reported in Table IV.
I hope this explanation helps. If there are any other questions, please let me know.
Kind regards, Evlampios
Dear Evlampios,
I'm really sorry about my late reply! Thank you for your detailed response, this cleared things up for me! :)
I had a related question, though: do you also perform evaluation in the "Augmented" and "Transfer" settings, as described in, for example, the VASNet and DR-DSN (Zhou et al., 2018) papers? I couldn't find any mention of this in your paper or in this repo, so I was curious to hear your thoughts on this.
Warm regards, Jobin
Dear Jobin,
We did not perform any evaluations under the Augmented and Transfer settings that are reported in a few works in the literature. Using the OVP and YouTube datasets for supervised training of a video summarization model can be a bit tricky, as the ground-truth annotations in these datasets indicate sets of keyframes (static summaries). The multiple available annotations for each video have to be combined into a single ground-truth annotation per video, which ideally should be in non-binary form in order to facilitate training. We did run some tests with a few unsupervised methods (which do not require ground-truth annotations) under the Augmented and Transfer settings, and they showed poor summarization performance.
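For illustration only, here is a minimal sketch of the kind of annotation merging described above; the array contents are made up, and this is not a procedure taken from the paper or the repo.

```python
# Hypothetical illustration of the issue mentioned above: OVP/YouTube provide multiple
# keyframe-based (binary) annotations per video, and one simple way to obtain a single
# non-binary ground-truth signal is to average them into frame-level importance scores.
# This is only a sketch of the idea, not the procedure used in the paper.

import numpy as np

# Three annotators, each marking a set of keyframes (1 = keyframe, 0 = not) for a
# 10-frame toy video.
annotations = np.array([
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
])

# Averaging the binary annotations yields a non-binary importance score per frame,
# which is a form that is easier to use as a regression target during training.
importance = annotations.mean(axis=0)
print(np.round(importance, 2))   # [0.33 1.   0.   0.33 0.67 0.   0.   0.67 0.   0.  ]
```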
Kind regards, Evlampios
Dear Evlampios,
Thanks for the clarification; this helped me understand the problem a lot better! :) Closing this issue now, as my queries have been answered.
Best, Jobin
Hi Georgios,
Many thanks to you and the other authors for open-sourcing this work! :)
I'm using the code in this repo to test out other summarization models as well, and I had a question regarding the evaluation methodology that I was hoping you could clarify: when
evaluation/evaluate_exp.sh
is run, I get the best epoch (and hence the corresponding checkpoint) for each split file. The F-Scores for each split and the average F-Score reported by this script are computed on the test videos. Now, if I wish to quickly check the performance of some summarization model, as in Table III of your paper, I can look at this average F-Score, right? To rephrase: is the average F-Score returned by this script the same as the one reported in Table III of the paper?
Best, Jobin