HYPJUDY / Decouple-SSAD

Decoupling Localization and Classification in Single Shot Temporal Action Detection
https://arxiv.org/abs/1904.07442
MIT License

evaluation on partial test set #23

Closed rahman-mdatiqur closed 4 years ago

rahman-mdatiqur commented 4 years ago

Hello @HYPJUDY,

It seems that you are not evaluating on the full THUMOS'14 test set. As reported in your paper, the THUMOS'14 detection task is evaluated on 213 test videos. However, your test window_info.log file is missing window info for the following 3 test videos, because your thumos14_test_annotation.csv is missing annotations for them. As a result, you are effectively evaluating your model on 210 test videos instead of 213.

video_test_0000270 video_test_0001292 video_test_0001496
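One way to reproduce this check is sketched below; the file paths, the CSV column name "video", and the helper file listing the 213 official test video IDs are assumptions on my part, so adjust them to the actual layout:

```python
# Minimal sketch (not from the repo) to list test videos that appear in the
# official THUMOS'14 annotations but not in thumos14_test_annotation.csv.
import pandas as pd

anno = pd.read_csv("data/thumos14_test_annotation.csv")
annotated_videos = set(anno["video"])  # assumed column name

# IDs of the 213 officially annotated test videos, e.g. collected beforehand
# from the TH14_Temporal_Annotations_Test ground-truth files.
with open("official_test_video_ids.txt") as f:  # hypothetical helper file
    official_videos = {line.strip() for line in f if line.strip()}

missing = sorted(official_videos - annotated_videos)
print(len(missing), "missing:", missing)
# Expected: video_test_0000270, video_test_0001292, video_test_0001496
```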

Can you please comment on why this is the case?

Thanks much.

HYPJUDY commented 4 years ago

Hi @rahman-mdatiqur , I am happy to clarify this issue.

TL;DR: The original source of thumos14_test_annotation.csv leaves these three videos out, because their annotations appear to be incorrect: they extend beyond the videos' durations.

Details:

  1. thumos14_test_annotation.csv is generated by data/gen_thumos14_anno.py, which is based on SSN (see Preprocess Data by Yourself).
  2. The ground truth file from SSN is thumos14_tag_proposal_list/thumos14_tag_test_proposal_list.txt, which does not contain these three videos (see data/gen_thumos14_anno.py); a quick check is sketched after this list.
  3. I also noticed this issue two years ago (see https://github.com/yjxiong/action-detection/issues/31) and would like to share more detailed information on these three videos:

    thumos14_tag_test_normalized_proposal_list.txt has 200 videos, while there are 213 videos in TH14_Temporal_Annotations_Test\xgtf_renamed. Two of the missing videos are reasonably excluded: video_test_0000270 (its annotations are HammerThrow, but the actual content is HairCut, which does not belong to the 20 classes) and video_test_0001292 (it only has Ambiguous annotations). It seems that the third missing video, video_test_0001496, could be included in the test list after correcting its annotations (the annotations are CricketShot, while the actual content is FrisbeeCatch).

    Another related issue FYI: https://github.com/yjxiong/action-detection/issues/13

  4. The original replies to this issue in SSN:

    Their annotation seems to be longer than the duration of the videos. It is better to leave them out.

    If I remember correctly, these 3 videos have incorrect annotations which are sitting beyond the videos’ time span.
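As mentioned in point 2, here is a rough sanity check that the SSN proposal list really omits these three videos. It simply looks for the video IDs as substrings, which assumes each entry's path line contains the video name; the file path follows the data preparation step above:

```python
# Rough check (assumed file format) that SSN's proposal list omits the
# three videos discussed in this issue.
missing_ids = ["video_test_0000270", "video_test_0001292", "video_test_0001496"]

with open("thumos14_tag_proposal_list/thumos14_tag_test_proposal_list.txt") as f:
    content = f.read()

for vid in missing_ids:
    print(vid, "present" if vid in content else "absent")
```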

rahman-mdatiqur commented 4 years ago

Hello @HYPJUDY,

Thanks again for your quick and wonderful response.

It eventually directs me to raise the following concern.

Since you do not mention in the paper that you are evaluating on 210 videos instead of 213, how fair is it to compare your method in Table 2 against other SOTA methods that report results on all 213 test videos? In other words, does leaving those test videos out of the evaluation give you any advantage over the other SOTA methods in terms of mAP? I know that you are not removing the corresponding annotations from the ground-truth annotations located in https://github.com/HYPJUDY/Decouple-SSAD/tree/master/EvalKit/THUMOS14_evalkit_20150930/annotation. But I did not check the evaluation script to see whether leaving some videos out of the evaluation set would be advantageous or disadvantageous.

Can you please comment on this?

Thanks in advance.

HYPJUDY commented 4 years ago

Hi @rahman-mdatiqur , thanks for the good question. I have only taken a quick look at the evaluation script, since I do not have much time at the moment.

https://github.com/HYPJUDY/Decouple-SSAD/blob/e3b65392cc4be7dd3887bd38996792ce6dc698f0/EvalKit/THUMOS14_evalkit_20150930/TH14evalDet.m#L135 It seems that the code only evaluates the videos common to the ground truth and the detection results. So if the model produces good (bad) results for these three videos, the mAP should become better (worse) when their results are incorporated. If your code is ready, you can quickly validate this with ablation experiments (see the sketch after this list):

  1. Remove more videos from the result file to see how the performance changes.
  2. Remove the ground-truth annotations of these three videos from the EvalKit to see how the performance changes.
  3. Include the evaluation results of these three videos to see how the performance changes.
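For example, for ablation 1, filtering the result file might look like the sketch below. The result-file path and the one-detection-per-line format ("video_name t_start t_end class_id score") are assumptions on my part, not necessarily the actual output format:

```python
# Sketch for ablation 1: drop a chosen set of videos from the detection
# result file before running the MATLAB EvalKit.
videos_to_drop = {"video_test_0000270", "video_test_0001292", "video_test_0001496"}

with open("results/detections.txt") as fin, \
     open("results/detections_filtered.txt", "w") as fout:
    for line in fin:
        if not line.strip():
            continue
        video_name = line.split()[0]  # first field is assumed to be the video ID
        if video_name not in videos_to_drop:
            fout.write(line)
# Then re-run TH14evalDet.m on detections_filtered.txt and compare the mAP.
```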

I think if the annotations of some videos are obviously wrong, then we should exclude them; otherwise the overall result is not correct, and evaluation on those wrongly annotated videos is meaningless. I should have clarified the video number (210) in the paper. Thanks for the reminder.

rahman-mdatiqur commented 4 years ago

Thanks @HYPJUDY for suggesting ways to evaluate the effect of excluding videos from the prediction list. As you said, doing well (poorly) on these videos may improve (degrade) the final mAP@0.5, and since the SOTA methods report results on all 213 videos without modifying the ground-truth annotations, I believe new methods should follow the same protocol when comparing with SOTA methods, or at least mention the video count in the comparison.

Thanks much for all the thoughts and helpful feedback.

HYPJUDY commented 4 years ago

You are welcome!