Closed dzabraev closed 4 years ago
Hi,
I am sure the YouCook2 test videos were removed from HowTo100M training videos. You can check it yourself by downloading the caption files from: https://www.di.ens.fr/willow/research/howto100m/ and checking the video ids of the HowTo100M dataset.
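Such a check might be sketched like this (a sketch only: the id-loading step is left out because the actual caption file layout isn't shown here, so `howto100m_ids` and `youcook2_ids` are assumed to be lists of YouTube video ids you have already parsed from the two releases):

```python
# Hypothetical sketch: compare YouTube video ids across the two datasets.
# How you load the ids from the HowTo100M caption files and the YouCook2
# annotations is an assumption; only the set intersection is shown here.

def find_overlap(howto100m_ids, youcook2_ids):
    """Return the set of video ids present in both datasets."""
    return set(howto100m_ids) & set(youcook2_ids)

if __name__ == "__main__":
    # Toy ids standing in for the real lists parsed from the release files.
    howto_ids = ["abc123", "def456", "ghi789"]
    youcook_ids = ["xyz000", "def456"]
    print(sorted(find_overlap(howto_ids, youcook_ids)))
```

An empty intersection would confirm the filtering; any shared ids would indicate leakage.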
For me, I can understand the drop in performance between the low-quality and the standard-quality decoding. CNNs are known to be highly sensitive to texture rather than global shape (see https://openreview.net/forum?id=Bygh9j09KX or https://arxiv.org/pdf/1604.04004.pdf); that is also why they are so vulnerable to adversarial perturbations. The difference I see between the good-quality and bad-quality images you show is significant in terms of texture, hence the difference in performance.
By the way, in my experience the default ffmpeg quality is the best one, not the compressed one.
Also, one thing to note: I think qscale does not control an overall JPEG compression level, but rather the video compression's variable bitrate frame by frame. That means the compression level varies across frames, so looking at only one frame is not very informative, especially if there is very little motion.
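One cheap way to see that frame-by-frame variation, assuming frames were dumped with the `frame-%06d.jpg` naming used elsewhere in this thread, is to compare the JPG file sizes across the whole clip instead of eyeballing one frame. This is only a rough proxy for per-frame quantization (file size also depends on scene content), and the glob pattern is an assumption:

```python
import glob
import os
import statistics


def frame_size_stats(sizes):
    """Summarize per-frame JPG byte sizes; a large spread suggests that
    the compression level is varying from frame to frame."""
    return {
        "mean": statistics.mean(sizes),
        "stdev": statistics.pstdev(sizes),
        "min": min(sizes),
        "max": max(sizes),
    }


def stats_for_dir(frame_dir):
    # Pattern assumes frames were written as frame-%06d.jpg by ffmpeg.
    paths = sorted(glob.glob(os.path.join(frame_dir, "frame-*.jpg")))
    return frame_size_stats([os.path.getsize(p) for p in paths])


if __name__ == "__main__":
    # Toy byte sizes standing in for real frame files.
    print(frame_size_stats([40_000, 42_000, 15_000, 41_000]))
```

Running `stats_for_dir` on dumps from the two ffmpeg commands would show whether one setting produces much more size variation than the other.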
I ran your source code with the checkpoint you released and the YouCook2 dataset you provided. The results of my evaluation on YouCook2 are R@1=12.68, R@5=34.01, R@10=46.90, Median=12, which is much worse than the reported results R@1=15.1, R@5=38.0, R@10=51.2, Median=10. I used the default setting:
python eval_youcook.py --batch_size=16 --num_thread_reader=20 --num_windows_test=10 \
--eval_video_root=path_to_the_youcook_videos --pretrain_cnn_path=the_path_to_the_checkpoint
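For reference, the recall and median-rank numbers above can be computed from a clip-to-caption similarity matrix roughly like this (a sketch of the standard retrieval metrics, not the repository's actual eval_youcook.py logic; ground truth is assumed to lie on the diagonal):

```python
import numpy as np


def retrieval_metrics(sim):
    """sim[i, j] is the similarity of query i to candidate j;
    the correct candidate for query i is assumed to be candidate i."""
    # Candidate indices sorted by decreasing similarity, per query.
    order = np.argsort(-sim, axis=1)
    # 1-indexed rank of the correct candidate for each query.
    ranks = np.where(order == np.arange(sim.shape[0])[:, None])[1] + 1
    return {
        "R@1": float(np.mean(ranks <= 1)) * 100,
        "R@5": float(np.mean(ranks <= 5)) * 100,
        "R@10": float(np.mean(ranks <= 10)) * 100,
        "MedianR": float(np.median(ranks)),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.standard_normal((100, 100))
    sim += 2.0 * np.eye(100)  # boost similarity of the correct pairs
    print(retrieval_metrics(sim))
```

With a fixed similarity matrix, any gap versus the reported numbers must come from the inputs (frames, sampling, missing videos), not from the metric computation itself.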
I tested your model on YouCookII with this protocol (4x32 contiguous frames at 10 FPS). I extracted images from the videos in two ways:
ffmpeg -y -i <INPUT.mp4> -loglevel quiet -vf scale=<W>:<H> frame-%06d.jpg
ffmpeg -y -i <INPUT.mp4> -qscale:v 2 -loglevel quiet -vf scale=<W>:<H> frame-%06d.jpg
The first one compresses the output JPGs; the second one saves JPGs at the best quality.
[Example frames at settings 1, 2, and -qscale:v 31 (poorest quality).] Please ignore the H/W ratio; in testing I use the correct H/W ratio. The difference between 1 and 2 is small.
Note: some videos from YouCookII are unavailable today, so I tested only on available videos.
Despite the small difference between 1 and 2, the difference in test results is significant. It may be that some of the intersection between YouCookII and HowTo100M wasn't filtered out, and the network memorized videos from this intersection.
My question is: are you sure the intersection between YouCookII and HowTo100M was completely removed from the training dataset? Could you post in this thread the YouTube video ids that were used for training (or those that were thrown away)? I want to double-check the intersection.