Dramatically accuracy drop with JPG compression

dzabraev commented 4 years ago

I tested your model on YouCookII with this protocol (4x32 contiguous frames at 10 FPS). I extracted images from the videos in two ways:

ffmpeg -y -i <INPUT.mp4> -loglevel quiet -vf scale=<W>:<H> frame-%06d.jpg
ffmpeg -y -i <INPUT.mp4> -qscale:v 2 -loglevel quiet -vf scale=<W>:<H> frame-%06d.jpg

The first command compresses the output JPGs (ffmpeg's default quality); the second saves the JPGs at the best quality.
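
For anyone who wants to reproduce both extraction variants from a script, here is a minimal Python sketch that assembles the same two ffmpeg invocations. The function names `build_extract_cmd` and `extract_frames` are hypothetical helpers introduced only for illustration; the ffmpeg flags are exactly the ones in the two commands above, and the width, height, and paths are left to the caller.

```python
import subprocess


def build_extract_cmd(video_path, out_pattern, width, height, qscale=None):
    """Return the ffmpeg argument list for extracting scaled JPG frames.

    qscale=None mirrors the first command (ffmpeg's default JPG quality);
    qscale=2 mirrors the second command (-qscale:v 2).
    Hypothetical helper, not part of the S3D_HowTo100M repository.
    """
    cmd = ["ffmpeg", "-y", "-i", video_path]
    if qscale is not None:
        cmd += ["-qscale:v", str(qscale)]
    cmd += ["-loglevel", "quiet", "-vf", f"scale={width}:{height}", out_pattern]
    return cmd


def extract_frames(video_path, out_pattern, width, height, qscale=None):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    cmd = build_extract_cmd(video_path, out_pattern, width, height, qscale)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names. Calling `extract_frames` once with `qscale=None` and once with `qscale=2` into separate output directories gives the two frame sets compared in this issue.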

Below are example frames for commands 1 and 2 and for -qscale:v 31 (poorest quality). Please ignore the H/W ratio in these images; in testing I used the correct H/W ratio.

[Three example frames: command 1, command 2, and -qscale:v 31]

The difference between 1 and 2 is small.

source	R@1	R@5	R@10	MedR
results in article	15.1	38	51.2	10
my retest. ffmpeg best quality	15.975	38.208	50.126	10
my retest. ffmpeg default quality	10.629	27.201	7.925	20

Note: some videos from YouCookII are unavailable today, so I tested only on available videos.
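
For readers unfamiliar with the columns in the table, the sketch below shows one common way to compute R@K and median rank (MedR) from a text-to-video similarity matrix. It assumes that query i's correct clip is clip i (the diagonal) and that a higher score means a better match; both are assumptions for illustration, and `retrieval_metrics` is a hypothetical name, not the evaluation code from the repository.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent) and median rank from a square score matrix.

    sim[i][j] is the similarity between text query i and video clip j.
    Assumption: clip i is the ground-truth match for query i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the correct clip: 1 + number of clips scored strictly higher.
        ranks.append(1 + sum(1 for score in row if score > row[i]))

    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["MedR"] = median(ranks)
    return metrics
```

Higher R@K is better and lower MedR is better, so the default-quality row in the table is worse than the best-quality row on R@1, R@5, and MedR.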

Despite the small visual difference between 1 and 2, the difference in test results is significant. This may be because some intersection between YouCookII and HowTo100M wasn't filtered out, and the network learned some videos from this intersection.

My question is: are you sure that the intersection between YouCookII and HowTo100M was completely removed from the training dataset? Could you post in this thread the YouTube video ids that were used for training (or those that were thrown away)? I want to double-check the intersection.

antoine77340 commented 4 years ago

Hi,

I am sure the YouCook2 test videos were removed from HowTo100M training videos. You can check it yourself by downloading the caption files from: https://www.di.ens.fr/willow/research/howto100m/ and checking the video ids of the HowTo100M dataset.
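
The check suggested above can be sketched in a few lines of Python. This assumes the video ids have already been extracted into plain text files with one id per line; the caption file format is not described in this thread, so that extraction step is left out. The names `load_ids` and `overlapping_ids` are hypothetical and used only for illustration.

```python
def load_ids(path):
    """Read one YouTube video id per line, skipping blank lines.

    Assumption: the ids were already exported to a plain text file.
    """
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def overlapping_ids(train_ids, test_ids):
    """Return the sorted video ids that appear in both collections."""
    return sorted(set(train_ids) & set(test_ids))
```

An empty result from `overlapping_ids` would indicate that none of the listed test ids appear among the listed training ids.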

I can understand why performance drops between the low-quality and the standard-quality decoding. CNNs are known to be highly sensitive to texture rather than global shape; see https://openreview.net/forum?id=Bygh9j09KX (or https://arxiv.org/pdf/1604.04004.pdf). That is also why they are so vulnerable to adversarial perturbations. The difference I see between the good-quality and the bad-quality images is significant in terms of texture, hence the difference in performance.

By the way, as I understand it, ffmpeg's default quality is the best one, not the compressed one.

antoine77340 commented 4 years ago

Another thing to note: I think qscale does not control the overall JPEG compression level, but rather the variable bitrate of the video compression, frame by frame. This means the compression level varies from frame to frame, so looking at a single frame is not very informative, especially if there is very little motion.

fake-warrior8 commented 2 years ago

I ran your source code with the checkpoint you released and the YouCook2 dataset you provided. My evaluation results on YouCook2 are R@1=12.68, R@5=34.01, R@10=46.90, MedR=12, which is much worse than the reported results of R@1=15.1, R@5=38.0, R@10=51.2, MedR=10. I used the default settings:

python eval_youcook.py --batch_size=16  --num_thread_reader=20 --num_windows_test=10 \
        --eval_video_root=path_to_the_youcook_videos --pretrain_cnn_path=the_path_to_the_checkpoint

antoine77340 / S3D_HowTo100M

Dramatically accuracy drop with JPG compression #3