antoine77340 / S3D_HowTo100M

S3D Text-Video model trained on HowTo100M using MIL-NCE
Apache License 2.0

Can't reproduce results for YouCookII #2

Closed dzabraev closed 4 years ago

dzabraev commented 4 years ago

I took this model:

wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_howto100m.pth
wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_dict.npy

and the code from this repository, took the validation split of YouCookII, and tried to reproduce the numbers reported in the article End-to-End Learning of Visual Representations from Uncurated Instructional Videos (screenshot below).
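For reference, here is a minimal sketch of how the checkpoint can be loaded, following the repository README (it assumes s3dg.py from this repo; the exact output dictionary keys may differ):

```python
import torch
from s3dg import S3D  # model definition from this repository

# Build S3D with the released word dictionary and load the HowTo100M weights
net = S3D('s3d_dict.npy', 512)
net.load_state_dict(torch.load('s3d_howto100m.pth'))
net.eval()  # evaluation mode

# Video input: B x 3 x T x H x W, RGB values in [0, 1]
video = torch.rand(1, 3, 32, 224, 224)
with torch.no_grad():
    video_emb = net(video)['video_embedding']                              # 1 x 512
    text_emb = net.text_module(['how to cut a tomato'])['text_embedding']  # 1 x 512
```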

[screenshot: YouCook2 retrieval results table from the paper]

It is unclear which protocol you used for testing. In the table below I show several experiments, and none of them reaches your results. Could you clarify which test protocol you used? It would also be great if you published the evaluation script.

What I tried:

| T (s) | imgsize | pooling | normalize | num frames | num resample | R@1 | R@5 | R@10 | MedR |
|------:|--------:|---------|-----------|-----------:|-------------:|-----:|-----:|------:|-----:|
| 250 | 200 | max | False | 32 | 1 | 11.478 | 27.610 | 37.453 | 21 |
| 250 | 224 | max | False | 32 | 1 | 8.774 | 22.044 | 30.975 | 32 |
| 250 | 256 | max | False | 32 | 1 | 5.912 | 15.503 | 21.038 | 104 |
| 1.5 | 200 | max | False | 32 | 1 | 8.333 | 23.208 | 31.981 | 31 |
| 3.2 | 200 | max | False | 32 | 1 | 9.497 | 24.969 | 34.654 | 24 |
| 8 | 200 | max | False | 32 | 1 | 10.094 | 25.818 | 35.849 | 23 |
| 16 | 200 | max | False | 32 | 1 | 10.755 | 26.478 | 36.541 | 21 |
| 32 | 200 | max | False | 32 | 1 | 11.164 | 27.484 | 37.296 | 21 |
| 64 | 200 | max | False | 32 | 1 | 11.415 | 27.704 | 37.547 | 21 |
| 128 | 200 | max | False | 32 | 1 | 11.447 | 27.610 | 37.453 | 21 |
| 250 | 200 | max | True | 32 | 1 | 9.906 | 25.031 | 34.748 | 25 |
| 250 | 200 | max | False | 32 | 2 | 11.604 | 28.270 | 37.987 | 20 |
| 250 | 200 | max | False | 32 | 3 | 11.918 | 28.396 | 38.333 | 21 |
| 250 | 200 | max | False | 32 | 4 | 11.509 | 28.082 | 38.365 | 21 |
| 250 | 200 | max | False | 32 | LCR | 11.384 | 27.138 | 37.704 | 22 |
| 250 | 200 | mean | False | 32 | 4 | 12.075 | 28.805 | 38.459 | 20 |
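(For reference, R@K and MedR above follow the usual text-to-video retrieval definitions. A minimal sketch of how they can be computed, assuming row i of the text embeddings pairs with row i of the video embeddings; the thread does not include the actual evaluation script:)

```python
import numpy as np

def retrieval_metrics(text_emb: np.ndarray, video_emb: np.ndarray) -> dict:
    """Text-to-video retrieval metrics for two N x D embedding matrices."""
    sims = text_emb @ video_emb.T                # N x N similarities (plain dot product)
    order = np.argsort(-sims, axis=1)            # clips sorted by decreasing similarity
    # 0-based rank of the ground-truth clip for each caption
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(len(sims))])
    metrics = {f"R@{k}": 100.0 * np.mean(ranks < k) for k in (1, 5, 10)}
    metrics["MedR"] = float(np.median(ranks)) + 1  # median rank, 1-based
    return metrics
```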
antoine77340 commented 4 years ago

Hi,

Thank you for your comment. Here is how we did it: each test video is sampled at 10 fps and rescaled so that min(height, width) = 224. For each YouCook2 video clip, we sample 5 linearly spaced 32-frame clips (so each clip spans 3.2 seconds), center-crop each to 224x224, and compute a video embedding of size 512 for each of them. We then average-pool the five embeddings. Finally, there was no normalization. I guess the main difference from what you are doing is that you are uniformly sampling the 32 frames over the whole video, right? Or are the 32 frames always consecutive?

Also, please make sure to put the model in eval mode; otherwise the batch-norm statistics will be recomputed over your evaluation batches.

One thing to note is that this PyTorch model is a port of the official TensorFlow release from https://tfhub.dev/deepmind/mil-nce/s3d/1. I converted the weights to PyTorch and ran a benchmark on CrossTask to check that the numbers were similar, but I did not check on YouCook2. If you still run into problems, please let me know and I will check YouCook2 myself.
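For concreteness, a minimal sketch of this protocol (it assumes frames are already decoded at 10 fps, rescaled, and center-cropped, and that the repo's S3D forward returns a 'video_embedding' entry):

```python
import torch

@torch.no_grad()
def embed_youcook_clip(net, frames):
    """frames: T x 3 x 224 x 224 tensor in [0, 1], decoded at 10 fps,
    rescaled so min(H, W) = 224 and center-cropped. Assumes T >= 32."""
    net.eval()  # freeze batch-norm statistics (see note above)
    T = frames.shape[0]
    # 5 linearly spaced 32-frame windows; at 10 fps each spans 3.2 s
    starts = torch.linspace(0, T - 32, steps=5).long()
    embeddings = []
    for s in starts:
        clip = frames[s:s + 32].permute(1, 0, 2, 3).unsqueeze(0)  # 1 x 3 x 32 x 224 x 224
        embeddings.append(net(clip)["video_embedding"])           # 1 x 512
    # average-pool the five clip embeddings; no L2 normalization
    return torch.cat(embeddings, dim=0).mean(dim=0)
```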

dzabraev commented 4 years ago

Do you L2-normalize the text and video embeddings before average-pooling? If not, is it OK that the text embedding has an L2 norm of ~175 while the video embedding's is ~0.25?

antoine77340 commented 4 years ago

No normalization is needed. I managed to rerun the YouCook2 evaluation using this PyTorch model in a new codebase (different from my codebase at DeepMind) and with a validation set slightly larger than the one I had at DeepMind, and got 49.5 R@10. I assume there is a problem in how you sample the 32-frame video clips. Are they always 32 contiguous frames? If not, you might have issues if you randomly sample 32 frames within a large clip of 250 seconds.

dzabraev commented 4 years ago

Thank you for the explanation. I succeeded in reproducing the numbers from the article. The main culprit was JPEG compression: by default, ffmpeg applies lossy JPEG compression when unpacking a video into frames. I disabled the compression and got the expected numbers.

xiangyh9988 commented 2 years ago

> I disabled the compression and got the expected numbers

Hi, sorry to bother you. Could you please share how to disable the JPEG compression in ffmpeg-python? I searched through its arguments but couldn't find how to disable it.

dzabraev commented 2 years ago
1. Add -q:v 1 to the ffmpeg arguments (this sets the highest JPEG quality).
2. Alternatively, unpack the video to BMP. BMP is a lossless format, so it gives the best possible quality, but each image file will be large. (Example commands below.)
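For example (hypothetical file names; -q:v 1 keeps JPEG at its highest quality, BMP avoids lossy compression entirely):

```sh
# extract frames at 10 fps as highest-quality JPEGs
ffmpeg -i video.mp4 -vf fps=10 -q:v 1 frames/%06d.jpg

# extract frames at 10 fps as lossless BMPs (much larger files)
ffmpeg -i video.mp4 -vf fps=10 frames/%06d.bmp
```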
xiangyh9988 commented 2 years ago

Got it, thank you. After seeing your other issue, I realized that you used the ffmpeg command line to unpack videos. I had misunderstood and thought the compression needed to be disabled in ffmpeg-python. My bad.