antoine77340 / S3D_HowTo100M

S3D Text-Video model trained on HowTo100M using MIL-NCE
Apache License 2.0

Can't reproduce results for YouCookII #2

Closed dzabraev closed 4 years ago

dzabraev commented 4 years ago

I took this model:

wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_howto100m.pth
wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_dict.npy

and the code from this repository, took the validation split of YouCookII, and tried to reproduce the numbers reported in the article End-to-End Learning of Visual Representations from Uncurated Instructional Videos (screenshot below).
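For reference, here is a minimal sketch of how the checkpoint can be loaded, following the repository README (it assumes s3dg.py from this repo; the exact output dictionary keys may differ):

```python
import torch
from s3dg import S3D  # model definition from this repository

# Build S3D with the released word dictionary and load the HowTo100M weights
net = S3D('s3d_dict.npy', 512)
net.load_state_dict(torch.load('s3d_howto100m.pth'))
net.eval()  # evaluation mode

# Video input: B x 3 x T x H x W, RGB values in [0, 1]
video = torch.rand(1, 3, 32, 224, 224)
with torch.no_grad():
    video_emb = net(video)['video_embedding']                              # 1 x 512
    text_emb = net.text_module(['how to cut a tomato'])['text_embedding']  # 1 x 512
```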

[screenshot: YouCook2 retrieval results table from the paper]

It is unclear which protocol you used for testing. In the table below I show several experiments, and none of them reaches your results. Could you clarify which test protocol you used? It would also be great if you published the evaluation script.

What I tried:

| T (s) | imgsize | pooling | normalize | num frames | num resample | R@1 | R@5 | R@10 | MedR |
|------:|--------:|---------|-----------|-----------:|-------------:|-----:|-----:|------:|-----:|
| 250 | 200 | max | False | 32 | 1 | 11.478 | 27.610 | 37.453 | 21 |
| 250 | 224 | max | False | 32 | 1 | 8.774 | 22.044 | 30.975 | 32 |
| 250 | 256 | max | False | 32 | 1 | 5.912 | 15.503 | 21.038 | 104 |
| 1.5 | 200 | max | False | 32 | 1 | 8.333 | 23.208 | 31.981 | 31 |
| 3.2 | 200 | max | False | 32 | 1 | 9.497 | 24.969 | 34.654 | 24 |
| 8 | 200 | max | False | 32 | 1 | 10.094 | 25.818 | 35.849 | 23 |
| 16 | 200 | max | False | 32 | 1 | 10.755 | 26.478 | 36.541 | 21 |
| 32 | 200 | max | False | 32 | 1 | 11.164 | 27.484 | 37.296 | 21 |
| 64 | 200 | max | False | 32 | 1 | 11.415 | 27.704 | 37.547 | 21 |
| 128 | 200 | max | False | 32 | 1 | 11.447 | 27.610 | 37.453 | 21 |
| 250 | 200 | max | True | 32 | 1 | 9.906 | 25.031 | 34.748 | 25 |
| 250 | 200 | max | False | 32 | 2 | 11.604 | 28.270 | 37.987 | 20 |
| 250 | 200 | max | False | 32 | 3 | 11.918 | 28.396 | 38.333 | 21 |
| 250 | 200 | max | False | 32 | 4 | 11.509 | 28.082 | 38.365 | 21 |
| 250 | 200 | max | False | 32 | LCR | 11.384 | 27.138 | 37.704 | 22 |
| 250 | 200 | mean | False | 32 | 4 | 12.075 | 28.805 | 38.459 | 20 |
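(For reference, R@K and MedR above follow the usual text-to-video retrieval definitions. A minimal sketch of how they can be computed, assuming row i of the text embeddings pairs with row i of the video embeddings; the thread does not include the actual evaluation script:)

```python
import numpy as np

def retrieval_metrics(text_emb: np.ndarray, video_emb: np.ndarray) -> dict:
    """Text-to-video retrieval metrics for two N x D embedding matrices."""
    sims = text_emb @ video_emb.T                # N x N similarities (plain dot product)
    order = np.argsort(-sims, axis=1)            # clips sorted by decreasing similarity
    # 0-based rank of the ground-truth clip for each caption
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(len(sims))])
    metrics = {f"R@{k}": 100.0 * np.mean(ranks < k) for k in (1, 5, 10)}
    metrics["MedR"] = float(np.median(ranks)) + 1  # median rank, 1-based
    return metrics
```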
antoine77340 commented 4 years ago

Hi,

Thank you for your comment. Here is how we did it: each test video is sampled at 10 fps and rescaled so that min(height, width) = 224. For each YouCook2 video clip, we sample 5 linearly spaced 32-frame clips (so each clip spans 3.2 seconds), center-crop each to 224x224, and compute a video embedding of size 512 for each of them. We then average-pool the five embeddings. Finally, there was no normalization. I guess the main difference from what you are doing is that you are uniformly sampling the 32 frames over the whole video, right? Or are the 32 frames always consecutive?

Also, please make sure to put the model in eval mode; otherwise the batch-norm statistics will be recomputed over your evaluation batches.

One thing to note is that this PyTorch model is a port of the official TensorFlow release from https://tfhub.dev/deepmind/mil-nce/s3d/1. I converted the weights to PyTorch and ran a benchmark on CrossTask to check that the numbers were similar, but I did not check on YouCook2. If you still run into problems, please let me know and I will check YouCook2 myself.
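For concreteness, a minimal sketch of this protocol (it assumes frames are already decoded at 10 fps, rescaled, and center-cropped, and that the repo's S3D forward returns a 'video_embedding' entry):

```python
import torch

@torch.no_grad()
def embed_youcook_clip(net, frames):
    """frames: T x 3 x 224 x 224 tensor in [0, 1], decoded at 10 fps,
    rescaled so min(H, W) = 224 and center-cropped. Assumes T >= 32."""
    net.eval()  # freeze batch-norm statistics (see note above)
    T = frames.shape[0]
    # 5 linearly spaced 32-frame windows; at 10 fps each spans 3.2 s
    starts = torch.linspace(0, T - 32, steps=5).long()
    embeddings = []
    for s in starts:
        clip = frames[s:s + 32].permute(1, 0, 2, 3).unsqueeze(0)  # 1 x 3 x 32 x 224 x 224
        embeddings.append(net(clip)["video_embedding"])           # 1 x 512
    # average-pool the five clip embeddings; no L2 normalization
    return torch.cat(embeddings, dim=0).mean(dim=0)
```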

dzabraev commented 4 years ago

Do you L2-normalize the text and video embeddings before average-pooling? If not, is it OK that the text embedding has an L2 norm of ~175 while the video embedding's is ~0.25?

antoine77340 commented 4 years ago

No normalization is needed. I managed to rerun the YouCook2 evaluation using this PyTorch model in a new codebase (different from my codebase at DeepMind) and with a validation set slightly larger than the one I had at DeepMind, and got 49.5 R@10. I assume there is a problem in how you sample the 32-frame video clips. Are they always 32 contiguous frames? If not, you might have issues if you randomly sample 32 frames within a large clip of 250 seconds.

dzabraev commented 4 years ago

Thank you for the explanation. I succeeded in reproducing the numbers from the article. The main culprit was JPEG compression: by default, ffmpeg applies lossy JPEG compression when unpacking a video into frames. I disabled the compression and got the expected numbers.

xiangyh9988 commented 2 years ago

> I disabled the compression and got the expected numbers

Hi, sorry to bother you. Could you please share how to disable the JPEG compression in ffmpeg-python? I searched through its arguments but couldn't find how to disable it.

dzabraev commented 2 years ago
1. Add -q:v 1 to the ffmpeg arguments (this sets the highest JPEG quality).
2. Alternatively, unpack the video to BMP. BMP is a lossless format, so it gives the best possible quality, but each image file will be large. (Example commands below.)
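For example (hypothetical file names; -q:v 1 keeps JPEG at its highest quality, BMP avoids lossy compression entirely):

```sh
# extract frames at 10 fps as highest-quality JPEGs
ffmpeg -i video.mp4 -vf fps=10 -q:v 1 frames/%06d.jpg

# extract frames at 10 fps as lossless BMPs (much larger files)
ffmpeg -i video.mp4 -vf fps=10 frames/%06d.bmp
```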
xiangyh9988 commented 2 years ago

Got it, thank you. After seeing your other issue, I realized that you used the ffmpeg command line to unpack videos. I had misunderstood and thought the compression needed to be disabled in ffmpeg-python. My bad.