Hi,
Thank you for your comment, here is how we did it: each testing video is sampled at 10 fps and rescaled so that min(height, width) = 224. For each YouCook2 video clip, we sample 5 linearly spaced 32-frame clips (so each clip is 3.2 seconds long), center-crop them to 224x224, and compute a video embedding of size 512 for each of them. We then average-pool the embeddings. Finally, there was no normalization. I guess the main difference with what you are doing is that you are uniformly sampling the 32 frames over the whole video, right? Or are the sampled 32 frames always subsequent frames?
Also, please make sure to put the model in eval mode, otherwise the batch-norm running statistics will be updated on your evaluation batches.
One thing to note is that this PyTorch model is a port of the official TensorFlow release model from https://tfhub.dev/deepmind/mil-nce/s3d/1. I converted the weights to PyTorch and ran a benchmark on CrossTask to check that the numbers were similar, but I did not check on YouCook2. If you still happen to have any problem, please let me know and I will check myself on YouCook2.
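To make the sampling scheme concrete, here is a rough, untested sketch of it. It is not the exact evaluation script; the `video_embedding` output key and the preprocessing details are assumptions, so adjust them to whatever the S3D wrapper in this repo actually returns.

```python
# Sketch only: `model` is the S3D text-video model from this repo (already loaded),
# `frames` is a uint8 numpy array of shape (num_frames, H, W, 3) decoded at 10 fps
# with min(H, W) = 224.
import numpy as np
import torch

def center_crop(frames, size=224):
    # Crop the spatial center of every frame to size x size.
    h, w = frames.shape[1], frames.shape[2]
    top, left = (h - size) // 2, (w - size) // 2
    return frames[:, top:top + size, left:left + size, :]

def video_embedding(model, frames, num_clips=5, clip_len=32):
    model.eval()  # keep batch-norm statistics frozen, as noted above
    frames = center_crop(frames)
    # 5 linearly spaced start indices; each clip is 32 contiguous frames (3.2 s at 10 fps).
    starts = np.linspace(0, max(len(frames) - clip_len, 0), num_clips).astype(int)
    clips = np.stack([frames[s:s + clip_len] for s in starts])  # (5, 32, 224, 224, 3)
    video = torch.from_numpy(clips).float() / 255.0
    video = video.permute(0, 4, 1, 2, 3)  # (batch, channels, time, height, width)
    with torch.no_grad():
        out = model(video)
        emb = out["video_embedding"]  # (5, 512); key name assumed here
    return emb.mean(dim=0)  # average-pool over the 5 clips, no normalization
```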
Do you L2-normalize the text embeddings and video embeddings before avg-pooling? If not, is it ok that the text embedding has L2-norm ~175 and the video embedding ~0.25?
No normalization is needed. I managed to rerun the YouCook2 evaluation using this PyTorch model with new code (different from my codebase at Deepmind) and with a validation set slightly larger than the one I had at Deepmind, and got 49.5 in R@10. I assume there is a problem in how you sample the video clips of 32 frames. Are they always 32 contiguous frames? If not, you might have issues if you randomly sample 32 frames within a large clip of 250 seconds.
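For reference, R@10 here is the standard text-to-video retrieval recall. A minimal sketch of how it can be computed from the pooled embeddings (again, not the exact evaluation code, just an illustration):

```python
import torch

def recall_at_k(text_emb, video_emb, k=10):
    # text_emb, video_emb: (num_clips, 512); row i of both tensors describes the same clip.
    sim = text_emb @ video_emb.t()                  # similarity matrix, no L2 normalization
    ranks = sim.argsort(dim=1, descending=True)     # video indices sorted by similarity per caption
    targets = torch.arange(sim.size(0)).unsqueeze(1)
    hits = (ranks[:, :k] == targets).any(dim=1)     # is the matching video in the top-k?
    return hits.float().mean().item()
```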
Thank you for the explanation. I managed to reproduce the numbers from the article. The main reason was JPG compression: by default, ffmpeg applies JPG compression when unpacking a video to images. I disabled the compression and got the expected numbers.
> I disabled the compression and got the expected numbers.
Hi bro, sorry to bother you. Could you please share how to disable the JPG compression in ffmpeg-python? I tried to search for the arguments but didn't find how to disable it.
Got it, thank you. After seeing your other issue, I see that you used the ffmpeg command line to unpack videos. I had misunderstood and thought the compression needed to be disabled in ffmpeg-python. My bad.
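For anyone hitting the same thing later: here is a small, untested sketch of dumping frames as PNG (so no JPEG compression) with ffmpeg-python. The fps value and output pattern are just examples, not the exact command used above.

```python
# Sketch: extract frames as PNG (lossless) rather than JPEG, using ffmpeg-python.
import os
import ffmpeg

def extract_frames(video_path, out_dir, fps=10):
    os.makedirs(out_dir, exist_ok=True)
    (
        ffmpeg
        .input(video_path)
        .filter("fps", fps=fps)                       # resample to 10 fps, as in the protocol above
        .output(os.path.join(out_dir, "%06d.png"))    # .png output avoids JPEG compression artifacts
        .run(quiet=True)
    )
```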
I took this model and the code from this repository, took the validation part of YouCookII, and tried to reproduce the numbers reported in the article End-to-End Learning of Visual Representations from Uncurated Instructional Videos.
It is unclear which protocol you used for testing. In the following table I show several experiments, and none of them reaches your results. Could you clarify which test protocol you used? It would be good if you published a testing script.
What I tried:

- T: time in seconds. I split each clip into subclips, each of length T seconds, and compute an embedding for each subclip.
- pooling: if a clip was split into more than one subclip, the subclip embeddings are combined into a single one according to this pooling.
- imgsz: the short side of each source video is rescaled to imgsz, preserving the h:w ratio; then a center crop is taken from each frame.
- normalize: whether or not the sentence embedding and each video embedding were L2-normalized before the dot product.
- num frames: from each T-second subclip, num frames frames are taken uniformly (see the sketch after this list).
- num resample: for each clip, sample num resample different sets of frames and compute an embedding for each resample; all embeddings are then pooled into a single one with pooling. LCR means sampling from each clip 3 times: num frames left crops, num frames right crops, num frames center crops.
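To be explicit about what "uniformly" means here versus contiguous frames, this is a small illustrative sketch of the two sampling strategies, assuming the subclip was already decoded into a list of frames at 10 fps:

```python
import numpy as np

def sample_uniform(frames, num_frames=32):
    # What I did: spread num_frames indices uniformly over the whole T-second subclip.
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    return [frames[i] for i in idx]

def sample_contiguous(frames, num_frames=32, start=0):
    # Alternative: num_frames consecutive frames (3.2 s at 10 fps).
    return frames[start:start + num_frames]
```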