I have some questions about inference as described in Sec. 4.2:
Given a video, we sample a clip and test its center crop. For Something-Something V1&V2, we evaluate both the single clip prediction and the average prediction of 10 randomly-sampled clips.
May I ask:
What is a 'clip' here? Is it a set of frames that can be considered to represent the whole video? How many frames does one clip contain?
For the Something-Something dataset, how did you sample the single clip? Is it randomly sampled, as the clips for the 10-clip average prediction are?
Why did you use different sampling strategies for Something-Something (random) and Kinetics & HMDB-51 (uniform)? What are the advantages and disadvantages of each?
Yes, a clip is a set of frames sampled from the video. We sample 8 or 16 frames per video for the Something-Something and Kinetics datasets.
We use the segment-based sampling strategy (Temporal Segment Networks, 2016) for Something-Something, and the uniform sampling strategy (Non-local Neural Networks, 2018) for Kinetics. We use only a single clip for Something-Something.
These sampling strategies are the conventional experimental setups for Something-Something and Kinetics. Videos in Something-Something (avg. 4 seconds) are considerably shorter than those in Kinetics (avg. 10 seconds), so many approaches use segment-based sampling for it, which covers the whole video length.
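To make the difference between the two strategies concrete, here is a minimal sketch of how frame indices could be chosen in each case. It is an illustration only, not the repository's actual data loader: the function names `segment_indices` and `dense_clip_indices`, the stride of 8, and the example frame counts are all assumptions made for this example.

```python
import random


def segment_indices(num_frames, num_segments=8, train=False, rng=None):
    """Segment-based sampling: split the video into equal segments and take
    one frame from each, so the clip spans the whole video.
    (Hypothetical helper; random pick for training, center frame otherwise.)"""
    rng = rng or random
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))  # exclusive, never empty
        if train:
            indices.append(rng.randrange(start, end))
        else:
            indices.append((start + end - 1) // 2)
    # Guard against videos shorter than the number of segments.
    return [min(i, num_frames - 1) for i in indices]


def dense_clip_indices(num_frames, clip_len=8, stride=8, clip_id=0, num_clips=10):
    """Uniform multi-clip sampling: clip start points are spaced evenly along
    the video, and each clip takes clip_len frames at a fixed stride.
    (Hypothetical helper; the stride value is an assumption.)"""
    span = (clip_len - 1) * stride + 1  # frames covered by one clip
    max_start = max(num_frames - span, 0)
    start = 0 if num_clips == 1 else round(clip_id * max_start / (num_clips - 1))
    return [min(start + j * stride, num_frames - 1) for j in range(clip_len)]


# A short video (96 frames here) is covered end to end by a single clip.
print(segment_indices(96, num_segments=8))
# A longer video (300 frames here) is covered by several evenly spaced clips.
print(dense_clip_indices(300, clip_id=0))
print(dense_clip_indices(300, clip_id=9))
```

With segment-based sampling a single clip already touches every part of a short video, whereas a dense clip sees only a local window of a longer video, which is why multi-clip evaluation averages predictions over several windows.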
Hi,
Thanks for sharing the code for this great work.
Your reply would be greatly appreciated.