arunos728 / MotionSqueeze

Official PyTorch Implementation of MotionSqueeze, ECCV 2020
BSD 2-Clause "Simplified" License

How to understand 'clip'? #16

Closed litingfeng closed 3 years ago

litingfeng commented 3 years ago

Hi,

Thanks for the code sharing of this great work.

I have some questions regarding sec. 4.2, during inference,

Given a video, we sample a clip and test its center crop. For Something-Something V1&V2, we evaluate both the single clip prediction and the average prediction of 10 randomly-sampled clips.

May I ask:

  1. What is a 'clip' here? Is it a set of frames that represents the whole video? How many frames does one clip contain?
  2. For the Something-Something dataset, how did you sample the single clip? Was it also randomly sampled, as for the 10-clip average prediction?
  3. Why did you use different sampling strategies for Something-Something (random) and Kinetics & HMDB-51 (uniform)? What are the advantages and disadvantages of each?

Your reply would be greatly appreciated.

arunos728 commented 3 years ago
  1. Yes, a clip is a set of sampled frames that represents the video. We sample 8 or 16 frames per video for the Something-Something & Kinetics datasets.
  2. We use the segment-based sampling strategy (Temporal Segment Networks, 2016) for Something-Something, and the uniform sampling strategy (Non-local Neural Networks, 2018) for Kinetics. We use only a single clip for Something-Something.
  3. These sampling strategies are the conventional experimental setups for Something-Something & Kinetics. Videos in Something-Something (avg. 4 seconds) are considerably shorter than those in Kinetics (avg. 10 seconds), so many approaches use segment-based sampling for Something-Something, since it covers the whole video length.
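To make the difference between the two strategies concrete, here is a minimal sketch of segment-based (TSN-style) vs. dense/uniform (non-local-style) frame-index sampling. This is an illustration, not code from this repository; the function names, stride, and defaults are assumptions.

```python
import numpy as np

def segment_sample(num_video_frames, num_segments, random_shift=True):
    """TSN-style segment-based sampling: split the video into
    `num_segments` equal segments and pick one frame per segment.
    A random offset per segment at training time; the segment
    center at test time. The sampled clip spans the whole video."""
    seg_len = num_video_frames / num_segments
    if random_shift:
        offsets = np.random.randint(0, max(int(seg_len), 1), size=num_segments)
    else:
        offsets = np.full(num_segments, seg_len / 2)
    indices = (np.arange(num_segments) * seg_len + offsets).astype(int)
    return np.clip(indices, 0, num_video_frames - 1)

def dense_sample(num_video_frames, clip_len, stride=2, start=None):
    """Non-local-style dense sampling: take `clip_len` frames at a
    fixed temporal stride from a (random) start position, so the
    clip covers only a short contiguous window of the video."""
    if start is None:
        start = np.random.randint(0, max(num_video_frames - clip_len * stride, 1))
    indices = start + np.arange(clip_len) * stride
    return np.clip(indices, 0, num_video_frames - 1)

# Example: an 8-frame clip from a 100-frame video.
print(segment_sample(100, 8, random_shift=False))  # spread over the whole video
print(dense_sample(100, 8, stride=2, start=10))    # a short contiguous window
```

Averaging over 10 randomly-sampled clips, as in the quoted inference protocol, would simply call one of these samplers 10 times and average the resulting predictions.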