facebookresearch / TimeSformer

The official pytorch implementation of our paper "Is Space-Time Attention All You Need for Video Understanding?"

input clips #86

Open W4ngH4o opened 3 years ago

W4ngH4o commented 3 years ago

I need help as a beginner, please! In the paper, what does 'frames sampled at a rate of 1/32' mean? And in the code, what is the relationship between sample_rate and target_fps? Why not apply the TSN (Temporal Segment Networks) sampling strategy?

Thanks for your help.

gberta commented 2 years ago

'Frames sampled at a rate of 1/32' means that if there are 32 frames in the video, we will only sample one frame. Similarly, if there are 64 frames, we will sample 2 frames with a 32-frame gap between them. You can extend this pattern to any number of frames.
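A minimal sketch of the rule described above, not taken from the TimeSformer code: the function name `sampled_frame_indices` is hypothetical, and it assumes sampling starts at frame 0 (the actual loader may pick a different start offset).

```python
from typing import List


def sampled_frame_indices(total_frames: int, gap: int = 32) -> List[int]:
    """Return the indices kept when one frame is taken every `gap` frames.

    Assumption: sampling starts at index 0. This is illustrative only.
    """
    if gap <= 0:
        raise ValueError("gap must be a positive number of frames")
    return list(range(0, total_frames, gap))


print(sampled_frame_indices(32))        # one frame kept
print(sampled_frame_indices(64))        # two frames, 32 frames apart
print(len(sampled_frame_indices(256)))  # number of frames kept from 256
```

Under these assumptions, a 256-frame video yields 8 sampled frames, which is how an 8-frame clip and a 1/32 rate fit together.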

Target FPS just means the FPS that will be used to decode the raw video. sample_rate indicates the gap (in number of frames) between adjacent sampled frames. Hope this helps.
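To make the relationship concrete, here is a rough sketch of the arithmetic implied by these two definitions. The helper `clip_window` is hypothetical, and the value 30 for target_fps is only an example, not a confirmed default of the repository.

```python
from typing import Tuple


def clip_window(num_frames: int, sample_rate: int, target_fps: float) -> Tuple[int, float]:
    """Estimate the window of decoded frames a clip is drawn from.

    num_frames frames spaced sample_rate frames apart are drawn from a
    window of num_frames * sample_rate decoded frames; dividing by
    target_fps converts that window to seconds.
    """
    window_frames = num_frames * sample_rate
    window_seconds = window_frames / target_fps
    return window_frames, window_seconds


# Example values: 8 frames, gap of 32, video decoded at an assumed 30 FPS.
frames, seconds = clip_window(num_frames=8, sample_rate=32, target_fps=30)
print(frames, round(seconds, 2))
```

So sample_rate sets the spacing in frames, while target_fps determines how much real time that spacing corresponds to.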

syamamo1 commented 1 year ago

The paper says it uses a sample rate of 1/32 for 8-frame videos. How does this make sense? Does that mean that 8/32 = 1/4 of a frame is used? Or does it mean that the original video is 8*32 = 256 frames and the model samples at a 1/32 rate, so 8 of the 256 total frames are used? Thanks!