Closed rohit-gupta closed 7 years ago
Never mind, I found the answer in the paper:
All video frames are resized into 128×171. This is roughly half resolution of the UCF101 frames. Videos are split into non-overlapped 16-frame clips which are then used as input to the networks. The input dimensions are 3×16×128×171. We also use jittering by using random crops with a size of 3×16×112×112 of the input clips during training.
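Based on that description, the preprocessing order is: resize frames to 128×171, subtract the (3, 16, 128, 171) mean array, then crop to 112×112. A minimal NumPy sketch of that pipeline, assuming frames are already resized to 128×171 (the function name and the zero-filled stand-in for the c3d_means array are illustrative, not from the repo):

```python
import numpy as np

def preprocess_clip(frames, mean):
    """Prepare one 16-frame clip for C3D.

    frames: uint8 array of shape (16, 128, 171, 3) -- frames already
            resized to 128x171 (e.g. with cv2.resize).
    mean:   float array of shape (3, 16, 128, 171) -- e.g. the array
            stored in the c3d_means file.
    Returns a float32 array of shape (3, 16, 112, 112).
    """
    clip = frames.astype(np.float32)       # (16, 128, 171, 3)
    clip = clip.transpose(3, 0, 1, 2)      # -> (3, 16, 128, 171)
    clip -= mean                           # subtract mean at full 128x171 size
    # Center crop to 112x112; the paper uses random 112x112 crops
    # during training for jittering.
    top = (128 - 112) // 2
    left = (171 - 112) // 2
    return clip[:, :, top:top + 112, left:left + 112]

# Usage sketch with stand-in data (not the real mean file):
mean = np.zeros((3, 16, 128, 171), dtype=np.float32)
frames = np.random.randint(0, 256, size=(16, 128, 171, 3), dtype=np.uint8)
out = preprocess_clip(frames, mean)
print(out.shape)  # (3, 16, 112, 112)
```

This also explains the shape mismatch in the question below: the mean is subtracted before cropping, so it is stored at the pre-crop resolution of 128×171.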
The input for the C3D model has dimensions 3×16×112×112 (channels × timesteps × width × height), but the c3d_means file contains a NumPy array of shape (3, 16, 128, 171).
Does this mean the video has to be resized? And what should be done when using C3D as a feature extractor on a different dataset?