Hi, I have some questions regarding the C3D paper:
In "3.1. 3D convolution and pooling", under "Common network settings", the paper states: "Videos are split into non-overlapped 16-frame clips which are then used as input to the networks".
But in "3.3. Spatiotemporal feature learning", under "C3D video descriptor", the paper states: "To extract C3D feature, a video is split into 16 frame long clips with a 8-frame overlap between two consecutive clips".
Does this mean that for training the 16-frame clips do not overlap, but for feature extraction the 16-frame clips should overlap by 8 frames?
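Just to make sure I read the two passages correctly, here is my understanding of the clip splitting as a small sketch (the function name and the 64-frame example are mine, not from the paper; stride 16 would give the non-overlapping training clips, stride 8 the extraction clips with 8-frame overlap):

```python
def clip_starts(num_frames, clip_len=16, stride=16):
    """Start indices of fixed-length clips taken from a video.

    stride == clip_len -> non-overlapping clips (training, Sec 3.1)
    stride == 8        -> 8-frame overlap (feature extraction, Sec 3.3)
    """
    return list(range(0, num_frames - clip_len + 1, stride))

# For a hypothetical 64-frame video:
print(clip_starts(64, stride=16))  # [0, 16, 32, 48]
print(clip_starts(64, stride=8))   # [0, 8, 16, 24, 32, 40, 48]
```

Is that the intended behavior?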
So suppose I have a training set and a test set, and I wish to:
1. Fine-tune C3D on the training set
2. Extract features on the training set after fine-tuning, then train an SVM on those features
3. Extract features on the test set, then evaluate the SVM on those features
Then for step 1 the clips should not overlap, but for steps 2 and 3 the clips should overlap by 8 frames, correct?
Please help me, thank you very much
Same question about the feature extractor. In the example code here, it seems to extract frames every 60 seconds. Also, I don't see where the L2 norm is implemented; the paper says there should be an L2 normalization after averaging the clip features.
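For reference, this is what I understood the paper's video descriptor to be, as a minimal sketch (plain Python, names are mine): average the per-clip features, then L2-normalize the result.

```python
import math

def video_descriptor(clip_features):
    """Average per-clip feature vectors, then L2-normalize the average,
    as described for the C3D video descriptor in Sec 3.3."""
    dim = len(clip_features[0])
    n = len(clip_features)
    # element-wise mean over all clips
    avg = [sum(f[i] for f in clip_features) / n for i in range(dim)]
    # L2 normalization (guard against an all-zero vector)
    norm = math.sqrt(sum(x * x for x in avg)) or 1.0
    return [x / norm for x in avg]
```

If the example code skips this normalization step, is that intentional, or should I add it myself before feeding the features to the SVM?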