Hi, I have some questions regarding the C3D paper:
In "3.1. 3D convolution and pooling", under "Common network settings", the paper states: "Videos are split into non-overlapped 16-frame clips which are then used as input to the networks".
But in "3.3. Spatiotemporal feature learning", under "C3D video descriptor", the paper states: "To extract C3D feature, a video is split into 16 frame long clips with a 8-frame overlap between two consecutive clips".
Does this mean that for training the 16-frame clips do not overlap, but for feature extraction the 16-frame clips should overlap by 8 frames?
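Just to make sure I read the two passages correctly, here is my understanding of the clip splitting as a small sketch (the function name and the 64-frame example are mine, not from the paper; stride 16 would give the non-overlapping training clips, stride 8 the extraction clips with 8-frame overlap):

```python
def clip_starts(num_frames, clip_len=16, stride=16):
    """Start indices of fixed-length clips taken from a video.

    stride == clip_len -> non-overlapping clips (training, Sec 3.1)
    stride == 8        -> 8-frame overlap (feature extraction, Sec 3.3)
    """
    return list(range(0, num_frames - clip_len + 1, stride))

# For a hypothetical 64-frame video:
print(clip_starts(64, stride=16))  # [0, 16, 32, 48]
print(clip_starts(64, stride=8))   # [0, 8, 16, 24, 32, 40, 48]
```

Is that the intended behavior?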
So suppose I have a training set and a test set, and I wish to:
1. Fine-tune C3D on the training set
2. Extract features on the training set after fine-tuning, then train an SVM on those features
3. Extract features on the test set, then evaluate the SVM on those features
Then for step 1 the clips should not overlap, but for steps 2 and 3 the clips should overlap by 8 frames, correct?
Please help me, thank you very much
Same question about the feature extractor. In the example code here, it seems to extract frames every 60 seconds. Also, I don't see where the L2 norm is implemented; the paper says there should be an L2 normalization after averaging the clip features.
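For reference, this is what I understood the paper's video descriptor to be, as a minimal sketch (plain Python, names are mine): average the per-clip features, then L2-normalize the result.

```python
import math

def video_descriptor(clip_features):
    """Average per-clip feature vectors, then L2-normalize the average,
    as described for the C3D video descriptor in Sec 3.3."""
    dim = len(clip_features[0])
    n = len(clip_features)
    # element-wise mean over all clips
    avg = [sum(f[i] for f in clip_features) / n for i in range(dim)]
    # L2 normalization (guard against an all-zero vector)
    norm = math.sqrt(sum(x * x for x in avg)) or 1.0
    return [x / norm for x in avg]
```

If the example code skips this normalization step, is that intentional, or should I add it myself before feeding the features to the SVM?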