No, we don't apply a sliding window in training. Instead, we employ temporal jittering by sampling multiple fixed clips randomly from the video. We define the epoch size as the number of sampled clips across the full dataset. Please refer to the supplementary materials for the exact numbers.
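To make the sampling scheme above concrete, here is a minimal sketch of temporal jittering, not the actual training code. The helper names (`sample_clip_starts`, `build_epoch`) and the values for `clips_per_video` and the video lengths are hypothetical; the real numbers are in the supplementary materials.

```python
import random


def sample_clip_starts(num_frames, clip_len, clips_per_video, rng):
    """Draw random start frames for fixed-length clips (temporal jittering)."""
    # Last valid start so the clip fits inside the video.
    # Videos shorter than clip_len fall back to start 0 and would need
    # padding or looping when decoded (not handled in this sketch).
    max_start = max(num_frames - clip_len, 0)
    return [rng.randint(0, max_start) for _ in range(clips_per_video)]


def build_epoch(video_lengths, clip_len, clips_per_video, seed=0):
    """Return a shuffled list of (video_id, start_frame, end_frame) clips.

    The epoch size is the number of sampled clips across the dataset,
    i.e. len(video_lengths) * clips_per_video, not the number of videos.
    """
    rng = random.Random(seed)
    epoch = []
    for video_id, num_frames in enumerate(video_lengths):
        starts = sample_clip_starts(num_frames, clip_len, clips_per_video, rng)
        for start in starts:
            epoch.append((video_id, start, start + clip_len))
    rng.shuffle(epoch)
    return epoch


if __name__ == "__main__":
    # Hypothetical dataset: three videos with 100, 64, and 32 frames.
    clips = build_epoch([100, 64, 32], clip_len=32, clips_per_video=4)
    print(len(clips))  # epoch size = 3 videos * 4 clips each
```

Unlike a sliding window, the clip positions here are drawn at random, so re-running with a different seed gives a different set of clips from the same videos.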
Yes, it is. We also finetune with a larger clip size (32 frames) in the state-of-the-art comparison, following the practice of other methods.
Thanks for sharing! I'm trying to reproduce XDC with TensorFlow, and I have run into a few problems along the way.