HumamAlwassel / XDC

Self-Supervised Learning by Cross-Modal Audio-Video Clustering (NeurIPS 2020)
http://humamalwassel.com/publication/xdc/
MIT License

Question of XDC #5

Closed MingYang-buaa closed 3 years ago

MingYang-buaa commented 3 years ago

Thanks for sharing your work! I'm trying to reproduce XDC with TensorFlow, and I have a few questions about the process.

  1. How should the training data be prepared? For a video, do you use only one of the clips, or do you process all the clips like a sliding window?
  2. Is the pre-training procedure consistent with the fine-tuning procedure, except for the hyperparameters?
HumamAlwassel commented 3 years ago

Hi @MingYang-buaa,

Thanks for your interest in our work.

  1. No, we don't apply a sliding window during training. Instead, we employ temporal jittering: we randomly sample multiple fixed-length clips from each video (see the sketch after this list). We define the epoch size as the total number of sampled clips across the full dataset. Please refer to the supplementary materials for the exact numbers.
  2. Yes, it is. Following the common practice of other methods, we also fine-tune with a larger clip size (32 frames) for the state-of-the-art comparison.
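
A minimal sketch of the temporal jittering described in point 1, assuming you only need the sampled (start, end) frame indices per video. The names `clip_len` and `clips_per_video` are illustrative placeholders, not values or identifiers from the XDC code; the exact numbers are in the supplementary materials.

```python
import random

def sample_jittered_clips(num_frames, clip_len=8, clips_per_video=5, seed=None):
    """Randomly sample fixed-length clips from one video (temporal jittering).

    Returns a list of (start, end) frame indices, one pair per sampled clip.
    """
    rng = random.Random(seed)
    max_start = max(num_frames - clip_len, 0)
    # Each video contributes `clips_per_video` independently sampled clips,
    # so the epoch size is clips_per_video * number_of_videos.
    starts = [rng.randint(0, max_start) for _ in range(clips_per_video)]
    return [(start, start + clip_len) for start in starts]

# Example: a 300-frame video yields 5 random fixed-length clip windows per epoch.
print(sample_jittered_clips(num_frames=300, clip_len=8, clips_per_video=5, seed=0))
```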

Cheers!