facebookresearch / AVID-CMA

Audio Visual Instance Discrimination with Cross-Modal Agreement

Question about clips_per_video=10 when training on Kinetics #8

Open russellllaputa opened 2 years ago

russellllaputa commented 2 years ago

Hi

Thank you for your excellent work and release of the code.

There is one thing in the code I am very confused about: why do you set clips_per_video=10 in your training script on Kinetics-400? If I have not misunderstood, this will only repeat each sample 10 times, so training the model for 30 epochs has the same effect as training for 300 epochs, as stated in your paper.
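To make the repetition concrete, here is a minimal sketch of how a clips_per_video multiplier can inflate an epoch. The class and names are hypothetical, for illustration only, not AVID-CMA's actual dataset code:

```python
# Hypothetical sketch: a dataset wrapper where each video appears
# clips_per_video times per epoch. Names are illustrative only.

class RepeatedVideoDataset:
    """Wraps a list of videos so each is visited clips_per_video times."""

    def __init__(self, videos, clips_per_video=10):
        self.videos = videos
        self.clips_per_video = clips_per_video

    def __len__(self):
        # One "epoch" over this dataset sees each video clips_per_video times.
        return len(self.videos) * self.clips_per_video

    def __getitem__(self, index):
        # Indices beyond the real dataset size wrap back to the same videos;
        # in practice a random clip would be sampled from the video each time.
        return self.videos[index % len(self.videos)]


videos = ["vid_a", "vid_b", "vid_c"]
ds = RepeatedVideoDataset(videos, clips_per_video=10)
print(len(ds))  # 30 items per epoch for 3 videos
```

Under this reading, 30 epochs over the wrapped dataset correspond to roughly 300 passes over the underlying videos.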

Did setting clips_per_video=10 result in a faster convergence in your training?

Thank you in advance for answering the question!

Best,

pedro-morgado commented 2 years ago

Hi,

Yes, it did. Setting the number of clips to 10 allows the sampling to be more "random". Since memories are updated when a video is sampled, by the end of the first epoch the memories of most samples have been updated once, except for the samples that are still left to train on. So in the last few iterations of the first epoch, the model would end up learning to distinguish random noise (non-updated memories) from negative memories that have all been updated once. Because this task is artificially easy, the model would quickly overfit to it: instance discrimination accuracy goes up (artificially) for a few iterations, but then drops significantly once the new epoch starts.

Ideally, we could avoid this behavior completely by sampling batches with replacement, but to stay closer to prior codebases, we ended up implementing it the way you see it.

Hope this helps,
Pedro
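The effect Pedro describes can be illustrated with a small simulation. The function below is a hypothetical sketch (not code from the repo): it measures how far into an epoch you get before every video's memory has been updated at least once, comparing one clip per video against ten:

```python
import random


def first_full_update_fraction(num_videos, clips_per_video, seed=0):
    """Fraction of the epoch elapsed before every video's memory-bank
    entry has been updated at least once (hypothetical simulation)."""
    rng = random.Random(seed)
    # Each video index appears clips_per_video times, then shuffled,
    # mimicking an epoch over a dataset with repeated samples.
    indices = [i % num_videos for i in range(num_videos * clips_per_video)]
    rng.shuffle(indices)

    updated = set()
    for step, idx in enumerate(indices, 1):
        updated.add(idx)  # sampling a video refreshes its memory
        if len(updated) == num_videos:
            return step / len(indices)
    return 1.0


# With one clip per video, the last memory is only refreshed at the very
# end of the epoch (fraction == 1.0), so late iterations contrast stale
# random-noise memories against fully updated negatives.
print(first_full_update_fraction(1000, clips_per_video=1))

# With 10 clips per video, every memory is typically refreshed well
# before the epoch ends, avoiding the artificially easy task.
print(first_full_update_fraction(1000, clips_per_video=10))
```

This is only a toy model of the update schedule, but it shows why repeating each video within the epoch keeps the memory bank fresh relative to the samples being contrasted.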