20CVPR| A Multigrid Method for Efficiently Training Video Models

Paper & Code

Authors: Chao-Yuan Wu (1,2), Ross Girshick (2), Kaiming He (2), Christoph Feichtenhofer (2), Philipp Krähenbühl (1)
(1) The University of Texas at Austin, (2) Facebook AI Research (FAIR)

Problem to be tackled: High-resolution models perform well but train slowly. Low-resolution models train faster but are less accurate.
Trade-off: computation can be spent either on processing more examples per mini-batch or on processing larger temporal and spatial dimensions per example.

Core observation: The underlying sampling grid that is used to train video models need not be constant during training.

Highlight

To avoid this trade-off, the paper proposes variable mini-batch shapes whose spatial-temporal resolutions change according to a schedule. Training is accelerated by scaling up the mini-batch size and learning rate whenever the other dimensions shrink. With this strategy, training becomes faster without losing accuracy.
Different shapes: obtained by resampling the training data on multiple sampling grids.
Sampling grid: specified by a temporal span, a spatial span, a temporal stride, and a spatial stride.
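
To make the four parameters concrete, here is a minimal sketch of a sampling grid in Python. The class name `SamplingGrid`, its fields, and the helper `sample_indices` are hypothetical names chosen for these notes, not taken from the paper's code. The sketch also assumes plain strided indexing, whereas a real data loader would more likely resize crops by interpolation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingGrid:
    """Hypothetical container for the four grid parameters."""
    temporal_span: int    # number of source frames the grid covers
    spatial_span: int     # side length (in source pixels) the grid covers
    temporal_stride: int  # spacing between sampled frames
    spatial_stride: int   # spacing between sampled pixels

    def shape(self):
        """Return the resampled (t, h, w) shape implied by span and stride."""
        t = len(range(0, self.temporal_span, self.temporal_stride))
        s = len(range(0, self.spatial_span, self.spatial_stride))
        return t, s, s


def sample_indices(start, span, stride):
    """Indices picked along one axis: cover `span` positions, step by `stride`."""
    return list(range(start, start + span, stride))


# Same spans, different strides: the coarse grid sees the same clip
# (64 frames, 224 px) but yields a much smaller tensor.
fine = SamplingGrid(temporal_span=64, spatial_span=224,
                    temporal_stride=2, spatial_stride=1)
coarse = SamplingGrid(temporal_span=64, spatial_span=224,
                      temporal_stride=8, spatial_stride=2)
```

Keeping the span fixed while increasing the stride is only one way to shrink a shape. Reducing the span with a fixed stride (a shorter or smaller crop) is another, and both fit the same four-parameter description.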

Methods

Baseline: a reference video model (e.g., C3D, I3D) trained by a baseline mini-batch optimizer (e.g., SGD) that operates on mini-batches of shape B x T x H x W (mini-batch size x number of frames x height x width) for some number of epochs (e.g., 100).

This paper: considers temporal and spatial shapes t x h x w that are formed by resampling source videos with a new sampling grid that has its own spans and strides.
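
The interplay between a smaller clip shape and a larger mini-batch can be sketched as follows. This is an illustration under two assumptions that these notes do not spell out: the mini-batch size is scaled so that the total number of sampled points per batch stays roughly constant, and the learning rate follows the mini-batch size linearly. The function name `scaled_batch` is hypothetical.

```python
def scaled_batch(base_shape, new_thw, base_lr):
    """Return (mini-batch size, learning rate) for a resampled clip shape.

    base_shape: (B, T, H, W) of the baseline mini-batch.
    new_thw:    (t, h, w) of the resampled clips.
    Assumption: keep b*t*h*w close to B*T*H*W, and scale lr by b/B.
    """
    B, T, H, W = base_shape
    t, h, w = new_thw
    scale = (T * H * W) / (t * h * w)   # how much cheaper one clip became
    b = max(1, int(B * scale))          # spend the savings on more clips
    lr = base_lr * (b / B)              # linear learning-rate scaling (assumed)
    return b, lr


# Halving frames, height, and width makes each clip 8x cheaper,
# so the mini-batch (and, under the assumption, the lr) grows 8x.
b, lr = scaled_batch(base_shape=(8, 16, 224, 224),
                     new_thw=(8, 112, 112),
                     base_lr=0.1)
```

Passing the baseline shape itself, `new_thw=(16, 224, 224)`, returns the original mini-batch size and learning rate, so the baseline is one point on the same schedule.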
