We propose a Spatiotemporal Sampling Network (STSN) that uses deformable
convolutions across time for object detection in videos. Our STSN performs
object detection in a video frame by learning to spatially sample features from
the adjacent frames. This naturally renders the approach robust to occlusion or
motion blur in individual frames. Our framework does not require additional
supervision, as it optimizes sampling locations directly with respect to object
detection performance. Our STSN outperforms the state-of-the-art on the
ImageNet VID dataset and compared to prior video object detection methods it
uses a simpler design, and does not require optical flow data for training. We
also show that after training STSN on videos, we can adapt it for object
detection in images, by adding and training a single deformable convolutional
layer on still-image data. This leads to improvements in accuracy compared to
traditional object detection in images.
Bertasius, Gedas, Torresani, Lorenzo, Shi, Jianbo
We propose a Spatiotemporal Sampling Network (STSN) that uses deformable convolutions across time for object detection in videos. Our STSN performs object detection in a video frame by learning to spatially sample features from the adjacent frames. This naturally renders the approach robust to occlusion or motion blur in individual frames. Our framework does not require additional supervision, as it optimizes sampling locations directly with respect to object detection performance. Our STSN outperforms the state-of-the-art on the ImageNet VID dataset and compared to prior video object detection methods it uses a simpler design, and does not require optical flow data for training. We also show that after training STSN on videos, we can adapt it for object detection in images, by adding and training a single deformable convolutional layer on still-image data. This leads to improvements in accuracy compared to traditional object detection in images.
https://arxiv.org/abs/1803.05549