alexandrosstergiou / Saliency-Tubes-Visual-Explanations-for-Spatio-Temporal-Convolutions

[ICIP 2019] Implementation of Saliency Tubes for 3D Convolutions in PyTorch and Keras to localise the spatio-temporal regions on which 3D CNNs focus.
MIT License

Why can't we just backpropagate the gradient to the input frames? #2

Closed: zeal-up closed 5 years ago

zeal-up commented 5 years ago

If I have not missed something, the saliency map you get is just the last convolutional layer's saliency map, which you then reshape (the zoom() function in your code) to the video dimensions. I think this will cause some misalignment problems.

So, why can't we use the 2D method and just backpropagate to the input frames (a cube) to get a saliency cube? Am I missing something or misunderstanding?
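
For concreteness, a minimal sketch of what I mean, using a toy PyTorch 3D CNN and vanilla gradients (all model/shape choices below are illustrative, not your architecture):

```python
import torch
import torch.nn as nn

# Toy 3D CNN standing in for the real model (illustrative only).
model = nn.Sequential(
    nn.Conv3d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)
model.eval()

clip = torch.randn(1, 3, 16, 112, 112, requires_grad=True)  # [N, C, T, H, W]
target_class = 3

# Backpropagate the class score to the input cube.
score = model(clip)[0, target_class]
score.backward()

# Saliency cube: max |gradient| over colour channels,
# at the same temporal/spatial resolution as the input.
saliency_cube = clip.grad.abs().amax(dim=1).squeeze(0)  # [T, H, W]
```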

Thanks!

alexandrosstergiou commented 5 years ago

Good question.

The main problem that arises when using 3D convolutions is that all the dimensions of the volume (in this case [# frames, width, height]) decrease as more convolution blocks are added, while the feature space grows. In other words, by the time you reach the last few layers of the feature extractor, you have very small spatio-temporal sizes but a very large feature space. When you then want to represent this back in the input space, you inevitably have to deal with the curse of dimensionality.
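
To illustrate the shrinkage, here is a quick shape trace through a few strided 3D convolutions (the channels and strides are placeholders, not the exact architecture):

```python
import torch
import torch.nn as nn

# The spatio-temporal volume shrinks while the channel dimension grows.
x = torch.randn(1, 3, 16, 112, 112)  # [N, C, T, H, W]
for c_in, c_out in [(3, 64), (64, 128), (128, 256), (256, 512)]:
    x = nn.Conv3d(c_in, c_out, kernel_size=3, stride=2, padding=1)(x)
    print(tuple(x.shape))
# (1, 64, 8, 56, 56) -> (1, 128, 4, 28, 28)
# -> (1, 256, 2, 14, 14) -> (1, 512, 1, 7, 7)
```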

Therefore, in order to "guess" (within reasonable bounds) how the activation maps of the last convolution layer would be represented in the same space as the input, we used spline interpolation - with the scipy.ndimage.zoom function - not only to rescale individual frames, but also the entire video volume. As for using a standard 2D method frame-wise, I am not exactly sure how you could achieve that without a way of representing the outputs/activation maps of a 3D convolution in the same dimensions as its inputs. Unless you mean using 3D convolutions as 2D convolutions, in which case using a 2D method would indeed be reasonable.
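
For example, a minimal sketch of that rescaling step with scipy.ndimage.zoom (the activation-map and clip sizes are placeholders):

```python
import numpy as np
from scipy.ndimage import zoom

# Activation volume from the last 3D conv layer: 4 frames x 7 x 7 spatial,
# upsampled to match a 16 x 112 x 112 input clip.
act = np.random.rand(4, 7, 7).astype(np.float32)

factors = (16 / 4, 112 / 7, 112 / 7)          # per-axis scaling
heatmap_volume = zoom(act, factors, order=3)  # cubic spline interpolation

assert heatmap_volume.shape == (16, 112, 112)
```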

Also, in order to create a (class) saliency tube, the activation maps are multiplied channel-wise by the corresponding class feature vector, which means that the number of channels needs to match the dimensionality of the class vector.
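
In code, that weighting step looks roughly like this (the shapes and random data are illustrative):

```python
import numpy as np

# Illustrative shapes: activations [C, T', H', W'] from the last conv layer,
# class_weights [C] taken from the classification layer for the target class.
C, Tp, Hp, Wp = 512, 4, 7, 7
activations = np.random.rand(C, Tp, Hp, Wp).astype(np.float32)
class_weights = np.random.rand(C).astype(np.float32)

# Channel-wise multiply, then sum over channels to form the saliency tube.
tube = (class_weights[:, None, None, None] * activations).sum(axis=0)  # [T', H', W']
```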

Hope the above helped. Best.