alexandrosstergiou / Saliency-Tubes-Visual-Explanations-for-Spatio-Temporal-Convolutions

[ICIP 2019] Implementation of Saliency Tubes for 3D Convolutions in PyTorch and Keras to localise the spatio-temporal regions on which 3D CNNs focus.
MIT License

Why can't we just backpropagate the gradient to the input frames? #2

Closed: zeal-up closed 5 years ago

zeal-up commented 5 years ago

If I have not missed something, the saliency map you get is just the last convolutional layer's saliency map, which you then reshape (the zoom() function in your code) to the video dimensions. I think this will cause some misalignment problems.

So, why can't we use the 2D method and just backpropagate to the input frames (a cube) to get a saliency cube? Am I missing something or misunderstanding?
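
For concreteness, a minimal sketch of what I mean, using a toy PyTorch 3D CNN and vanilla gradients (all model/shape choices below are illustrative, not your architecture):

```python
import torch
import torch.nn as nn

# Toy 3D CNN standing in for the real model (illustrative only).
model = nn.Sequential(
    nn.Conv3d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)
model.eval()

clip = torch.randn(1, 3, 16, 112, 112, requires_grad=True)  # [N, C, T, H, W]
target_class = 3

# Backpropagate the class score to the input cube.
score = model(clip)[0, target_class]
score.backward()

# Saliency cube: max |gradient| over colour channels,
# at the same temporal/spatial resolution as the input.
saliency_cube = clip.grad.abs().amax(dim=1).squeeze(0)  # [T, H, W]
```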

Thanks!

alexandrosstergiou commented 5 years ago

Good question.

The main problem that arises when using 3D convolutions is that all the dimensions of the volume (in this case [# frames, width, height]) decrease as more convolution blocks are added, while the feature space grows. In other words, by the time you reach the last few layers of the feature extractor, you have very small spatio-temporal sizes but a very large feature space. When you then want to represent this back in the input space, you inevitably have to deal with the curse of dimensionality.
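
To illustrate the shrinkage, here is a quick shape trace through a few strided 3D convolutions (the channels and strides are placeholders, not the exact architecture):

```python
import torch
import torch.nn as nn

# The spatio-temporal volume shrinks while the channel dimension grows.
x = torch.randn(1, 3, 16, 112, 112)  # [N, C, T, H, W]
for c_in, c_out in [(3, 64), (64, 128), (128, 256), (256, 512)]:
    x = nn.Conv3d(c_in, c_out, kernel_size=3, stride=2, padding=1)(x)
    print(tuple(x.shape))
# (1, 64, 8, 56, 56) -> (1, 128, 4, 28, 28)
# -> (1, 256, 2, 14, 14) -> (1, 512, 1, 7, 7)
```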

Therefore, in order to "guess" (within reasonable bounds) how the activation maps of the last convolution layer would be represented in the same space as the input, we used spline interpolation - with the scipy.ndimage.zoom function - not only to rescale individual frames, but also the entire video volume. As for using a standard 2D method frame-wise, I am not exactly sure how you could achieve that without a way of representing the outputs/activation maps of a 3D convolution in the same dimensions as its inputs. Unless you mean using 3D convolutions as 2D convolutions, in which case using a 2D method would indeed be reasonable.
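
For example, a minimal sketch of that rescaling step with scipy.ndimage.zoom (the activation-map and clip sizes are placeholders):

```python
import numpy as np
from scipy.ndimage import zoom

# Activation volume from the last 3D conv layer: 4 frames x 7 x 7 spatial,
# upsampled to match a 16 x 112 x 112 input clip.
act = np.random.rand(4, 7, 7).astype(np.float32)

factors = (16 / 4, 112 / 7, 112 / 7)          # per-axis scaling
heatmap_volume = zoom(act, factors, order=3)  # cubic spline interpolation

assert heatmap_volume.shape == (16, 112, 112)
```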

Also, in order to create a (class) saliency tube, the activation maps are multiplied channel-wise by the corresponding class feature vector, which means that the number of channels needs to match the dimensionality of the class vector.
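
In code, that weighting step looks roughly like this (the shapes and random data are illustrative):

```python
import numpy as np

# Illustrative shapes: activations [C, T', H', W'] from the last conv layer,
# class_weights [C] taken from the classification layer for the target class.
C, Tp, Hp, Wp = 512, 4, 7, 7
activations = np.random.rand(C, Tp, Hp, Wp).astype(np.float32)
class_weights = np.random.rand(C).astype(np.float32)

# Channel-wise multiply, then sum over channels to form the saliency tube.
tube = (class_weights[:, None, None, None] * activations).sum(axis=0)  # [T', H', W']
```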

Hope the above helped. Best.