In this story, Spatial Transformer Network (STN), by Google DeepMind, is briefly reviewed. STN learns to crop out and scale-normalize the appropriate region, which can simplify the subsequent classification task and lead to better classification performance, as below:
(a) Input Image with Random Translation, Scale, Rotation, and Clutter, (b) STN Applied to Input Image, (c) Output of STN, (d) Classification Prediction.
It was published at 2015 NIPS with more than 1300 citations. Spatial transformations such as affine transformation and homography registration have been studied for decades.
With learning-based spatial transformation, the transformation is applied conditioned on the input or feature map. It is also highly related to another paper, “Deformable Convolutional Networks” (2017 ICCV). Thus, I decided to read this one first.
There are mainly 3 transformations learnt by STN in the paper. Indeed, more sophisticated transformations can also be applied.
Affine Transform.
Depending on the values in the matrix, we can transform (X1, Y1) to (X2, Y2) with different effects, as follows: Translation, Scaling, Rotation, and Shearing.
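The figures with the matrices are not reproduced here; in homogeneous coordinates the standard forms are:

$$\begin{pmatrix} x_2 \\ y_2 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix} \begin{pmatrix} x_1 \\ y_1 \\ 1 \end{pmatrix}$$

with, for example, translation $\begin{pmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \end{pmatrix}$, scaling $\begin{pmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \end{pmatrix}$, rotation $\begin{pmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \end{pmatrix}$, and shearing $\begin{pmatrix} 1 & k & 0 \\ 0 & 1 & 0 \end{pmatrix}$.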
If interested, please Google “Registration”, “Homography Matrix”, or “Affine Transform”.
Projective transformation can also be learnt by STN, as below: Projective Transformation.
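The paper's figure is not reproduced here; as a reminder, a projective transformation (homography) has the standard form of a 3×3 matrix with 8 degrees of freedom, followed by division by the homogeneous coordinate:

$$\begin{pmatrix} x' \\ y' \\ w' \end{pmatrix} = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ y_1 \\ 1 \end{pmatrix}, \qquad (x_2, y_2) = \left( \frac{x'}{w'}, \frac{y'}{w'} \right)$$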
Thin Plate Spline (TPS) Transformation.
An example.
The TPS transformation is more complicated than the previous two transformations. (I have learnt affine and projective mapping before, but I haven't touched TPS; if there are any mistakes, please tell me.)
To be brief, suppose we have a point (x, y) at a location other than the control points (xi, yi). We transform the point using the equations below, based on a bias, a weighted sum of x and y, and a function of the distance between (x, y) and each (xi, yi). (Here, a radial basis function, RBF.)
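The original figure with the equations is not reproduced here; a standard form of the TPS mapping, consistent with the parameter names used below, is:

$$x' = a_0 + a_1 x + a_2 y + \sum_{i=1}^{N} F_i \, U\big(\lVert (x, y) - (x_i, y_i) \rVert\big)$$

$$y' = b_0 + b_1 x + b_2 y + \sum_{i=1}^{N} G_i \, U\big(\lVert (x, y) - (x_i, y_i) \rVert\big)$$

where $U(r) = r^2 \log r$ is the TPS radial basis function and $N$ is the number of control points.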
Therefore, if we use TPS, the network needs to learn a0, a1, a2, b0, b1, b2, Fi, and Gi, which is 6 + 2N parameters in total.
Affine Transformation.
STN is composed of a Localisation Net, a Grid Generator, and a Sampler.
The transformation can be learnt as a full affine transform as above, or be more constrained, such as the transform used for attention, which only contains scaling and translation, as below: Only scaling and translation.
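From the paper, the attention-constrained transform has the form

$$A_\theta = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix}$$

i.e. an isotropic scaling $s$ plus a translation $(t_x, t_y)$, which crops a square region.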
Suppose we have a regular grid $G$; this $G$ is a set of points with target coordinates ($x^t_i$, $y^t_i$).
Then we apply the transformation $T_{\theta}$ to $G$, i.e. $T_{\theta}(G)$.
After $T_{\theta}(G)$, a set of points with source coordinates ($x^s_i$, $y^s_i$) is output. These points are altered according to the transformation parameters. The result can be a Translation, Scale, Rotation, or More Generic Warping depending on how we set $\theta$, as mentioned above.
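From the paper, for an affine transform the grid generator computes, per grid point:

$$\begin{pmatrix} x^s_i \\ y^s_i \end{pmatrix} = T_{\theta}(G_i) = A_{\theta} \begin{pmatrix} x^t_i \\ y^t_i \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x^t_i \\ y^t_i \\ 1 \end{pmatrix}$$

so each target-grid point ($x^t_i$, $y^t_i$) is mapped back to the source position ($x^s_i$, $y^s_i$) from which the output pixel is sampled.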
(a) Identity Transformation, (b) Affine Transformation.
The Subject Layer in MEG decoding is just an affine transformation. We don't have expert domain knowledge; maybe an affine transformation is already enough.
As we can see in the example above, when we need to sample from the transformed grid, we run into a sampling problem: how we sample those sub-pixel positions depends on which sampling kernel we use.
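The paper uses a bilinear sampling kernel for this: the output value at channel $c$ and target point $i$ is

$$V^c_i = \sum_{n=1}^{H} \sum_{m=1}^{W} U^c_{nm} \, \max(0, 1 - |x^s_i - m|) \, \max(0, 1 - |y^s_i - n|)$$

which is (sub-)differentiable, so gradients flow back to the transformation parameters. Below is a minimal PyTorch sketch wiring the three components together; the localisation-net layer sizes and the 28×28 single-channel input are my own illustrative assumptions, not the paper's exact architecture. `F.affine_grid` plays the role of the grid generator and `F.grid_sample` implements the bilinear sampler.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Minimal spatial transformer: localisation net -> grid generator -> sampler.

    Layer sizes are illustrative assumptions, not the paper's exact architecture.
    """
    def __init__(self):
        super().__init__()
        # Localisation net: regresses the 6 affine parameters theta from the input.
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        self.fc = nn.Linear(10 * 3 * 3, 6)  # assumes a 28x28 input
        # Initialise to the identity transformation so training starts from a no-op warp.
        self.fc.weight.data.zero_()
        self.fc.bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.fc(self.loc(x).flatten(1)).view(-1, 2, 3)
        # Grid generator: T_theta(G), source coordinates for each target grid point.
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        # Sampler: bilinear kernel evaluated at the sub-pixel source positions.
        return F.grid_sample(x, grid, mode='bilinear', align_corners=False)

x = torch.randn(4, 1, 28, 28)
print(STN()(x).shape)  # torch.Size([4, 1, 28, 28])
```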
Errors on the distorted MNIST datasets (Left), some examples that fail with CNN but are successfully classified with STN (Right).
As we can see, ST-FCN outperforms FCN, and ST-CNN outperforms CNN. ST-CNN is also consistently better than ST-FCN in all settings.
Errors on the SVHN dataset (Left), some examples used in ST-CNN (Right).
Similarly, ST-CNN outperforms Maxout and CNN. (I give a very brief introduction to Maxout in NoC; please read it if interested.) And ST-CNN Multi outperforms ST-CNN Single by a small margin.
Fine-Grained Bird Classification: Accuracy (Left), 2×ST-CNN (Top Right Row), 4×ST-CNN (Bottom Right Row).
It is interesting that one ST (red) has learnt to be a head detector, while the other 3 STs (green) learn to attend to the central part of the body of the bird.
MNIST Addition.
2×ST-CNN: It is interesting that each ST learns to transform one of the digits, even though each ST receives both input digits.
Co-localisation.
Triplet loss: a hinge loss is used to enforce that the distance between the two ST outputs is less than the distance to a random crop, hoping to encourage the spatial transformers to localise the common object.
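As I read it, the hinge loss has roughly the following triplet form (the encoding function $e(\cdot)$ and margin $\alpha$ are my own notation):

$$\min_{\theta} \sum_{n} \sum_{m \neq n} \max\Big(0, \; \lVert e(I^T_n) - e(I^T_m) \rVert_2^2 - \lVert e(I^T_n) - e(I^{\text{rand}}_n) \rVert_2^2 + \alpha \Big)$$

where $I^T$ denotes the crop produced by the ST and $I^{\text{rand}}$ a random crop.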
STN can also be extended to 3D affine transformations.
Sik-Ho Tang. Review: STN — Spatial Transformer Network (Image Classification).