NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tang | Review: STN -- Spatial Transformer Network (Image Classification). #93

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Sik-Ho Tang. Review: STN — Spatial Transformer Network (Image Classification).

NorbertZheng commented 1 year ago

Overview

In this story, Spatial Transformer Network (STN), by Google DeepMind, is briefly reviewed. STN helps to crop out and scale-normalize the appropriate region, which can simplify the subsequent classification task and lead to better classification performance, as below:

Figure: (a) Input Image with Random Translation, Scale, Rotation, and Clutter, (b) STN Applied to Input Image, (c) Output of STN, (d) Classification Prediction.

It was published in 2015 NIPS with more than 1300 citations. Spatial transformations such as affine transformation and homography registration have been studied for decades.

With learning-based spatial transformation, the transformation is applied conditioned on the input or feature map. It is also highly related to another paper, “Deformable Convolutional Networks” (2017 ICCV). Thus, I decided to read this one first.

NorbertZheng commented 1 year ago

Quick Review on Spatial Transformation Matrices

There are mainly 3 transformations learnt by STN in the paper. Indeed, more sophisticated transformations can also be applied.

Affine Transformation

Figure: Affine Transform.

Depending on the values in the matrix, we can transform (X1, Y1) to (X2, Y2) with different effects, as follows: Figure: Translation, Scaling, Rotation, and Shearing.
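As a minimal sketch (NumPy, homogeneous coordinates; the matrix values below are illustrative, not from the paper):

```python
import numpy as np

# Minimal sketch: apply a 2x3 affine matrix to a point in homogeneous
# coordinates, i.e. (x2, y2) = theta @ [x1, y1, 1]^T.
def affine(theta, p):
    return theta @ np.array([p[0], p[1], 1.0])

a = np.deg2rad(30)
rotate = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0]])
translate = np.array([[1.0, 0.0,  2.0],
                      [0.0, 1.0, -1.0]])
print(affine(rotate, (1.0, 0.0)))     # ~ (0.866, 0.5): rotation by 30 degrees
print(affine(translate, (1.0, 0.0)))  # (3.0, -1.0): shift by (+2, -1)
```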

If interested, please Google “Registration”, “Homography Matrix”, or “Affine Transform”.

NorbertZheng commented 1 year ago

Projective Transformation

Projective transformation can also be learnt in STN, as below: Figure: Projective Transformation.
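A minimal sketch of the projective (homography) case: a full 3x3 matrix followed by division by the third homogeneous coordinate. The matrix values are illustrative.

```python
import numpy as np

# Minimal sketch of a projective transform: unlike the affine case,
# the result is divided by the third homogeneous coordinate w.
def projective(H, p):
    x, y, w = H @ np.array([p[0], p[1], 1.0])
    return x / w, y / w

H = np.array([[1.0, 0.2, 0.0],
              [0.0, 1.0, 0.0],
              [0.1, 0.0, 1.0]])  # bottom row != (0, 0, 1) -> perspective effect
print(projective(H, (2.0, 3.0)))  # (~2.167, 2.5): note the division by w = 1.2
```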

NorbertZheng commented 1 year ago

Thin Plate Spline (TPS) Transformation

Figure: Thin Plate Spline (TPS) Transformation.

Figure: An example.

The TPS transformation is more complicated than the previous two transformations. (I have learnt affine and projective mapping before, but I have not touched TPS; if there are mistakes, please tell me.)

To be brief, suppose we have a point (x, y) at a location other than the input points (xi, yi). We use the equations on the right to transform the point based on a bias, a weighted sum of x and y, and a function of the distance between (x, y) and each (xi, yi) (here, a radial basis function, RBF).

Therefore, if we use TPS, the network needs to learn a0, a1, a2, b0, b1, b2, Fi, and Gi, which is 6 + 2N parameters in total.
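A minimal NumPy sketch of this mapping, assuming the classic TPS radial basis U(r) = r² log r; the control points `ctrl` (the (xi, yi) above) and the parameter values are illustrative placeholders, not learnt values.

```python
import numpy as np

# Minimal TPS sketch: bias + linear terms + RBF terms over N control points.
def tps_warp(x, y, ctrl, a, b, F, G, eps=1e-9):
    r2 = (ctrl[:, 0] - x) ** 2 + (ctrl[:, 1] - y) ** 2
    U = r2 * np.log(r2 + eps) / 2.0  # r^2 log r, written via r^2 to avoid sqrt
    x2 = a[0] + a[1] * x + a[2] * y + F @ U
    y2 = b[0] + b[1] * x + b[2] * y + G @ U
    return x2, y2

# With N control points there are 6 + 2N parameters: 3 in a, 3 in b, N in F, N in G.
N = 4
ctrl = np.random.default_rng(0).uniform(0, 1, size=(N, 2))
a, b = np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])  # identity linear part
F, G = np.zeros(N), np.zeros(N)                              # no RBF warping
print(tps_warp(0.3, 0.7, ctrl, a, b, F, G))  # ~ (0.3, 0.7) for the identity
```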

NorbertZheng commented 1 year ago

Spatial Transformer Network (STN)

Figure: Affine Transformation.

STN is composed of a Localisation Net, a Grid Generator, and a Sampler.

Localisation Net

It can learn a full affine transform as above, or a more constrained one, such as the form used for attention, which only contains scaling and translation, as below: Figure: Only scaling and translation.

Grid Generator

The grid generator uses the predicted transformation parameters to compute, for each output pixel, the corresponding source coordinates in the input feature map.

Sampler

The sampler then reads the input feature map at those (generally sub-pixel) coordinates, e.g. with bilinear interpolation, to produce the output feature map.

Figure: (a) Identity Transformation, (b) Affine Transformation.
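A minimal PyTorch sketch of the three components together, assuming MNIST-sized single-channel input; the layer sizes are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Minimal spatial transformer: localisation net -> grid generator -> sampler."""
    def __init__(self):
        super().__init__()
        # Localisation net: predicts the 2x3 affine matrix theta from the input.
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, 7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, 5), nn.MaxPool2d(2), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(10 * 3 * 3, 32), nn.ReLU(), nn.Linear(32, 6))
        # Initialise to the identity transform, as recommended in the paper.
        self.fc[-1].weight.data.zero_()
        self.fc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.fc(self.loc(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)  # grid generator
        return F.grid_sample(x, grid, align_corners=False)          # (bilinear) sampler

x = torch.randn(4, 1, 28, 28)  # e.g. a batch of MNIST digits
print(STN()(x).shape)          # torch.Size([4, 1, 28, 28])
```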

NorbertZheng commented 1 year ago

The Subject Layer in MEG decoding is just an affine transformation. We don't have expert domain knowledge; maybe an affine transformation is already enough.

NorbertZheng commented 1 year ago

Sampling Kernel

As we can see in the example above, sampling on a transformed grid raises a sampling problem: how we sample those sub-pixel positions depends on which sampling kernel we use.
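A minimal NumPy sketch of the bilinear kernel, the most common choice; the image and the query coordinates below are illustrative.

```python
import numpy as np

# Minimal sketch of bilinear sampling at a sub-pixel location (x, y);
# `img` is a 2D array indexed as img[row, col] = img[y, x].
def bilinear_sample(img, x, y):
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, img.shape[1] - 1), min(y0 + 1, img.shape[0] - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bottom = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bottom

img = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(img, 1.5, 2.5))  # interpolates between 4 neighbours -> 11.5
```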

NorbertZheng commented 1 year ago

Experimental Results

Distorted MNIST

Figure: Errors on distorted MNIST datasets (left); some examples that fail with CNN but are successfully classified with STN (right).

As we can see, ST-FCN outperforms FCN and ST-CNN outperforms CNN. And ST-CNN is consistently better than ST-FCN in all settings.

NorbertZheng commented 1 year ago

SVHN (Street View House Number)

Figure: Errors on SVHN datasets (left); some examples used in ST-CNN (right).

Similarly, ST-CNN outperforms Maxout and CNN. (I give a very brief introduction to Maxout in NoC; please read it if interested.) And ST-CNN Multi outperforms ST-CNN Single a bit.

NorbertZheng commented 1 year ago

Fine-Grained Classification

Figure: Fine-Grained Bird Classification. Accuracy (left), 2×ST-CNN (top right row), 4×ST-CNN (bottom right row).

It is interesting that one ST (red) has learnt to be a head detector, while the other 3 STs (green) learn the central part of the bird's body.

NorbertZheng commented 1 year ago

Some Other Tasks

MNIST Addition

Figure: MNIST Addition.

2×ST-CNN: It is interesting that each ST learns to transform one of the digits, even though each ST receives both input digits.

NorbertZheng commented 1 year ago

Co-localisation

Figure: Co-localisation.

Figure: Triplet loss. A hinge loss is used to enforce that the distance between the two outputs of the ST is less than the distance to a random crop, hoping to encourage the spatial transformer to localise the common objects.
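A minimal sketch of that hinge loss, assuming some feature encoder `enc` and a margin hyperparameter; all names here are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the co-localisation hinge loss. crop_a / crop_b are the
# ST outputs for two images of the same class, crop_rand is a random crop,
# and `enc` is an assumed feature encoder.
def coloc_hinge_loss(enc, crop_a, crop_b, crop_rand, margin=1.0):
    d_pos = (enc(crop_a) - enc(crop_b)).pow(2).sum(dim=1)     # ST-to-ST distance
    d_neg = (enc(crop_a) - enc(crop_rand)).pow(2).sum(dim=1)  # ST-to-random distance
    return F.relu(d_pos - d_neg + margin).mean()              # want d_pos + margin < d_neg

enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 28 * 28, 64))
a, b, r = (torch.randn(8, 3, 28, 28) for _ in range(3))
print(coloc_hinge_loss(enc, a, b, r))
```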

NorbertZheng commented 1 year ago

Higher Dimensional Transformers

Figure: STN can also be extended to 3D affine transformation.
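A minimal sketch of the 3D case in PyTorch, assuming a volumetric input of shape (N, C, D, H, W); theta becomes a 3×4 affine matrix per sample, and the identity transform below is just for illustration.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a 3D affine spatial transform on a volume.
x = torch.randn(2, 1, 16, 16, 16)
theta = torch.eye(3, 4).unsqueeze(0).repeat(2, 1, 1)  # identity 3D transform
grid = F.affine_grid(theta, x.size(), align_corners=False)
out = F.grid_sample(x, grid, align_corners=False)
print(out.shape)  # torch.Size([2, 1, 16, 16, 16])
```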
