Lasagne / Lasagne

Lightweight library to build and train neural networks in Theano
http://lasagne.readthedocs.org/

TransformerLayer: unexpected shear artifact for non-square images #782

Open v-ablavsky opened 7 years ago

v-ablavsky commented 7 years ago

Given a non-square input (see attached examples) and a transform vector that specifies rotation and translation (i.e., a rigid transform), the transformed image computed by TransformerLayer appears to exhibit a shear artifact.

Attached files:

(1) expected_sampling_grid.pdf: the sampling grid computed using my hand-crafted method, which differs from the implementation in lasagne/layers/special.py

(2) unexpected_transformed_image.pdf: an output of SpatialTransformer that shows shear, even though the input transform was a rigid transform.

I conjecture that the issue stems from the implementation of scaling in _interpolate(), at the step commented "# scale coordinates from [-1, 1] to [0, width/height - 1]".

expected_sampling_grid.pdf unexpected_transformed_image.pdf
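The conjecture above can be checked numerically with a small NumPy sketch. It assumes a rotation-only Theta and the per-axis scaling from [-1, 1] to [0, size - 1] quoted from _interpolate(); the helper names are hypothetical and this is not the Lasagne code itself. Scaling each axis separately after the rotation gives a pixel-space linear map S R S^-1, which is only a rotation when width equals height.

```python
import numpy as np

def pixel_space_linear_part(angle, height, width):
    """Linear map seen in pixel space when a rotation is applied in [-1, 1] coordinates."""
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s], [s, c]])
    # Per-axis scaling from normalized units to pixels (x first, then y).
    scale = np.diag([(width - 1) / 2.0, (height - 1) / 2.0])
    return scale @ rotation @ np.linalg.inv(scale)

def is_rigid(matrix, tol=1e-9):
    """A 2x2 linear map is rigid when it is orthogonal, i.e. M^T M = I."""
    return np.allclose(matrix.T @ matrix, np.eye(2), atol=tol)

square = pixel_space_linear_part(np.pi / 6, height=101, width=101)
rect = pixel_space_linear_part(np.pi / 6, height=101, width=201)
print(is_rigid(square), is_rigid(rect))
```

For the square case the scaling cancels out and the map stays a rotation; for the 101x201 case the off-diagonal terms are scaled by 2 and 0.5, so the result is not orthogonal, which is consistent with the sheared output described above.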

f0k commented 7 years ago

Interesting. Possibly it wasn't originally developed for rectangular images. A network should be able to just learn to counter that, but it's still annoying you can't easily use it to apply a certain fixed transformation then (e.g., for data augmentation). Can you provide a simple script demonstrating the problem? If you have already looked into the details of how to create the sampling grid, do you know how this would be fixed?

v-ablavsky commented 7 years ago

Thanks for getting back to me so quickly! I am attaching a stand-alone script [issue_782.py packaged in a .zip file] that demonstrates this unexpected behavior. I will follow up tomorrow with some suggestions on how to address this issue. issue_782.zip

v-ablavsky commented 7 years ago

Here is my code to generate a sampling grid that does not exhibit unexpected/undesired shear artifacts. The key idea is to apply scaling of grid coordinates from [-1,1] to pixel space before applying Theta (the affine transform input to the TransformerLayer). The code below is not elegant, and any suggestions on making it simpler would be appreciated. As for integrating this computation into special.py, I'd propose putting it in _transform_affine() and removing the scaling computations from _interpolate().

=========================================

import numpy as np
import theano
import theano.tensor as T
import lasagne


def compute_sampling_grid(theta, imgs, downsample_factor):
    """
    Deep-dive into the computations performed by
    lasagne.layers.special.TransformerLayer. In particular, this function
    pulls together code from _meshgrid() and _transform_affine(); we thus
    get the warped sampling grid.
    """
    floatX_t = theano.config.floatX

    num_batch, num_channels, height, width = imgs.shape
    theta = theta.reshape((2, 3))

    out_height = np.array(height // downsample_factor[0]).astype('int64')
    out_width = np.array(width // downsample_factor[1]).astype('int64')

    # symbolic computations:
    Imgs = T.tensor4()
    Theta = T.matrix()

    sym_nr = T.scalar()
    sym_nc = T.scalar()
    sym_nr_out = T.scalar()
    sym_nc_out = T.scalar()

    # Grid will be in [-1,1]x[-1,1]
    # and will have num_rows, num_cols defined by "V" in Fig 3. of the paper
    Grid = lasagne.layers.special._meshgrid(T.cast(sym_nr_out, 'int64'),
                                            T.cast(sym_nc_out, 'int64'))

    # We now modify the transform Theta to not only apply the desired affine xform,
    # but to also map the grid coords in the pixel space of image "U" in Fig. 3 of the paper
    Scale_ = T.eye(3)
    Scale__ = T.set_subtensor(Scale_[0, 0], sym_nc / 2.0)
    Scale = T.set_subtensor(Scale__[1, 1], sym_nr / 2.0)
    # we also adjust the original Theta so that the translation component is in the pixel space of U
    Theta_ = T.set_subtensor(Theta[0, 2], (0.5 + Theta[0, 2]) * sym_nc)
    Theta__ = T.set_subtensor(Theta_[1, 2], (0.5 + Theta[1, 2]) * sym_nr)

    # this multiplication of two 3x3 matrices with block structure has the effect of changing the
    # rotation component of Theta__ by scaling each coordinate *prior* to rotation
    Theta2 = T.dot(Theta__, Scale)

    Grid_xformed = T.dot(Theta2, Grid)  # in pixel coords of U, but possibly outside image bounds

    sym_x = Grid_xformed[0, :].flatten()
    sym_y = Grid_xformed[1, :].flatten()
    sym_downsample = T.vector()

    # clip to image bounds
    sym_x = T.clip(sym_x, 0, sym_nc - 1)
    sym_y = T.clip(sym_y, 0, sym_nr - 1)

    sz = sym_x.shape[0]

    sym_x = T.reshape(sym_x, (1, sz))
    sym_y = T.reshape(sym_y, (1, sz))
    Grid_xformed_pixels = T.concatenate([sym_x, sym_y], axis=0)

    f_xform_grid = theano.function([Theta, sym_nr, sym_nc, sym_nr_out, sym_nc_out],
                                   [Theta2, Grid, Grid_xformed])

    """ ...and so on... """

f0k commented 7 years ago

> I am attaching a stand-alone script

Thank you for the clear demonstration!

> The key idea is to apply scaling of grid coordinates from [-1,1] to pixel space before applying Theta (the affine transform input to the TransformerLayer).

I see.

> The code below is not elegant, and any suggestions on making it simpler would be appreciated.

Well, it seems like the true solution would be defining the grid in pixel space from the beginning. But the question is how to go about the translation in Theta then. If we define this in pixel space as well, it will require values of a much larger magnitude than the scaling parameters, which is potentially bad for optimization. If we leave it in [-1,1] space, the horizontal and vertical translation are defined proportionally to the image size, so for a rectangular image, a translation of [0.1,0.1] will move by different amounts in x and y (that's what currently happens). A solution in between would be a space from -1 to 1 for the larger dimension, and a correspondingly smaller coordinate space for the smaller dimension.
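To make the trade-off above concrete, here is a rough NumPy sketch, not a proposed patch. It compares how a translation of [0.1, 0.1] converts to pixels under per-axis normalization versus the "in between" space tied to the larger dimension, and builds a sampling grid directly in pixel space. All function names are hypothetical, and rotating about the image centre is an assumption made for illustration.

```python
import numpy as np

def half_extents(height, width):
    """Pixels covered by one normalized unit along x and y (also the centre pixel)."""
    return np.array([(width - 1) / 2.0, (height - 1) / 2.0])

def shift_per_axis(t, height, width):
    """Per-axis normalization: each axis is scaled by its own size."""
    return np.asarray(t, dtype=float) * half_extents(height, width)

def shift_isotropic(t, height, width):
    """In-between space: both axes share the unit of the larger dimension."""
    return np.asarray(t, dtype=float) * half_extents(height, width).max()

def pixel_space_grid(angle, t, height, width):
    """Rotate a centred pixel grid, then shift it by an isotropic translation."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    centre = half_extents(height, width).reshape(2, 1)
    grid = np.stack([xs.ravel(), ys.ravel()]).astype(float) - centre
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s], [s, c]])
    shift = shift_isotropic(t, height, width).reshape(2, 1)
    return rotation @ grid + centre + shift

height, width = 101, 201
per_axis = shift_per_axis([0.1, 0.1], height, width)    # about [10, 5] pixels
isotropic = shift_isotropic([0.1, 0.1], height, width)  # about [10, 10] pixels
grid = pixel_space_grid(np.pi / 6, [0.1, 0.1], height, width)
step_x = grid[:, 1] - grid[:, 0]      # one output pixel to the right
step_y = grid[:, width] - grid[:, 0]  # one output pixel down
```

In this sketch the two grid steps keep unit length and stay perpendicular, so a rotation remains a rotation for a rectangular image, while the translation parameters stay at a magnitude comparable to the other entries of Theta.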

Another aspect is the downscale argument. If this is non-proportional, what do we want it to do? Probably we'd also want this to be applied before any rotation in Theta, and not afterwards?

In the long term, it would be good to move this functionality into Theano. There are efficient implementations (e.g., in Torch or matconvnet), as well as cuDNN functions which split the implementation into a grid generator (from the transformation matrices) and bilinear sampler (from the grid and input image). These would probably be much faster than what we have now (they just weren't available at the time we added the spatial transformer to Lasagne).
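The split into a grid generator and a bilinear sampler mentioned above could look roughly like the following NumPy sketch. It is only an illustration for a single 2-D image; the function names and signatures are assumptions and do not mirror the Torch, matconvnet, or cuDNN APIs. It uses the [-1, 1] to [0, size - 1] convention and clips at the borders.

```python
import numpy as np

def affine_grid(theta, out_height, out_width):
    """Grid generator: push a regular [-1, 1] grid through a 2x3 affine matrix."""
    xs = np.linspace(-1.0, 1.0, out_width)
    ys = np.linspace(-1.0, 1.0, out_height)
    x_t, y_t = np.meshgrid(xs, ys)  # both of shape (out_height, out_width)
    grid = np.stack([x_t.ravel(), y_t.ravel(), np.ones(x_t.size)])  # (3, N)
    return np.asarray(theta, dtype=float) @ grid  # (2, N): rows are x and y

def bilinear_sample(image, coords, out_shape):
    """Sampler: read a 2-D image at normalized (x, y) coordinates."""
    height, width = image.shape
    # Map [-1, 1] to [0, size - 1] and clip to the image bounds.
    x = np.clip((coords[0] + 1.0) * (width - 1) / 2.0, 0, width - 1)
    y = np.clip((coords[1] + 1.0) * (height - 1) / 2.0, 0, height - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, width - 1)
    y1 = np.minimum(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return ((1 - wy) * top + wy * bottom).reshape(out_shape)

image = np.arange(12.0).reshape(3, 4)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
resampled = bilinear_sample(image, affine_grid(identity, 3, 4), (3, 4))
```

With the identity transform the sampler reproduces the input image, which is a convenient sanity check before swapping either half for a faster implementation.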

Whatever we decide, we should try to take care that (a) existing networks can be run with future versions of Lasagne and (b) we can use existing implementations to replace the current vanilla Theano implementation in the future.

Would you be interested in bringing this to Theano, or at least have a look into how these existing implementations handle non-square inputs? Do they scale the image after Theta or before?

v-ablavsky commented 7 years ago

Thanks for articulating your strategy regarding fixing/improving the grid-generation code.

I looked at matconvnet and got some idea of the interplay between .m (grid generation) and .cu (bilinear sampling), but need to look at the code a bit more.

I am very curious to see what support Theano has for these types of geometric computations, and I would be, in theory, interested in contributing. There are two caveats, however: (a) my timeline may or may not fit with Lasagne release schedules and (b) I'd need further discussions with Theano developer(s) (e.g., you) to quickly focus on and understand the necessary machinery.

Shall we continue via direct e-mail?

On Thu, Dec 15, 2016 at 8:17 AM, Jan Schlüter notifications@github.com wrote:

In the long term, it would be good to move this functionality into Theano. There are efficient implementations (e.g., in Torch https://github.com/qassemoquab/stnbhwd/tree/master/ or matconvnet https://github.com/vlfeat/matconvnet/blob/master/matlab/src/bits/impl/bilinearsampler_gpu.cu), as well as cuDNN functions which split the implementation into a grid generator (from the transformation matrices) and bilinear sampler (from the grid and input image). These would probably be much faster than what we have now (they just weren't available at the time we added the spatial transformer to Lasagne).

f0k commented 7 years ago

> I looked at matconvnet and got some idea of the interplay between .m (grid generation) and .cu (bilinear sampling), but need to look at the code a bit more.

Maybe the Torch code is easier to follow, or the API description in cuDNN (when you download cuDNN, you can also download the user reference with it).

> I am very curious to see what support Theano has for these types of geometric computations, and I would be, in theory, interested in contributing.

I think there is no explicit support yet: what we have in Lasagne was built from generic tensor operations already available in Theano. They will not be as efficient as a direct implementation of CUDA kernels.

> (a) my timeline may or may not fit with Lasagne release schedules

There is no timeline for Lasagne; we have a set of issues to address before releasing 0.2 and will release it whenever these issues are resolved. It will go much faster if more users start contributing!

> (b) I'd need further discussions with Theano developer(s) (e.g., you) to quickly focus on and understand the necessary machinery.

I'll be glad to help, and the Theano developer team probably as well. Let me know when you're ready to start, and we can set up a Theano issue for this.

> Shall we continue via direct e-mail?

I think it may be valuable for others to just discuss here! But if you feel uncomfortable, feel free to drop me an email instead (you can find it via my homepage linked from my github profile).