Lasagne / Lasagne

Lightweight library to build and train neural networks in Theano
http://lasagne.readthedocs.org/

Residual learning #531

Open kshmelkov opened 8 years ago

kshmelkov commented 8 years ago

I have just read the MSR report on the winning 152-layer ImageNet network: http://arxiv.org/abs/1512.03385. I suggest discussing how to implement residual blocks flexibly in Lasagne. For those who don't have time to read the paper, the key idea is: y = ReLU(BN(F(x) + x)), where F(x) is a couple of conv layers (in the paper, F(x) = conv(ReLU(BN(conv(x))))). If the dimensions of x and F(x) don't match, x is just projected linearly.

Do we have a flexible way to manage subnetworks?

f0k commented 8 years ago

Lasagne already has everything you need. When building your network stack, you can add a shortcut from two layers before whenever you want:

...
layer = Conv2DLayer(layer, ...)
layer = Conv2DLayer(layer, ...)
layer = ElemwiseSumLayer([layer, layer.input_layer.input_layer])
...

You can write a simple helper function inserting a shortcut layer, or a helper function creating such a stack of two convolutions and a shortcut:

def residual_block(layer, num_filters, filter_size=3, num_layers=2):
    conv = layer
    for _ in range(num_layers):
        conv = Conv2DLayer(conv, num_filters, filter_size, pad='same')
    return ElemwiseSumLayer([conv, layer])

You can even include changing the number of channels or size of the feature maps (the dotted lines in Figure 3 of the paper):

def residual_block(layer, num_filters, filter_size=3, stride=1, num_layers=2):
    conv = layer
    if (num_filters != layer.output_shape[1]) or (stride != 1):
        layer = Conv2DLayer(layer, num_filters, filter_size=1, stride=stride, pad=0, nonlinearity=None, b=None)
    for _ in range(num_layers):
        conv = Conv2DLayer(conv, num_filters, filter_size, pad='same')
    return ElemwiseSumLayer([conv, layer])

Now you can easily define the network of Figure 3 in terms of these blocks:

layer = InputLayer(...)
layer = Conv2DLayer(layer, 64, 7, stride=2, pad='same')
layer = Pool2DLayer(layer, 2)
for _ in range(3):
    layer = residual_block(layer, 64)
layer = residual_block(layer, 128, stride=2)
for _ in range(3):
    layer = residual_block(layer, 128)
layer = residual_block(layer, 256, stride=2)
for _ in range(5):
    layer = residual_block(layer, 256)
layer = residual_block(layer, 512, stride=2)
for _ in range(2):
    layer = residual_block(layer, 512)
# avg pool, then fully-connect

Feel free to post again if you think there's anything missing, otherwise please close the issue! The GitHub issue tracker should be reserved for bug reports and feature discussions. Cheers!

kshmelkov commented 8 years ago

Thank you for the explanation. Probably I didn't really explain my question; rereading it now, I realize that it isn't clear at all. I didn't ask how to implement this particular architecture (that is evident enough). I suggest discussing how to properly abstract a parametrised subnetwork. This is not the first and definitely not the last modular architecture. Maybe we need an explicit concept of a block? Here a macro function works well; for Inception it becomes more tedious. I am aware of the existing implementation in Recipes, and it shows that such a macro function handles parametrisation quite badly.

Another example is recurrent architectures like Conv-GRU. Here a macro function doesn't work because we need to pass a subnetwork inside.

Do you think that a parametrised subnet would be a useful concept for Lasagne?

ebenolson commented 8 years ago

Can you elaborate on the shortcomings of using a macro function to create modules? I don't really see the problem.

f0k commented 8 years ago

Another example is recurrent architectures like Conv-GRU. Here a macro function doesn't work because we need to pass a subnetwork inside.

This is something to be discussed in #425. For this we need a clean way of passing an arbitrary subnetwork to be encapsulated in a scan() function, and I don't know yet what it should look like. Suggestions are welcome.

I'm not sure how it would help for building a standard network such as GoogLeNet, though; macro functions seem fine: https://github.com/Lasagne/Recipes/blob/master/modelzoo/googlenet.py

kshmelkov commented 8 years ago

@ebenolson Parameters become disconnected from layers. Suppose you want to pass two sets of parameters: number of filters and kernel size. Either they are two separate lists like [64, 128, 64, 32] and [1, 3, 5, 7] (even worse if they have different lengths, e.g. once strides are added), or tuples. Both compromises hurt readability a lot.

f0k commented 8 years ago

Both compromises hurt readability a lot.

But what would it look like without a macro function? Can you give an example of what you have in mind? (Anything is fine: an inception module, a residual block, a toy example...)

kshmelkov commented 8 years ago

Actually, I don't have an exact idea in mind of how to decouple parametrisation and network topology. I was thinking about properly dispatching arguments into the corresponding layers' **kwargs (in order to keep related parameters together), but any implementation I can think of is very close to the network specs in nolearn. I remember there were some reasons why that isn't implemented in Lasagne in the first place.

f0k commented 8 years ago

I was thinking about properly dispatching arguments into the corresponding layers' **kwargs

Until proven otherwise, I'm convinced that whatever scheme you can think of is also possible with a macro function -- you don't have to use lists for parameters that hurt readability. I'm very open to a discussion, though!
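
For instance, a macro function can take one dict of keyword arguments per convolution instead of parallel lists, so related parameters stay together (just a toy sketch):

from lasagne.layers import Conv2DLayer

def stacked_convs(layer, block_specs):
    # block_specs keeps each convolution's parameters together, e.g.
    # [dict(num_filters=64, filter_size=1), dict(num_filters=128, filter_size=3)]
    for spec in block_specs:
        layer = Conv2DLayer(layer, pad='same', **spec)
    return layer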

ebenolson commented 8 years ago

Ok, I think I see your point now... If a module/block consists of a group of modifications applied to a (perhaps heavily parametrized) Layer, it's not very clean to have to write a macro function that passes through all the parameters to the Layer constructor. I don't have a solution, but I think it's worth thinking about further.

f0k commented 8 years ago

If a module/block consists of a group of modifications applied to a (perhaps heavily parametrized) Layer, it's not very clean to have to write a macro function that passes through all the parameters to the Layer constructor.

Ah, I see. You could of course stack macro functions for this, with the inner ones pre-parameterized using functools.partial so the outer macro function doesn't have to know anything about the inner macro function's parameters.
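
A rough sketch of what I mean (the helper names conv_stack and repeat_block are made up for illustration):

from functools import partial
from lasagne.layers import Conv2DLayer

def conv_stack(layer, num_filters, filter_size=3, num_layers=2):
    # inner macro: a plain stack of convolutions
    for _ in range(num_layers):
        layer = Conv2DLayer(layer, num_filters, filter_size, pad='same')
    return layer

def repeat_block(layer, block_fn, times):
    # outer macro: only needs a callable mapping layer -> layer
    for _ in range(times):
        layer = block_fn(layer)
    return layer

# pre-parameterize the inner macro so the outer one stays agnostic of its arguments
block64 = partial(conv_stack, num_filters=64, filter_size=3)
# layer = repeat_block(layer, block64, times=3)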

/edit: But we could really try to come up with something more convenient for this. For the recurrent layer container, I was thinking that you'd create the subnetwork beforehand, using InputLayers as placeholders for things provided by the container, and then tell the container which variables are the loose ends. It might be useful to have a notion of a network with input and output variables instead. That would basically be a model class, though, something we've done without so far. And it would not encapsulate a parameterizable subnetwork, but a concrete instance with a concrete parameterization. It seems the recurrent containers have less to do with this than I intuitively thought.
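
To make the placeholder idea concrete, a purely hypothetical sketch (no such container exists in Lasagne yet; RecurrentContainerLayer is an invented name):

from lasagne.layers import InputLayer, DenseLayer, ConcatLayer

num_inputs, num_units = 100, 200
# placeholders for what the (hypothetical) container would feed in at each step
x_t = InputLayer((None, num_inputs))
h_tm1 = InputLayer((None, num_units))
# the step transition itself is built like any other Lasagne subnetwork
h_t = DenseLayer(ConcatLayer([x_t, h_tm1]), num_units)
# the container would then be told which layers are the loose ends, e.g.:
# layer = RecurrentContainerLayer(sequence_input, h_t,
#                                 inputs={x_t: 'sequence', h_tm1: 'recurrent'})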

CheRaissi commented 8 years ago

def residual_block(layer, num_filters, filter_size=3, stride=1, num_layers=2):
    conv = layer
    if (num_filters != layer.output_shape[1]) or (stride != 1):
        layer = Conv2DLayer(layer, num_filters, filter_size=1, stride=stride, pad=0, nonlinearity=None, b=None)
    for _ in range(num_layers):
        conv = Conv2DLayer(conv, num_filters, filter_size, pad='same')
    return ElemwiseSumLayer([conv, layer])

I think there is an error in this code, as the ElemwiseSumLayer() will have two inputs of different shapes. I think the conv = layer should be before the for ... loop. However, with this correction, I am only getting NaNs while training. Any ideas?

f0k commented 8 years ago

I think there is an error in this code, as the ElemwiseSumLayer() will have two inputs of different shapes.

Ah, yes. The first Conv2DLayer in the stack of actual convolutions has to respect the stride.

def residual_block(layer, num_filters, filter_size=3, stride=1, num_layers=2):
    conv = layer
    if (num_filters != layer.output_shape[1]) or (stride != 1):
        layer = Conv2DLayer(layer, num_filters, filter_size=1, stride=stride, pad=0, nonlinearity=None, b=None)
    for _ in range(num_layers):
        conv = Conv2DLayer(conv, num_filters, filter_size, stride=stride, pad='same')
        stride = 1
    return ElemwiseSumLayer([conv, layer])

I think the conv = layer should be before the for ... loop.

No, then the stack of convolutions would be on top of the 1x1 convolution layer meant as the shortcut (the dotted arrow in the figure in the paper).

jonathanstrong commented 8 years ago

FYI, the example above from @f0k does not work. I got it to work by removing the calls to residual_block([...], stride=2) in between each block; I believe the if condition in the function was already triggering the same thing as the residual_block([...], stride=2) call, making it duplicative.

@f0k does this look right - this one actually works:

from lasagne.layers import Conv2DLayer, InputLayer, Pool2DLayer, ElemwiseSumLayer
import lasagne
import theano
import theano.tensor as T

def residual_block(layer, num_filters, filter_size=3, stride=1, num_layers=2):
    conv = layer
    if (num_filters != layer.output_shape[1]) or (stride != 1):
        layer = Conv2DLayer(layer, num_filters, filter_size=1, stride=stride, pad='same', nonlinearity=None, b=None)
        print(lasagne.layers.get_output_shape(layer))
    for _ in range(num_layers):
        conv = Conv2DLayer(conv, num_filters, filter_size, pad='same')
    return ElemwiseSumLayer([conv, layer])

input_var = T.tensor4('inputs')

layer = InputLayer((256, 3, 224, 224), input_var=input_var)
layer = Conv2DLayer(layer, 64, 7, stride=2, pad='same')
layer = Pool2DLayer(layer, 2)
for _ in range(3):
    layer = residual_block(layer, 64)

for _ in range(3):
    layer = residual_block(layer, 128)

for _ in range(5):
    layer = residual_block(layer, 256)

for _ in range(2):
    layer = residual_block(layer, 512)

pikqu commented 8 years ago

Correct me if I am wrong, but from reading the article I understood that

def residual_block(layer, num_filters, filter_size=3, num_layers=2):
    conv = layer
    for _ in range(num_layers):
        conv = Conv2DLayer(conv, num_filters, filter_size, pad='same')
    return ElemwiseSumLayer([conv, layer])

provided by @f0k should be

def residual_block_2(layer, num_filters, filter_size=3, num_layers=2):
    conv = layer
    conv = Conv2DDNNLayer(conv, num_filters, filter_size, pad='same')
    conv = Conv2DDNNLayer(conv, num_filters, filter_size, pad='same',nonlinearity=None)
    es = ElemwiseSumLayer([conv, layer])
    return NonlinearityLayer(es)

f0k commented 8 years ago

FYI, the example above from @f0k does not work.

Disclaimer: I just quickly wrote it down after skimming the paper (Figure 3 and the accompanying description), it was never tested.

does this look right - this one actually works:

No, it doesn't look right; now you're not downsampling at all when increasing the channel count. Did you try the corrected version in https://github.com/Lasagne/Lasagne/issues/531#issuecomment-164776274?

Correct me if i am wrong but by reading the article i understood that [...]

Yes, good catch, I hadn't read the full article. I should have mentioned right away that this was just meant to convey the general idea. So this one should be closer to the truth (but still untested, and missing batch normalization):

def residual_block(layer, num_filters, filter_size=3, stride=1, num_layers=2):
    conv = layer
    if (num_filters != layer.output_shape[1]) or (stride != 1):
        layer = Conv2DLayer(layer, num_filters, filter_size=1, stride=stride, pad=0, nonlinearity=None, b=None)
    for _ in range(num_layers):
        conv = Conv2DLayer(conv, num_filters, filter_size, stride=stride, pad='same')
        stride = 1
    nonlinearity = conv.nonlinearity
    conv.nonlinearity = lasagne.nonlinearities.identity
    return NonlinearityLayer(ElemwiseSumLayer([conv, layer]), nonlinearity)

layer = InputLayer(...)
layer = Conv2DLayer(layer, 64, 7, stride=2, pad='same')
layer = Pool2DLayer(layer, 2)
for _ in range(3):
    layer = residual_block(layer, 64)
layer = residual_block(layer, 128, stride=2)
for _ in range(3):
    layer = residual_block(layer, 128)
layer = residual_block(layer, 256, stride=2)
for _ in range(5):
    layer = residual_block(layer, 256)
layer = residual_block(layer, 512, stride=2)
for _ in range(2):
    layer = residual_block(layer, 512)
# avg pool, then fully-connect
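
As for the missing batch normalization: assuming the batch_norm() convenience function from #467 becomes available as lasagne.layers.batch_norm, each convolution could presumably just be wrapped, e.g. (equally untested):

import lasagne
from lasagne.layers import Conv2DLayer, ElemwiseSumLayer, NonlinearityLayer, batch_norm

def residual_block(layer, num_filters, filter_size=3, stride=1, num_layers=2):
    conv = layer
    if (num_filters != layer.output_shape[1]) or (stride != 1):
        # the projection shortcut could be wrapped in batch_norm() as well
        layer = Conv2DLayer(layer, num_filters, filter_size=1, stride=stride, pad=0, nonlinearity=None, b=None)
    for _ in range(num_layers):
        conv = batch_norm(Conv2DLayer(conv, num_filters, filter_size, stride=stride, pad='same'))
        stride = 1
    nonlinearity = conv.nonlinearity
    conv.nonlinearity = lasagne.nonlinearities.identity
    return NonlinearityLayer(ElemwiseSumLayer([conv, layer]), nonlinearity)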

If anybody has the time to finish this and replicate some of the experiments in the paper, please do submit a PR to our Lasagne/Recipes collection!

jonathanstrong commented 8 years ago

I have the "bottleneck" architecture training successfully on my data and the CIFAR-10 model that matches their parameter estimates already, so could submit to recipes. I think you should check my work though.

here is the core of the bottleneck version:

def bottleneck_block(incoming, num_filters, filter_size, bottleneck_size=None, ConvLayer=lasagne.layers.dnn.Conv2DDNNLayer):
    """
    e.g. incoming with 256 channels -> 64 filters, size (1, 1) -> 64 filters, size (3, 3) -> 256 filters, size (1, 1)
    """
    if bottleneck_size is None:
        bottleneck_size = num_filters / 4
    conv = ConvLayer(incoming, bottleneck_size, 1)
    conv = ConvLayer(conv, bottleneck_size, filter_size, pad='same')
    conv = ConvLayer(conv, num_filters, 1)
    return lasagne.layers.ElemwiseSumLayer([conv, incoming])

In between sets of blocks for a given filter size, you change size with a stride 2, like from 256 to 512:

network = ConvLayer(network, num_filters=512, filter_size=1, stride=2, nonlinearity=None, b=None)

Note, when I first got this up and running with Conv2DLayer it was using CPU on ConvGrad3D. Only after I found a post from Sander about this did I switch to the dnn version and get it going much faster. That's why I put ConvLayer as an option in the function.

f0k commented 8 years ago

I have the "bottleneck" architecture training successfully on my data and the CIFAR-10 model that matches their parameter estimates already, so could submit to recipes. I think you should check my work though.

Sure, your PR will be scrutinized by us before merging :) If you're aiming to reproduce an experiment from the paper, it should go into the papers/ subdirectory (either as a single standalone script, or into a subdirectory if you've got multiple files). It should then reproduce some of the results in the paper (e.g., match the reported accuracy).

here is the core of the bottleneck version:

The last convolution must have nonlinearity=None, and there must be a NonlinearityLayer on top of the ElemwiseSumLayer.
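
In code, those two changes would look roughly like this (an untested sketch of the function above):

def bottleneck_block(incoming, num_filters, filter_size, bottleneck_size=None,
                     ConvLayer=lasagne.layers.dnn.Conv2DDNNLayer):
    if bottleneck_size is None:
        bottleneck_size = num_filters // 4
    conv = ConvLayer(incoming, bottleneck_size, 1)
    conv = ConvLayer(conv, bottleneck_size, filter_size, pad='same')
    conv = ConvLayer(conv, num_filters, 1, nonlinearity=None)  # no ReLU before the sum
    es = lasagne.layers.ElemwiseSumLayer([conv, incoming])
    return lasagne.layers.NonlinearityLayer(es)  # ReLU applied after the sum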

In between sets of blocks for a given filter size, you change size with a stride 2, like from 256 to 512

Still haven't read the paper, but I expect you'd use this as a projection shortcut (the dotted arrows in Figure 3), as in my residual_block(), not just insert it in between the blocks.

Note, when I first got this up and running with Conv2DLayer it was using CPU on ConvGrad3D.

Note that this will be fixed with #524.

auduno commented 8 years ago

@jonathanstrong how long does it take to train the model on CIFAR-10? Would love to see a finished recipe!

JesseBuesking commented 8 years ago

@f0k Your version implements the projection shortcut, or version (B) in the paper (projecting using a 1x1 convolution when stride=2). How would you go about implementing the zero-padded version, version (A)? I'm mostly unsure of what they mean in the paper when they discuss version (A).

benanne commented 8 years ago

As I understood it, they just add a bunch of all-zero feature maps. This basically means that all the additional features in excess of the number of input features are not residual. You should be able to achieve this with lasagne.layers.pad: http://lasagne.readthedocs.org/en/latest/modules/layers/shape.html#lasagne.layers.PadLayer
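
Something along these lines should do it (a rough, untested sketch; identity_shortcut is just an illustrative name):

from lasagne.layers import PadLayer, Pool2DLayer

def identity_shortcut(layer, num_filters, stride=1):
    if stride != 1:
        # parameter-free spatial downsampling: pooling with pool_size 1 and stride 2
        layer = Pool2DLayer(layer, pool_size=1, stride=stride, mode='max')
    extra = num_filters - layer.output_shape[1]
    if extra > 0:
        # append `extra` all-zero feature maps along the channel axis
        layer = PadLayer(layer, width=[(0, extra), (0, 0), (0, 0)], batch_ndim=1)
    return layer

The result would then be fed to the ElemwiseSumLayer together with the convolution stack, in place of the projection shortcut.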

JesseBuesking commented 8 years ago

Also, doing a convolution with stride 2 results in much slower performance. Using Conv2DCCLayer should work around this from what I've read; however, it errors out with

ValueError: stride 2 greater than filter size (1, 1)

Any thoughts on a way to get around this? For now I'm just falling back to max pooling instead of convolutions with stride 2.

ebenolson commented 8 years ago

Conv2DDNNLayer with stride 2 should be fine.

JesseBuesking commented 8 years ago

@ebenolson yep, Conv2DDNNLayer works well, thank you!

@benanne I have yet to try it out, but thank you for pointing me in the right direction!

benanne commented 8 years ago

Do they have a stride 2 1x1 convolution in the paper? Don't they do that only for the 3x3 convolution? I don't know, I haven't checked. But a stride 2 1x1 convolution is weird, you're throwing away 75% of the information in the input then, so why compute it in the first place :)

JesseBuesking commented 8 years ago

I agree, it does sound weird. Based on the code above,

def residual_block(layer, num_filters, filter_size=3, stride=1, num_layers=2):
    conv = layer
    if (num_filters != layer.output_shape[1]) or (stride != 1):
        layer = Conv2DLayer(layer, num_filters, filter_size=1,
            stride=stride, pad=0, nonlinearity=None, b=None)
    ...

this will occur whenever stride > 1, or whenever you're downsampling to a smaller height/width.

Also, if you're doing a bottleneck block (1x1 -> 3x3 -> 1x1, from section 4.1 in the paper), you'll be doing the same. From what I read in the paper, they're applying the stride 2 directly to the 1x1 convolution in the bottleneck, so we'd run into the same issue here as well. I may well be wrong, though.

f0k commented 8 years ago

Based on the code above,

Which is what I derived from what's written on page 4, top right:

When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn.(2) is used to match dimensions (done by 1x1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.

Specifically, that's what option (B) says. Perform a 1x1 convolution for the shortcut if the number of feature maps changes, and use stride 2 if the spatial dimensionality decreases.

But a stride 2 1x1 convolution is weird, you're throwing away 75% of the information in the input then, so why compute it in the first place :)

Because it's only used for the shortcut. A 1x1 convolution of stride 2 is a very naive way of downsampling, but it's just there to provide something for the parallel 3x3 convolutions to learn residuals for (the 3x3 convolutions can still make use of the remaining 75% of the information, so it's not computed for nothing). Figure 3 in http://arxiv.org/abs/1512.03385 depicts this: the dotted arrows are 1x1 convolutions of stride 2 (if you follow option (B) explained on the top right of the same page).

yep, Conv2DDNNLayer works well, thank you!

Again, note that this will be fixed with #524!

auduno commented 8 years ago

I managed to reproduce the results on CIFAR-10 with a 32-layer network (and batch normalization from #467) as described in the paper, so unless @jonathanstrong uploads an example, I can probably do so. My results gave 7.25 accuracy versus the 7.51 accuracy in the paper. I couldn't find a way to create identity layers with strides though (as needed in the shortcut connections with dimension increase), so I used average pooling instead. I think there might be some other differences I haven't figured out as well, since the model in the paper seems to learn faster and more stably than the architecture I tried, though the final accuracy is similar.

benanne commented 8 years ago

Cool! This would be a really great addition to our Recipes :)

stupidZZ commented 8 years ago

@auduno I tried to reproduce the result on CIFAR-10, but unfortunately I failed, so would you mind sharing your code?

bobchennan commented 8 years ago

@stupidZZ Do you mean 7.25 error rate?

I couldn't find a way to create identity layers with strides though (as needed in the shortcut connections with dimension increase), so I used average pooling instead.

If I understand right, they didn't use shortcuts when the dimensions increase (according to the number of shortcuts, 3n). Also, they claimed that only identity functions were used.

I have also done some experiments using bottleneck blocks (3n layers instead of 2n in each block), but the result isn't very close. I didn't implement the preprocessing part:

We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip.

So would you mind sharing your code first?
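
For reference, that augmentation is straightforward to do on the fly with numpy (a rough sketch, not taken from any of the implementations discussed here):

import numpy as np

def augment_batch(images, pad=4, crop=32):
    # pad each side by `pad` pixels, then take a random crop and a random horizontal flip
    batch, channels, height, width = images.shape
    padded = np.pad(images, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant')
    out = np.empty((batch, channels, crop, crop), dtype=images.dtype)
    for i in range(batch):
        y = np.random.randint(0, height + 2 * pad - crop + 1)
        x = np.random.randint(0, width + 2 * pad - crop + 1)
        out[i] = padded[i, :, y:y + crop, x:x + crop]
        if np.random.rand() < 0.5:
            out[i] = out[i, :, :, ::-1]
    return out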

stupidZZ commented 8 years ago

@bobchennan, of course, but we just got 0.11% test error rate on CIFAR-10, and we implemented our resnet in Torch. You can find the code at https://github.com/bgshih/cifar.torch/blob/master/models/resnet.lua.

bobchennan commented 8 years ago

@stupidZZ Thanks a lot! I think there is something different in the output part. The paper mentioned that:

The network ends with a global average pooling, a 10-way fully-connected layer, and softmax.

It was inspired by the paper Network in Network. You can reference the implementation here. So it should look like this (in Python):

net['cccp6'] = ConvLayer(net['cccp5'], num_filters=10, filter_size=1)
net['pool3'] = PoolLayer(net['cccp6'],
                         pool_size=8,
                         mode='average_exc_pad',
                         ignore_border=False)
net['output'] = FlattenLayer(net['pool3'])

I am not quite familiar with Torch, so please correct me if I am wrong.

stupidZZ commented 8 years ago

@bobchennan The way I understand the output part mentioned in the paper is: first use average pooling (with the kernel size equal to the size of the feature map), then add an FC layer and a softmax layer.

Your code removes the FC layer and looks like a fully-convolutional network?

kshmelkov commented 8 years ago

AFAIK, Lasagne has GlobalPoolLayer for NIN-style pooling. No idea why it isn't used in the recipes though; probably it was introduced later.
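
With it, the output head described in the paper could look something like this (a sketch, assuming layer is the last residual block):

from lasagne.layers import GlobalPoolLayer, DenseLayer
from lasagne.nonlinearities import softmax

layer = GlobalPoolLayer(layer)  # global average pooling over each feature map
layer = DenseLayer(layer, num_units=10, nonlinearity=softmax)  # 10-way FC + softmax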

bobchennan commented 8 years ago

@stupidZZ I think the implementation omitted the softmax layer. In the paper Network in Network, they mentioned:

In this paper, we propose another strategy called global average pooling to replace the traditional fully connected layers in CNN. The idea is to generate one feature map for each corresponding category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the softmax layer.

So in the output part, I added a ConvLayer with 10 filters, and then I used the pooling layer as well as a softmax layer.

stupidZZ commented 8 years ago

@bobchennan I don't know which one is correct, but at least my code does not work well on CIFAR-10 :-(

auduno commented 8 years ago

I made a pull request with my code here. Retraining the 32-layer model with a pooling layer with pooling size 1 and stride 2 (instead of average pooling with pooling size 2) to get the identity shortcuts, I managed to get a validation error of 6.88%.

bobchennan commented 8 years ago

Has anyone followed up on this work? They proposed a promising structure which shows good results:

Remarkably, our 32 layer wider model performs similar to a 1001 layer ResNet model.

f0k commented 8 years ago

No, but Wide Residual Networks:

For example, we demonstrate that even a simple 16-layer-deep wide residual network outperforms in accuracy and efficiency all previous deep residual networks, including thousand-layer-deep networks, achieving new state-of-the-art results on CIFAR-10, CIFAR-100 and SVHN.

See here: https://groups.google.com/d/msg/lasagne-users/Jwi00hYdDVs/yPZBy9v-EQAJ