Lasagne / Lasagne

Lightweight library to build and train neural networks in Theano
http://lasagne.readthedocs.org/

Batch Normalization #141

Closed skaae closed 8 years ago

skaae commented 9 years ago

Moved from https://github.com/benanne/Lasagne/issues/133

diogo149 commented 9 years ago

Something relevant that I referenced in https://github.com/benanne/Lasagne/issues/136: "If batch normalization mean/variance counts as a param, then SGD update code won't work with it. If it doesn't count as a param, users relying on get_all_params for saving model state will have their code break (because the mean/variance won't be saved)."

ebattenberg commented 9 years ago

Something relevant that I referenced in #136: "If batch normalization mean/variance counts as a param, then SGD update code won't work with it. If it doesn't count as a param, users relying on get_all_params for saving model state will have their code break (because the mean/variance won't be saved)."

I thought the mean/variance were computed for each batch, so they would be intermediate variables and not shared variables/params. At test time though, they become parameters, but not learned parameters, more like arguments to the function computing the outputs.

The beta and gamma parameters (from the paper) that transform the normalized outputs are learned parameters though.

diogo149 commented 9 years ago

I did mean at test time, sorry

ebattenberg commented 9 years ago

I think that the mean/variance of each unit across the entire training set are pretty specialized parameters in that they have to be computed iteratively layer-by-layer before test time and then communicated to the network at test time. Because this is different than what is done during training, I'm not sure it makes sense to support this case in the general model serialization code (unless you want to write your own version of model serialization that appends the mean/variance to the serialized model). Though I haven't dealt with an actual implementation of Batch Normalization yet.

benanne commented 9 years ago

That's true, but you also want to get a validation estimate during training so you can track progress. The way they propose to do this in the paper is by using a moving average of the per-batch estimates from training (I'm guessing they mean an exponential moving average which can be implemented more memory-efficiently). But I suppose that's not the only way to do it.

If we do decide to add support code for either of these things (validation estimation, test time estimation), I definitely think it should be separate from our get_params infrastructure, which is for learnable parameters only.

I guess this shows that get_params is a bit overloaded: it is used both for getting the learnable parameters of a layer, and for getting its state. It seems like those two things do not necessarily overlap completely in all cases. But I don't think this is a strong enough use case to split it up into two separate methods, since they do actually overlap for almost all other use cases.

diogo149 commented 9 years ago

I agree. I was thinking about this more, and it seems possible to solve this by providing an alternate route for parameter serialization (maybe implementing __getstate__ and __setstate__ to not serialize any theano variables, which would allow serialization of entire models without using get_all_params, and would generally make models easier to read and write).
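A rough sketch of that idea (a hypothetical mixin, not Lasagne code), assuming a layer keeps its Theano shared variables in a self.params list: parameter values are pickled as plain NumPy arrays and the shared variables are rebuilt on load.

    import theano

    class PickleWithoutTheanoMixin(object):
        # Hypothetical: assumes the layer stores its shared variables
        # in a list attribute called self.params.
        def __getstate__(self):
            state = self.__dict__.copy()
            # replace shared variables by their plain NumPy values
            state['params'] = [p.get_value() for p in self.params]
            return state

        def __setstate__(self, state):
            # rebuild the shared variables from the stored values
            state['params'] = [theano.shared(v) for v in state['params']]
            self.__dict__.update(state)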

benanne commented 9 years ago

If it's easy to implement those things in the base class and we don't have to require users to implement them for their own Layer subclasses, we can look into it. Otherwise I would consider this a serious increase in complexity for the use case of implementing your own layer, which I definitely think we should avoid.

takacsg84 commented 9 years ago

Have you guys had any success with this? I gave it a try, and it truly does wonders with the training. It converges really, really quickly, but I'm having problems with the inference part. The variance of the batch input means is just too big, so my stored averages don't normalize correctly. If I use (local) batch normalization for inference as well, I more or less get back the training errors, but I don't like the fact that the inference is then dependent on the ordering of the test/validation data.

ebenolson commented 9 years ago

@takacsg84 can you share your code? I've started doing an implementation of this, but so far only doing normalization (not learning beta/gamma).

takacsg84 commented 9 years ago

@ebenolson sure! https://github.com/takacsg84/Lasagne/blob/d5545988e6484d1db4bb54bcfa541ba62e898829/lasagne/layers/bn2.py During training, some additional steps have to be considered. First I get a list of the layers that have pre_train and post_train methods:

        # get layers that need pre_train
        pre_train_layers = []
        for l in self.get_all_layers():
            if hasattr(l, 'pre_train'):
                pre_train_layers.append(l)

        # get layers that need post_train
        post_train_layers = []
        for l in self.get_all_layers():
            if hasattr(l, 'post_train'):
                post_train_layers.append(l)

Later I call these methods before and after every full iteration, so I maintain the averages for inference:

            for l in pre_train_layers:
                l.pre_train()

            for Xb, yb in self.batch_iterator_train(X_train, y_train):
                batch_train_loss = self.train_iter_(Xb, yb)
                train_losses.append(batch_train_loss)

            for l in post_train_layers:
                l.post_train()

ebenolson commented 9 years ago

thanks @takacsg84. When I read the paper before I completely missed the part that said to normalize right before the nonlinearity.

What dataset/network are you testing with? I modified your code a bit to do the moving average inside get_output, and it seems to be working ok for me on the MNIST example. I think there may still be a problematic interaction with dropout though.

takacsg84 commented 9 years ago

@ebenolson

When I read the paper before I completely missed the part that said to normalize right before the nonlinearity.

Also, don't forget to remove the nonlinearity from the previous layers, as well as the bias parameters (which are redundant).

I modified your code a bit to do the moving average inside get_output

So you changed the way the means are calculated or just moved it to get_output, so it is only calculated when you actually want inference?

I think there may still be a problematic interaction with dropout though.

I agree! High dropout could mess up things a bit, but I tested my network w/o dropout, just to be sure.

ebenolson commented 9 years ago

So you changed the way the means are calculated or just moved it to get_output, so it is only calculated when you actually want inference?

I update the moving average each time get_output is called with deterministic=False

my changes and test script are in https://github.com/ebenolson/Lasagne/commit/5c0389671d9f9bf2e19a02e4bcfee554659b9e57

ebattenberg commented 9 years ago

Have you guys had any success with this? I gave it a try, and it truly does wonders with the training. It converges really, really quickly, but I'm having problems with the inference part. The variance of the batch input means is just too big, so my stored averages don't normalize correctly.

At test time, if you're computing the means at every layer independently for each batch, then there could be quite a bit of variation (especially in higher layers) between mini-batches. I'm not sure what the paper says about this, but I was assuming that in order to get test time estimates of the mean and variance of each unit, you'd want to do something like what I suggested in https://github.com/benanne/Lasagne/issues/141#issuecomment-75175029.

I think that the mean/variance of each unit across the entire training set are pretty specialized parameters in that they have to be computed iteratively layer-by-layer before test time and then communicated to the network at test time.

In this approach, you'd compute the mean/variance of the first layer for every mini-batch and then average those. Then use the average mean/variance across mini-batches to normalize the first layer for every mini-batch in order to compute the second layer. Then gather statistics for the second layer, and so on. So it's kind of this iterative thing where the output of a layer requires statistics from the previous layer from every mini-batch. Make sense?
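A rough pseudocode sketch of this layer-by-layer procedure (all names here are hypothetical, not an existing Lasagne API):

    import numpy as np

    # fix the inference statistics of one BN layer at a time, from input to output
    for layer in bn_layers_in_forward_order:
        batch_means, batch_vars = [], []
        for Xb in training_minibatches:
            # forward pass up to this layer, normalizing all earlier BN layers
            # with the statistics fixed in previous iterations of the outer loop
            h = forward_up_to(layer, Xb)
            batch_means.append(h.mean(axis=0))
            batch_vars.append(h.var(axis=0))
        # average the mini-batch statistics and freeze them for this layer
        # before moving on to the next one
        layer.inference_mean = np.mean(batch_means, axis=0)
        layer.inference_var = np.mean(batch_vars, axis=0)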

Not sure if this is what they communicated in the paper, but I think it's probably the right thing to do at test time. I'm not really sure about the moving average suggestion for validation during training. Is that a moving average across validation mini-batches? If so, how do you normalize the first validation mini-batch? Or do you keep stats from the previous epoch (in which case the network could have changed quite a bit since)? Or are they suggesting to interleave validation mini-batches with training mini-batches so that the moving-average is a bit smoother?

This is all a bit confusing since the output of a layer depends on the statistics used to normalize the previous layer which is affected by both the data used to compute the statistics and the changing values of the weights.

takacsg84 commented 9 years ago

@ebattenberg You are totally right about the hierarchical calculation of batch statistics! The paper has no details on this, but what you suggested is the logical approach. I'll implement this version as well, run some tests, and let you know about the results!

ebenolson commented 9 years ago

I think they actually do propose a layer-by-layer calculation of the statistics for inference in Algorithm 2. It seems like they only use examples from the training set to determine the normalization, not validation data.

The simple moving average calculation I was doing worked ok on the MNIST example, but failed badly on a more complicated test, probably because the normalization is changing more. Replacing it with an exponential moving average improved things, but so far I've gotten the best results just from using the validation mini-batch statistics (non-deterministic).

aloisg commented 9 years ago

I'm pretty sure there is a technical limit to this, or maybe my idea is wrong, but wouldn't it be possible to evaluate the validation score by processing the validation set as one batch (which would have the size of the whole set)? What am I missing?

benanne commented 9 years ago

If your validation or test set is small enough so you won't run out of GPU memory, go for it :) I guess technically some people would consider this cheating because you're calculating statistics on the validation / test set (technically you should calculate the means and variances on the train set and use those).

Unfortunately most validation sets are not small enough for this, in my experience.

ebattenberg commented 9 years ago

A couple thoughts:

Regarding Algorithm 2...

I think they actually do propose a layer-by-layer calculation of the statistics for inference in Algorithm 2. It seems like they only use examples from the training set to determine the normalization, not validation data.

I'm not sure I see this in Algorithm 2. In the loop starting at line 8, it looks like they're processing a single hidden node at a time. I guess if you processed the nodes in order from input to output, you'd effectively be doing this layer-by-layer, but they don't specify this.

Regarding model serialization...

I agree. I was thinking about this more, and it seems possible to solve this by providing an alternate route for parameter serialization (maybe implementing __getstate__ and __setstate__ to not serialize any theano variables, which would allow serialization of entire models without using get_all_params, and would generally make models easier to read and write).

An easy way to solve this serialization problem for a model that would only be used for inference (not further training) in the future, would be to compose the normalization with the beta/gamma linear transformation as they suggest at the end of 3.1. Then you'd just replace beta/gamma (which are Lasagne model parameters) with the composed versions, and the inference model would serialize just fine. But you'd have to keep the mean/var info around for further training in order to undo the composition.
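For concreteness, a small sketch of that composition in plain NumPy (mean, var, gamma, beta, and eps are assumed to be the stored statistics and learned parameters), starting from y = gamma * (x - mean) / sqrt(var + eps) + beta:

    import numpy as np

    std = np.sqrt(var + eps)
    gamma_folded = gamma / std               # composed scale
    beta_folded = beta - gamma * mean / std  # composed shift
    # at inference time the layer reduces to a single affine transform:
    # y = gamma_folded * x + beta_folded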

f0k commented 9 years ago

Having finally read the paper myself, I think the batch normalization layer should be structured like this:

During training, mean and std should be updated using a moving average so the network does something useful when computing the validation error (which is usually based on the network outputs with deterministic=False). We should provide some helper function to get these updates. Unluckily, this requires the inputs to all batch normalization layers, which means we must compute them along with the network outputs we use for the cost function, which in turn requires #104 to be solved.

After training, mean and std should be recomputed in the layer-wise fashion described by Eric. This falls outside the scope of Lasagne, but again, we could try to provide some help with that. I think the gains of collapsing gamma with mean / beta with std will be negligible if Theano can be convinced to add / multiply the vectors together before applying them to the input minibatch, so I would avoid that to ensure models can be easily loaded and trained on as if nothing ever happened.

Regarding serialization, we should think again about extending or changing get_params() and get_bias_params() a bit. With this use case, we have three overlapping classes of parameters:

  1. parameters involved in the forward pass (currently get_params() according to its docstring)
  2. parameters to be updated wrt. the loss function (currently get_params() according to its usage in Lasagne)
  3. parameters to be updated wrt. the loss function that are not to be included in weight regularization (currently get_bias_params())

From the documentation it is not even entirely clear if get_params() returns the first or second category. I'll open up an issue for that (or try to find an existing one, I remember having discussed get_params() with Sander before).
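For illustration, a sketch of how such tags could separate the three categories, assuming a tag-based get_params with trainable and regularizable tags (as in the get_all_params(..., trainable=True) call that appears later in this thread); output_layer stands for the last layer of some network:

    import lasagne

    # 2. parameters to be updated w.r.t. the loss function
    trainable = lasagne.layers.get_all_params(output_layer, trainable=True)

    # 3. trainable parameters excluded from weight regularization
    #    (biases, and beta/gamma in batch normalization)
    unregularized = lasagne.layers.get_all_params(output_layer, trainable=True,
                                                  regularizable=False)

    # 1. everything registered on the layers, e.g. for serialization
    #    (this is where a stored mean/std would also have to show up)
    all_values = lasagne.layers.get_all_param_values(output_layer)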

benanne commented 9 years ago

Thanks for the detailed analysis! I guess we'll have to tackle the API issue first before continuing with this.

f0k commented 9 years ago

During training, mean and std should be updated using a moving average so the network does something useful when computing the validation error (which is usually based on the network outputs with deterministic=False). We should provide some helper function to get these updates. Unluckily, this requires the inputs to all batch normalization layers, which means we must compute them along with the network outputs we use for the cost function, which in turn requires #104 to be solved.

It turns out that Theano has some additional functionality for specifying updates: If a shared variable in a graph has a default_update attribute, a corresponding update rule will be created on compilation with theano.function() (unless said shared variable is included in the explicit updates dictionary or no_default_updates=True is given to theano.function()). With some code on the edge between clever and hacky this can be used to incorporate updates to the std and mean variables in the graph produced with deterministic=False, and not include these updates in the graph produced with deterministic=True.
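As a minimal, simplified illustration of that Theano mechanism (not the code from the gist; the zero-weighted term is the part on the edge between clever and hacky that pulls the shared variable into the training graph so its default update gets compiled in):

    import numpy as np
    import theano
    import theano.tensor as T

    alpha = 0.5
    x = T.matrix('x')  # assumed to have 10 columns here
    running_mean = theano.shared(np.zeros(10, dtype=theano.config.floatX),
                                 name='running_mean')
    batch_mean = x.mean(axis=0)

    # any compiled function whose graph contains running_mean will apply this
    # update automatically, unless no_default_updates=True is passed
    running_mean.default_update = ((1 - alpha) * running_mean +
                                   alpha * batch_mean)

    # training-time expression: uses the batch mean, but sneaks running_mean
    # into the graph (the zero term is optimized away) so it gets updated
    train_out = x - batch_mean + 0 * running_mean
    f_train = theano.function([x], train_out)

    # test-time expression: uses the running mean; suppress the default update
    # here so inference does not modify the stored statistics
    f_test = theano.function([x], x - running_mean, no_default_updates=True)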

I think the gains of collapsing gamma with mean / beta with std will be negligible if Theano can be convinced to add / multiply the vectors together before applying them to the input minibatch, so I would avoid that to ensure models can be easily loaded and trained on as if nothing ever happened.

Unfortunately, it turns out they use gamma and beta in the wrong order for them to be collapsed easily. Their formula is: ((input - mean) / std) * gamma + beta. We can rewrite this to (input - mean) * (gamma / std) + beta so at least gamma and std are merged before applying them to a higher-order tensor, but combining mean and beta would require some rescaling of beta, or changing the learned linear transform to (x + beta) * gamma. So maybe there would be some non-negligible performance gains from manually merging these four parameters into two parameters after training (losing the ability to continue training on them).


Anyway, I'm posting my implementation in a gist: https://gist.github.com/f0k/f1a6bd3c8585c400c190 It doesn't include mean and std if you only serialize get_all_param_values(), and it doesn't include any code for the layer-wise estimation of mean and std after training, but it can be easily plugged into your architecture by just wrapping every relevant layer in a batch_norm() call and then it should do The Right Thing.

benanne commented 9 years ago

The default_update trick is pretty cool, I didn't know that existed. On the one hand there's a bit of "magic" involved, on the other hand it makes the code look a lot cleaner if this stuff is handled behind the scenes. I'm not 100% sure yet how I feel about using this feature in Lasagne.

I like your proposed implementation, but before adding it to Lasagne I would suggest we deal with #164, so we can incorporate this in it.

f0k commented 9 years ago

I like your proposed implementation

Me too, but I think we shouldn't do this in Lasagne, it's just an implementation for people to use until we have something clean (that depends on #104 and #164).

shaih82 commented 9 years ago

I never quite figured out whether the mean is computed per pixel, which is similar to a "mean image", or per channel, which is an RGB mean value. I just don't know which is better...

f0k commented 9 years ago

I never quite figured out whether the mean is computed per pixel, which is similar to a "mean image", or per channel, which is an RGB mean value.

From the paper, they compute a shared mean and std. dev. per feature map (similar to how you usually have a shared bias per feature map). I.e., if the output of your convolutional layer is (64, 32, 200, 100) (32 channels of 200x100, with batch size 64), you would have a mean vector of (1, 32, 1, 1) obtained by averaging over all dimensions except for the second. That's what my code does.
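As a quick Theano illustration of those axes (shapes as in the example above):

    import theano.tensor as T

    x = T.tensor4('x')  # (64, 32, 200, 100): batch, channels, rows, columns
    mean = x.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, 32, 1, 1)
    std = x.std(axis=(0, 2, 3), keepdims=True)    # one value per feature map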

boptimism commented 9 years ago

I haven't quite figured out the inference part of that paper. The key, I think, is to compute the moving average of the means and variances.

We use the unbiased variance estimate Var[x] = m/(m−1) · E_B[σ_B²], where the expectation is over training mini-batches of size m and σ_B² are their sample variances.

But does this process require all the previous epochs? For example, in my training of BN on MNIST, each epoch takes 200 mini-batches. After 2 epochs and 75 mini-batches, I call a stop to the training and start to infer on the test set. So when I compute the population-wise means and variances, do I use all 475 mini-batches? I got an oscillatory performance of BN. I think the underlying assumption in using mini-batch samples to reproduce the population statistics is that the batches should be independent of each other. This gets violated if one includes all the mini-batches from previous epochs.

Also, the code doesn't seem efficient to me when it computes the means and variances after parameters gamma, beta, and weights are fixed.

N_BN^inf ← N_BN^tr   // Inference BN network with frozen parameters
Process multiple training mini-batches B, each of size m, and average over them:

If all the previous epochs need to be included, computing the means and variances for inference is not a pleasant job. And each mini-batch training step can't provide any help in determining the means and variances for inference.

The x-axis is NOT epochs. An epoch completes at every interval of 10, i.e., 9, 19, 29, ... indicate that an epoch has been completed. The black line links all the points corresponding to completed epochs.

[screenshot: screen shot 2015-03-30 at 8 19 56 pm]

boptimism commented 9 years ago

OK, I think once the training is finished, only the training set is used for computing the mean/variance that will be used by the inference step. This naturally kills the oscillation observed in my previous implementation. New result:

[screenshot: screen shot 2015-03-30 at 8 43 56 pm]

f0k commented 9 years ago

The key, I think, is to compute the moving average of the means and variance.

In my initial experiments, I also got good results with just using the (exponential) moving average of means and standard deviations, with alpha=0.5 as in my code.

So when I compute the population-wise means and variances, do I use all 475 mini-batches?

No, you would freeze gamma and beta (and everything else in the network), then iterate once over your training set (or as much of it as you can afford) in mini-batches of the same size you used before, computing the mean of all mini-batch means and the mean of all mini-batch standard deviations. You replace the exponential average of means and variances by these true averages and hope that it improves inference. In my case it made things worse, which could have a number of reasons including a bug in my code. I'm postponing further investigations until #164 is done.
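A short sketch of that replacement step (all names are hypothetical: iterate_minibatches yields training mini-batches, batch_stats is a compiled Theano function returning the layer input's batch mean and std under the frozen network, and bn_layer.mean / bn_layer.std are the stored shared variables):

    import numpy as np

    means, stds = [], []
    for Xb, yb in iterate_minibatches(X_train, y_train, batchsize=100):
        m, s = batch_stats(Xb)
        means.append(m)
        stds.append(s)
    # overwrite the exponential running averages with the plain averages
    bn_layer.mean.set_value(np.mean(means, axis=0))
    bn_layer.std.set_value(np.mean(stds, axis=0))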

f0k commented 9 years ago

This way naturally kills the oscillation observed in my previous implementation.

You shouldn't have oscillations in either case. In your plot, performance drops in every tenth instance. Are you resetting something in between there? The idea of the exponential moving average is to strike a balance between (a) using information of many mini-batches (small alpha) and (b) using information that fit the current network weights (large alpha). It's just meant for monitoring progress, though, the "correct" way is to compute the averages after training with the final set of network weights.

boptimism commented 9 years ago

Nope, I didn't reset anything in between the tenth instances. I have also been banging my head all day trying to figure out where the oscillations come from. It is not clear to me whether previous epochs should be included in the calculation of the means and variances that are going to be used in the inference stage. I would really appreciate it if someone could elaborate on this.

f0k commented 9 years ago

I would really appreciate if someone could elaborate on this.

Didn't I elaborate enough? You want the means and std devs (let's call them "normalization constants") for inference to closely mimic the normalization constants used for training. In training, you use the actual mini-batch mean and std dev, so it's adapted to the current mini-batch and the current network weights. For inference, you cannot use the actual mini-batch mean and std dev (*), so you use the expected value over your training data, with fixed network weights. For validation during training, it would be expensive to compute the expected value over your training data with fixed network weights, so instead you use a moving average. A moving average takes into account a fixed number of past items (here: mini-batches), not all past items up to now. The more you take into account, the better the estimate because you're using multiple mini-batches, but also the worse the estimate because the earlier ones may have had very different network weights (as a few weight updates happened in between). The exponential moving average is a weighted moving average that emphasizes recent mini-batches over older mini-batches, and that can be implemented much more memory-efficiently (that's the main reason for using it).
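In formulas, the exponential moving average only keeps the current estimate around:

    # one update per mini-batch; alpha balances the two effects described above
    running_mean = (1 - alpha) * running_mean + alpha * batch_mean
    # a plain moving average over the last k mini-batches would instead
    # require storing all k batch means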

It is not clear to me whether previous epochs should be included in the calculation of means and vars that are going to be used in inference stage.

Hopefully it's clear now that this question is not really valid. For the inference stage at test time, you don't care about anything that happened during training. For the inference stage at validation time, you include a fixed number of past mini-batches, and you don't care about epochs.

(*): You could, but then your predictions for a particular data point would depend on the other data points you include in its mini-batch, and you usually don't want that.

boptimism commented 9 years ago

The more you take into account, the better the estimate because you're using multiple mini-batches, but the worse the estimate because the earlier ones may have had very different network weights (as a few weight updates happened in between).

Wait, please check Algorithm 2 in the original paper. The mini-batches' contributions to the means and variances are not accumulated along the way during training. Rather, it is only once training has finished (hence the "frozen parameters" in the original paper) that the parameters are used to go back over all the mini-batches. Therefore, for step 10 in Algorithm 2, none of the parameters change, including the weights. I think the workflow is like this:

  1. Train BN, obtaining the weights, gamma, and beta.
  2. Freeze all these parameters; go back over ALL the mini-batches used for training; use the frozen parameters to calculate the expected means and unbiased variances using a moving average / exponential moving average. (a)
  3. "Normalize" the test data at each layer with the newly computed means and variances.

(a) Since the final weights/gamma/beta are used to compute ALL the previous mini-batches' means and variances, I can't see a memory-efficient way to do it. :confused:

benanne commented 9 years ago

No, I'm pretty sure that's wrong. The moving average / exponential moving average is only used for validation DURING training. It makes sense there because the model parameters are constantly changing, so the true value of the means and variances would change over time as well. The moving average just allows you to make a somewhat stable estimate of these without having to do too much computation.

At test time, why would you compute a moving average if you're going to go through all the data anyway? That seems wasteful. Just compute the 'regular' mean and variance. These estimates will be very stable because they are computed on a lot of data (i.e. the entire training set or at least a sizable portion of it). I'm pretty sure that's what the authors of the paper did as well.

boptimism commented 9 years ago

@benanne , please check the paper Algorithm 2, step 7-11.

At test time, why would you compute a moving average if you're going to go through all the data anyway?

At inference stage, the author did mention:

Process multiple training mini-batches B, each of size m, and average over them

I was very confused when I first read the paper regarding this step... I guess the logic may be that the variances and means from the training set provide a good reference for the distribution of the test set? I also implemented an alternative: when the test data set size is larger than 1, forget the training set and just do BN on the test set at each layer, while using the learned weights/gamma/beta to make predictions. This works OK on MNIST.

f0k commented 9 years ago

paper Algorithm 2, step 7-11

This is for post-training, i.e., for computing the test-time parameters. As Sander said, a moving average would be silly here. A moving average is only used for train-time validation. In the paper: "Using moving averages instead, we can track the accuracy of a model as it trains."

Process multiple training mini-batches B, each of size m, and average over them

This is a plain old average, not a rolling average.

I guess the logic may be the variance and means from training sets provides good reference for the distribution of the test sets?

That's a basic assumption you make in machine learning. You assume that your training set follows a similar distribution to your test data, in one way or another.

This works OK on MNIST.

And it has the caveat I've mentioned above.

boptimism commented 9 years ago

@f0k Thanks for your effort. I agree with what you said about the role of moving average. I guess my real question lies here:

This is for post-training, i.e., for computing the test-time parameters.

Test-time parameters include population statistics - means and variance from that plain old average. What's the proper way to compute it? It seems to me the authors of that paper use frozen parameters to go through all training mini-batches {B} to obtain the test-time parameters. This part still bothers me.

  1. Correlations among training mini-batches. For example, say I used 3 mini-batches to finish training the network, with the first 2 mini-batches forming 1 epoch. Batch 3 and batch 1 are then correlated. Will this have an effect on the approximation in the paper:

    Var[x] = m/(m−1) · E_B[σ_B²]

  2. I still don't quite understand the caveat you mentioned. The BN-trained network is good at handling data distributions with mean 0 and variance 1. What is inappropriate about normalizing the test set to follow the same distribution? The process of transforming x -> x_hat depends solely on the test set itself. Of course, when there is only 1 test sample, the story will be different.

Thanks for your time and patience.

f0k commented 9 years ago

It seems to me the authors of that paper use frozen parameters to go through all training mini-batches {B} to obtain the test-time parameters. This part still bothers me.

In the paper, they write: "Process multiple training mini-batches \mathcal{B}". They leave it open how many mini-batches you use. If you can afford it, use all training mini-batches -- i.e., one full epoch over your training data -- otherwise, use a representative subset. Of course you shouldn't use the same mini-batch multiple times just because it was presented multiple times in training, that doesn't help your average at all.

Of course, when there is only 1 test sample, story will be different.

Exactly. What you want is a deterministic mapping from input data to output data. If I have three samples A, B, and C, I want pred([A]) + pred([B]) + pred([C]) == pred([A,B]) + pred([C]) == pred([A,B,C]). Note that you don't always have a fixed test set as in MNIST, but you want to build a machine that you can apply to unseen data.

boptimism commented 9 years ago

Thanks @f0k. As for the test-time parameters, I think the authors performed a bootstrap on the training mini-batches. That may help in figuring out how many mini-batches one needs to consider for inference.

cancan101 commented 9 years ago

Any updates on this? Or is @f0k's implementation the way to go?

f0k commented 9 years ago

My implementation needs to be changed a bit to cater for the new get_params API. I already did that, I'll update the gist later. Because of the default_update trick for the running mean/std I wouldn't want to include it in Lasagne in this form, but it could become part of the Lasagne Recipes. We can also try to figure out a nicer way of including batch normalization with running mean/std computation for validation, but that can happen after the first release.

JackKelly commented 9 years ago

Thanks loads for updating your batch norm implementation, @f0k.

When I first ran it, I got this error:

Exception
Traceback (most recent call last):
  File "/homes/dk3810/workspace/python/neuralnilm/scripts/e536.py", line 224, in main
    run_experiment(net, epochs=None)
  File "neuralnilm/experiment.py", line 43, in run_experiment
    net.compile()
  File "neuralnilm/net.py", line 228, in compile
    **self.updates_kwargs)
  File "/homes/dk3810/workspace/python/Lasagne/lasagne/updates.py", line 324, in nesterov_momentum
    updates = sgd(loss_or_grads, params, learning_rate)
  File "/homes/dk3810/workspace/python/Lasagne/lasagne/updates.py", line 134, in sgd
    grads = get_or_compute_grads(loss_or_grads, params)
  File "/homes/dk3810/workspace/python/Lasagne/lasagne/updates.py", line 110, in get_or_compute_grads
    return theano.grad(loss_or_grads, params)
  File "/homes/dk3810/workspace/python/neuralnilm/env/local/lib/python2.7/site-packages/Theano-0.7.0-py2.7.egg/theano/gradient.py", line 529, in grad
    handle_disconnected(elem)
  File "/homes/dk3810/workspace/python/neuralnilm/env/local/lib/python2.7/site-packages/Theano-0.7.0-py2.7.egg/theano/gradient.py", line 516, in handle_disconnected
    raise DisconnectedInputError(message)
DisconnectedInputError: grad method was asked to compute the gradient 
with respect to a variable that is not part of the computational graph of the cost,
or is used only by a non-differentiable operator: mean

I think I've fixed this by passing trainable=True to lasagne.layers.get_all_params.

i.e. I now do this:

all_params = lasagne.layers.get_all_params(output_layer, trainable=True)
updates = lasagne.updates.nesterov_momentum(
    loss_train, all_params, learning_rate, momentum)

Is that the correct approach?!

f0k commented 9 years ago

I think I've fixed this by passing trainable=True to lasagne.layers.get_all_params. [...] Is that the correct approach?!

Yes, that's exactly what's needed. This is actually wrong in the MNIST example, I'll keep that in mind when rewriting it, thanks! The MNIST example still works because all its parameters participate in the forward pass for the training loss, but with batch normalization, some parameters (mean and std) only participate in the forward pass for the test output and will thus trigger the error you got.

JackKelly commented 9 years ago

cool, thanks for the quick reply :)

cancan101 commented 9 years ago

Does this work for convolutional layers such that the beta and gamma are shared for activations in all locations for a given filter?

f0k commented 9 years ago

Does this work for convolutional layers such that the beta and gamma are shared for activations in all locations for a given filter?

Yes, see the documentation of the axes argument: https://gist.github.com/f0k/f1a6bd3c8585c400c190#file-batch_norm-py-L26-L28

Ivanopolo commented 8 years ago

f0k, thank you for the code! I couldn't help but notice that after introducing batch norm layers in between my other layers, the speed of batch processing dropped by almost a factor of two. Why could that be? Or is it supposed to work that slowly?

skaae commented 8 years ago

Will the Batch normalization layer be added to Lasagne?

@f0k : Can you comment on the difference between the Keras implementation (https://github.com/fchollet/keras/blob/master/keras/layers/normalization.py) and your implementation?

It seems that Keras does not use the default update trick but saves the updates with self.update? Could something similar be added to lasagne?

f0k commented 8 years ago

Or it's supposed to work that slowly?

It gets quite a bit slower, yes. I don't see a way to speed it up. The idea is that it would still converge more quickly in terms of wall time, but that depends on your data.

Can you comment on the difference between the Keras implementation and your implementation?

Hmm, they've got two modes: One is using the running batch mean and running batch std deviation (i.e., axis=0) to normalize the features, that's what my implementation does for deterministic=True to be able to compute the validation error. The other mode is doing something that doesn't have anything to do with the paper, it normalizes the last feature dimension (i.e., axis=-1) of each individual training example. They don't have a mode that does what's written in the paper they cite, but the first mode might work similarly well (that would need some experiments).

It seems that Keras does not use the default update trick but saves the updates with self.update?

Well, the only problem with the default update trick is that it will update the running mean and std deviation if and only if running with deterministic=False, assuming that you're doing this for training, and will use the running mean and std deviation instead of the batch-wise mean and std deviation if and only if running with deterministic=True. There's currently no way to say "use and update the running mean and std deviation during training" (the first mode in Keras), and there's no way to say "use the batch-wise mean and std deviation, but don't care about updating the running mean and std deviation".

Could something similar be added to lasagne?

We could add some explicit update mechanism, but that would complicate usage somewhat. It would need to be coupled with layers.get_output() or Layer.get_output_for(), because that's where we have the input expressions for each layer. So we'd either need to change the API to have get_output() return an update dictionary along with the output expression -- that's strongly against our fifth design goal -- or to optionally pass an update dictionary that can be modified by the layers in the get_output() call. I don't see much of an advantage over the current solution of merging the update dictionary with the output expression, though. This, too, could be controlled via some keyword argument to get_output() / get_output_for(), i.e., we could have BatchNormLayer react to some argument like update_norms that overrides the assumption based on deterministic=True or deterministic=False. And it relieves the user of having to obtain a second update dictionary to keep and pass on to theano.function.

So unless we find a use case that doesn't work well with default updates, I guess we could leave the API as it is and merge the batch normalization layer as it is. I'll make a PR some time soon to move the discussion there!

f0k commented 8 years ago

I don't see a way to speed it up.

Well, there'd be one... in the backward pass, the gradient now also flows back through the mean() and std() operations, but one could try treating the batch-wise mean and standard deviation as constants for the backward pass (basically, that's what Keras does when using a shared variable instead of the direct expressions). In the original paper, the authors do propagate back through the batch-wise mean and standard deviation, though (p. 4, after "We use chain rule" [sic]).
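A minimal sketch of that alternative (not what the gist or the paper does), using theano.gradient.disconnected_grad to treat the batch statistics as constants in the backward pass:

    import theano.tensor as T
    from theano.gradient import disconnected_grad

    def batch_norm_expr(x, gamma, beta, eps=1e-4, backprop_through_stats=True):
        # per-feature mean/std over the batch axis
        mean = x.mean(axis=0, keepdims=True)
        std = T.sqrt(x.var(axis=0, keepdims=True) + eps)
        if not backprop_through_stats:
            # gradients no longer flow back through the mean/std computation,
            # only through the direct (x - mean) / std path
            mean = disconnected_grad(mean)
            std = disconnected_grad(std)
        return (x - mean) / std * gamma + beta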