aigamedev / scikit-neuralnetwork

Deep neural networks without the learning cliff! Classifiers and regressors compatible with scikit-learn.

Batch Normalization with sknn MLP #204

Open jimmyroyer opened 8 years ago

jimmyroyer commented 8 years ago

Hello, when I use normalize = "batch" in the mlp classifier, I find that the weights/biases/predictions would be affected when changing the learning_rate or the weight_decay parameters. Is it expected? Thanks again for all your help

alexjc commented 8 years ago

Batch normalization is separate from learning_rate and weight_decay, so those should still affect the results independently of it. Is that what you mean?

jimmyroyer commented 8 years ago

Sorry, it's a typo on my end. I meant that regardless of the learning_rate and weight_decay values, when I use normalize="batch" the predictions won't change. Without batch normalization the predictions are highly sensitive to different learning_rate/weight_decay values, but with batch normalization they become completely independent of learning_rate/weight_decay.
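
Here is a minimal sketch of the check I'm running (random placeholder data, layer sizes and hyper-parameters are just for illustration; only learning_rate differs between the two models, and I pin random_state so both start from the same initial weights):

    # Minimal sketch: two models that differ only in learning_rate, trained on
    # random placeholder data, both with normalize='batch' on the hidden layer.
    import numpy as np
    from sknn.mlp import Classifier, Layer

    X = np.random.rand(200, 10)
    y = (np.random.rand(200) > 0.5).astype(int)

    def make_net(lr):
        return Classifier(
            layers=[Layer("Rectifier", units=20, normalize='batch'),
                    Layer("Softmax")],
            learning_rate=lr, weight_decay=0.001, regularize="L2",
            n_iter=10, random_state=0)

    nn_a = make_net(0.001)
    nn_a.fit(X, y)
    nn_b = make_net(0.1)
    nn_b.fit(X, y)

    # True here means a 100x change in learning_rate made no difference at all.
    print(np.array_equal(nn_a.predict(X), nn_b.predict(X)))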

alexjc commented 8 years ago

It's just wrapping Lasagne's code for batch normalization. How are the overall results?

jimmyroyer commented 8 years ago

I've been testing batch normalization a few times now and I consistently get worse AUCs than with no normalization. I started investigating the sensitivity of the results to the learning_rate and found that with batch normalization the learning_rate doesn't seem to matter at all. That is counter-intuitive to me, because I thought that with normalization we could use a higher learning_rate and converge faster.
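
Roughly, the comparison looks like this (synthetic placeholder data with a simple signal in the first feature; my real runs use my own dataset and a proper validation split):

    # Sketch of the AUC comparison between no normalization and normalize='batch'.
    import numpy as np
    from sklearn.metrics import roc_auc_score
    from sknn.mlp import Classifier, Layer

    rng = np.random.RandomState(0)
    X = rng.rand(1000, 10)
    y = (X[:, 0] + 0.1 * rng.randn(1000) > 0.5).astype(int)
    X_tr, X_te, y_tr, y_te = X[:700], X[700:], y[:700], y[700:]

    def make_net(hidden):
        return Classifier(layers=[hidden, Layer("Softmax")],
                          learning_rate=0.01, n_iter=25, random_state=0)

    for name, hidden in [("no normalization", Layer("Rectifier", units=20)),
                         ("batch norm", Layer("Rectifier", units=20, normalize='batch'))]:
        nn = make_net(hidden)
        nn.fit(X_tr, y_tr)
        proba = nn.predict_proba(X_te)   # sklearn-style probability output
        print(name, roc_auc_score(y_te, proba[:, 1]))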

jimmyroyer commented 8 years ago

I still don't understand why the weights and biases end up exactly identical when I use batch normalization. Regardless of the learning rate or weight decay I use, I always end up with the exact same trained model (same prediction errors, etc.). Any help understanding the issue would be greatly appreciated. Thanks.
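
Concretely, this is how I'm comparing the trained parameters (it pokes at the private nn._backend.mlp attribute, which as far as I can tell is the list of Lasagne layers with the output layer last, so treat it as a diagnostic sketch only):

    # Diagnostic sketch: dump every parameter array after training and compare
    # across two learning rates.
    import numpy as np
    import lasagne
    from sknn.mlp import Classifier, Layer

    X = np.random.rand(200, 10)
    y = (np.random.rand(200) > 0.5).astype(int)

    def trained_params(lr):
        nn = Classifier(
            layers=[Layer("Rectifier", units=20, normalize='batch'),
                    Layer("Softmax")],
            learning_rate=lr, weight_decay=0.001, regularize="L2",
            n_iter=10, random_state=0)
        nn.fit(X, y)
        # walk the whole Lasagne stack from the output layer downwards
        return lasagne.layers.get_all_param_values(nn._backend.mlp[-1])

    a, b = trained_params(0.001), trained_params(0.1)
    print(all(np.array_equal(p, q) for p, q in zip(a, b)))   # identical parameters?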

alexjc commented 8 years ago

If the weights are the same then it must be a bug! I'd start by putting a breakpoint / print statement in the lasagne backend where the BatchNorm layer is created.

jimmyroyer commented 8 years ago

Thanks! You suspect the bug is in lasagne?


jimmyroyer commented 8 years ago

I think I have a little more insight into where the problem might reside. Basically, the backend's cost-function "regularizer" is set to exactly 0.0 when batch normalization is used. That explains, I think, why the weight_decay parameter becomes irrelevant with batch normalization. I will try to dig further to see where the problem comes from, but any additional insight would be really useful.

I did the following exercise. Without batch normalization:

    nn = Classifier(layers=[Layer("Softmax")],
                    learning_rate=0.1, weight_decay=0.001, regularize="L2")

    # inspecting the backend afterwards gives:
    nn._backend.regularize     # 'L2'
    nn._backend.weight_decay   # 0.001
    nn._backend.regularizer    # Elemwise{add,no_inplace}.0

And with batch normalization I get:

    nn = Classifier(layers=[Layer("Softmax", normalize='batch')],
                    learning_rate=0.1, weight_decay=0.001, regularize="L2")

    # inspecting the backend afterwards gives:
    nn._backend.regularize     # 'L2'
    nn._backend.weight_decay   # 0.001
    nn._backend.regularizer    # 0.0
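
For completeness, here is the full snippet behind that exercise (random placeholder data; as far as I can tell the private _backend attribute is only populated once fit() has been called):

    # Full version of the exercise above, on random placeholder data.
    import numpy as np
    from sknn.mlp import Classifier, Layer

    X = np.random.rand(100, 5)
    y = (np.random.rand(100) > 0.5).astype(int)

    def inspect(layers):
        nn = Classifier(layers=layers, learning_rate=0.1,
                        weight_decay=0.001, regularize="L2", n_iter=1)
        nn.fit(X, y)
        print(nn._backend.regularize, nn._backend.weight_decay, nn._backend.regularizer)

    inspect([Layer("Softmax")])                      # regularizer is a Theano expression
    inspect([Layer("Softmax", normalize='batch')])   # regularizer comes out as 0.0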

jimmyroyer commented 8 years ago

Actually, the problem seems to be the following: with batch normalization, nn._backend.mlp.get_params(regularizable=True) returns an empty list (there is nothing to regularize). It might be related to issue #198.

jimmyroyer commented 8 years ago

The precise problem is in sknn/backend/lasagne/mlp.py, line 171:

    network = lasagne.layers.batch_norm(network)

If this function is called, the list returned by network.get_params() is empty. That list is needed (among other places) at line 65:

    self.regularizer = sum(layer_decay[s.name] * apply_regularize(l.get_params(regularizable=True), penalty)

I will try to track down Lasagne's batch_norm function to see what the issue is. Either way, it seems to break the batch normalization feature of sknn.
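
The behaviour is easy to reproduce with Lasagne alone (toy layer sizes): the object returned by batch_norm is a new outer wrapper layer, and asking that wrapper for its own parameters gives nothing, even though the dense layer's weight is still in the stack underneath:

    # Reproducing the behaviour with Lasagne only.
    import lasagne

    net = lasagne.layers.InputLayer(shape=(None, 10))
    net = lasagne.layers.DenseLayer(net, num_units=5)
    wrapped = lasagne.layers.batch_norm(net)

    # The outer wrapper reports no regularizable parameters of its own...
    print(wrapped.get_params(regularizable=True))                       # []

    # ...even though walking the whole stack still finds the dense layer's W.
    print(lasagne.layers.get_all_params(wrapped, regularizable=True))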

alexjc commented 8 years ago

Ah, the problem is likely because l.get_params refers to the batch_norm layer and no longer the original regularizable layer. Hmm.

If something similar is happening for the trainable parameters it could explain learning_rate problems.
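
Something like this should confirm it (untested, same toy setup as above, from memory of the Lasagne API):

    # Untested: does the wrapper also hide the *trainable* parameters from a
    # per-layer get_params() call?
    import lasagne

    net = lasagne.layers.InputLayer(shape=(None, 10))
    net = lasagne.layers.DenseLayer(net, num_units=5)
    wrapped = lasagne.layers.batch_norm(net)

    print(wrapped.get_params(trainable=True))                       # per-layer view
    print(lasagne.layers.get_all_params(wrapped, trainable=True))   # whole stack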

jimmyroyer commented 8 years ago

I think one possible way to fix it would be to replace l.get_params at line 65 with a params list computed by something like:

    params = []
    while hasattr(l, 'input_layer'):
        params.extend(l.get_params(regularizable=True))
        l = l.input_layer
    params = [params[1]]

It might have implications for other parts of the code as well (in the gradient computation, maybe). I can try to find out.

jimmyroyer commented 8 years ago

This issue trickles down into _create_trainer_function (line 83): when batch normalization is toggled on, the updates dictionary self._learning_rule comes back empty, so theano.function is compiled without any parameter updates.

    def _create_trainer_function(self, params, cost):
        if self.learning_rule in ('sgd', 'adagrad', 'adadelta', 'rmsprop', 'adam'):
            lr = getattr(lasagne.updates, self.learning_rule)
            self._learning_rule = lr(cost, params, learning_rate=self.learning_rate)
        elif self.learning_rule in ('momentum', 'nesterov'):
            lasagne.updates.nesterov = lasagne.updates.nesterov_momentum
            lr = getattr(lasagne.updates, self.learning_rule)
            self._learning_rule = lr(cost, params, learning_rate=self.learning_rate,
                                     momentum=self.learning_momentum)
        else:
            raise NotImplementedError(
                "Learning rule type %s is not supported." % self.learning_rule)

        trainer = theano.function([self.data_input, self.data_output, self.data_mask], cost,
                                  updates=self._learning_rule,
                                  on_unused_input='ignore',
                                  allow_input_downcast=True)
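
The consequence is easy to see in isolation (toy Theano example, independent of sknn): with an empty updates dictionary the compiled trainer evaluates the cost but never touches any shared variable, so whatever learning rate was requested is irrelevant:

    # Toy Theano example: an empty updates dictionary means the compiled
    # "trainer" evaluates the cost but never modifies any shared variable.
    import numpy as np
    import theano
    import theano.tensor as T

    w = theano.shared(np.ones(3, dtype=theano.config.floatX), name='w')
    x = T.vector('x')
    cost = T.sum((w * x) ** 2)

    trainer = theano.function([x], cost, updates={})   # what the backend ends up with
    before = w.get_value().copy()
    trainer(np.asarray([1.0, 2.0, 3.0], dtype=theano.config.floatX))
    print(np.array_equal(before, w.get_value()))       # w is untouched -> True
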
jimmyroyer commented 8 years ago

The exact issue is at line 263: mlp_layer.get_params() also returns an empty list (instead of a list containing [W, b]), so the params list in _create_trainer_function is empty and self._learning_rule ends up as an empty dictionary.

    params = []
    for spec, mlp_layer in zip(self.layers, self.mlp):
        if spec.frozen:
            continue
        params.extend(mlp_layer.get_params())

jimmyroyer commented 8 years ago

I think the issue is that the object that comes out of lasagne.layers.batch_norm(network) does not have the same attributes as the one that goes in. More precisely, the returned network's get_params() no longer reports the parameters of the layers it wraps.

Should we contact lasagne developers?

Thanks a lot for your help. Your package is really a life saver.

alexjc commented 8 years ago

I think this is an issue with the wrapper rather than Lasagne's fault. There are a few implications, so it may take a bit of time to fix... In the meantime, the ELU activation (available as "ExpLin") works well without batch normalization.
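
Something along these lines as a stop-gap (units and hyper-parameters are just placeholders):

    # Possible stop-gap until this is fixed: ELU activation ("ExpLin") without
    # batch normalization.
    from sknn.mlp import Classifier, Layer

    nn = Classifier(
        layers=[Layer("ExpLin", units=20),
                Layer("Softmax")],
        learning_rate=0.01, weight_decay=0.001, regularize="L2",
        n_iter=25)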

jimmyroyer commented 8 years ago

I can help if you think it can be useful.