jimmyroyer opened this issue 8 years ago
The batch normalization is separate from learning_rate and weight_decay, so it should be affected separately. Is that what you mean?
Sorry, it's a typo on my end. I meant that regardless of the learning_rate and weight_decay values, when I use normalization="batch" the predictions won't change. Without batch normalization, the predictions are highly sensitive to different learning_rate/weight_decay values, but with batch normalization the predictions become completely independent of learning_rate/weight_decay.
It's just wrapping Lasagne's code for batch normalization. How are the overall results?
I've been testing batch normalization a couple of times and I consistently get worse AUCs compared to no normalization. I started investigating the sensitivity of the results to the learning_rate, and I found that with batch normalization the learning_rate doesn't seem to matter. That is counter-intuitive to me, because I thought that with normalization we could use a higher learning_rate and converge faster.
I still don't understand why the weights and biases end up being exactly identical when I use batch normalization. Regardless of the learning rate or weight decay that I use, I always end up with the exact same trained models (same prediction errors, etc.). Any help understanding the issue would be greatly appreciated. Thanks
If the weights are the same then it must be a bug! I'd start by putting a breakpoint / print statement in the lasagne backend where the BatchNorm layer is created.
Thanks! You suspect the bug is in lasagne?
On May 12, 2016, at 4:03 AM, "Alex J. Champandard" notifications@github.com wrote:
If the weights are the same then it must be a bug! I'd start by putting a breakpoint / print statement in the lasagne backend where the BatchNorm layer is created.
I think I have a little more insight into where the problem can reside. Basically, the backend cost-function regularizer is set to exactly 0.0 when batch normalization is used. That explains, I think, why the weight_decay parameter becomes irrelevant with batch normalization. I will try to dig further to see where the problem comes from, but any more insight would be really useful.
I did the following exercise:
Without batch normalization:

```python
nn = Classifier(layers=[Layer("Softmax")],
                learning_rate=0.1, weight_decay=0.001, regularize="L2")

# After construction:
# nn._backend.regularize   == 'L2'
# nn._backend.weight_decay == 0.001
# nn._backend.regularizer  == Elemwise{add,no_inplace}.0  (a symbolic expression)
```
And with batch normalization I get:
```python
nn = Classifier(layers=[Layer("Softmax", normalize='batch')],
                learning_rate=0.1, weight_decay=0.001, regularize="L2")

# After construction:
# nn._backend.regularize   == 'L2'
# nn._backend.weight_decay == 0.001
# nn._backend.regularizer  == 0.0  (the penalty term has vanished)
```
Actually, the problem seems to be the following: with batch normalization, nn._backend.mlp.get_params(regularizable=True) returns an empty list (there is nothing to regularize). It might be related to issue #198.
The precise problem is in sknn/backend/lasagne/mlp.py, line 171:

```python
network = lasagne.layers.batch_norm(network)
```

If this function is called, the list returned by network.get_params() is empty. This list is needed (among others) at line 65:

```python
self.regularizer = sum(layer_decay[s.name] * apply_regularize(l.get_params(regularizable=True), penalty)
                       for s, l in zip(self.layers, self.mlp))
```
I will try to track down the Lasagne function batch_norm to see what the issue is. In any case, it seems to break the batch normalization feature of sknn.
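To make the failure mode concrete, here is a minimal runnable sketch of the wrapping behaviour. The classes below are hypothetical stand-ins, not the real Lasagne layers; the point is only that get_params() on a layer returns that layer's own parameters, so a freshly created outer wrapper reports none of the wrapped dense layer's weights (tag keywords like regularizable are ignored in this sketch):

```python
# Hypothetical stand-in for a Lasagne-style layer: get_params() returns
# only THIS layer's own parameters, not those of the layers it wraps.
class FakeLayer:
    def __init__(self, input_layer=None, own_params=None):
        self.input_layer = input_layer
        self.own_params = own_params or []

    def get_params(self, **tags):
        # tag filtering (e.g. regularizable=True) omitted in this sketch
        return list(self.own_params)

dense = FakeLayer(own_params=['W', 'b'])                      # the weights we want to regularize
bn = FakeLayer(input_layer=dense, own_params=['gamma', 'beta'])  # batch-norm wrapper
network = FakeLayer(input_layer=bn)                           # outermost wrapper

print(network.get_params())   # [] -- the original W and b are unreachable from here
```

This mirrors what the sknn code at line 65 observes: the outermost layer answers for itself only, and its own parameter list is empty.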
Ah, the problem is likely that l.get_params refers to the batch_norm layer and no longer to the original regularizable layer. Hmm. If something similar is happening for the trainable parameters, it could explain the learning_rate problems.
I think one possible way to fix it would be to replace l.get_params in line 65 by a params list computed by something like:

```python
params = []
while hasattr(l, 'input_layer'):
    params.extend(l.get_params(regularizable=True))
    l = l.input_layer
params = [params[1]]
It might have implications for other parts of the code as well (in the gradient maybe). I can try to find out.
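As a sanity check, here is that traversal idea in a self-contained, runnable form. The layer class is again a hypothetical stand-in for a Lasagne layer, not the real thing; note also that Lasagne itself provides lasagne.layers.get_all_params for collecting parameters across a whole layer graph, which may be the cleaner fix:

```python
# Stand-in layer whose get_params() honours a regularizable flag.
class FakeLayer:
    def __init__(self, input_layer=None, regularizable=None):
        self.input_layer = input_layer
        self._regularizable = regularizable or []

    def get_params(self, regularizable=False, **tags):
        return list(self._regularizable) if regularizable else []

def collect_regularizable(layer):
    """Walk the input_layer chain, gathering regularizable params
    from the layer and everything it wraps."""
    params = []
    while layer is not None and hasattr(layer, 'input_layer'):
        params.extend(layer.get_params(regularizable=True))
        layer = layer.input_layer
    return params

dense = FakeLayer(regularizable=['W'])   # biases are typically not regularized
bn = FakeLayer(input_layer=dense)        # batch-norm wrapper: nothing regularizable here
network = FakeLayer(input_layer=bn)      # outermost wrapper

print(collect_regularizable(network))   # ['W'] -- recovered from the wrapped layer
```

The traversal recovers the wrapped layer's weight where a plain network.get_params(regularizable=True) would return nothing.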
This issue trickles down into the function _create_trainer_function (line 83). When batch normalization is enabled, self._learning_rule ends up as an empty dictionary, so theano.function is compiled with no parameter updates and the learning_rate and learning_rule effectively do nothing.
```python
def _create_trainer_function(self, params, cost):
    if self.learning_rule in ('sgd', 'adagrad', 'adadelta', 'rmsprop', 'adam'):
        lr = getattr(lasagne.updates, self.learning_rule)
        self._learning_rule = lr(cost, params, learning_rate=self.learning_rate)
    elif self.learning_rule in ('momentum', 'nesterov'):
        lasagne.updates.nesterov = lasagne.updates.nesterov_momentum
        lr = getattr(lasagne.updates, self.learning_rule)
        self._learning_rule = lr(cost, params, learning_rate=self.learning_rate,
                                 momentum=self.learning_momentum)
    else:
        raise NotImplementedError(
            "Learning rule type %s is not supported." % self.learning_rule)

    trainer = theano.function([self.data_input, self.data_output, self.data_mask], cost,
                              updates=self._learning_rule,
                              on_unused_input='ignore',
                              allow_input_downcast=True)
```
The exact issue is at line 263: mlp_layer.get_params() also returns an empty list (instead of a list with [W, b]), so the params list passed to _create_trainer_function is empty and self._learning_rule ends up as an empty dictionary.
```python
params = []
for spec, mlp_layer in zip(self.layers, self.mlp):
    if spec.frozen:
        continue
    params.extend(mlp_layer.get_params())
```
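The knock-on effect is easy to reproduce in isolation: an update rule maps each parameter to its new value, so an empty params list yields an empty updates dictionary and a compiled trainer that changes nothing. A tiny sketch, using a simplified stand-in for an SGD rule rather than the real lasagne.updates.sgd:

```python
# Simplified stand-in for an SGD update rule: maps each parameter
# (represented here as a plain float) to param - learning_rate * grad.
def sgd_updates(params, grads, learning_rate):
    return {p: p - learning_rate * g for p, g in zip(params, grads)}

# Normal case: one parameter with a gradient gets an update entry.
print(sgd_updates(params=[2.0], grads=[2.0], learning_rate=0.5))  # {2.0: 1.0}

# The buggy case: no params means no updates, so training is a no-op.
print(sgd_updates(params=[], grads=[], learning_rate=0.5))        # {}
```

This matches the symptom from earlier in the thread: with an empty updates dictionary, any learning_rate produces the same (untrained) model.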
I think the issue is that the object that comes out of lasagne.layers.batch_norm(network) does not have the same attributes as the one that goes into the function. More precisely, the network doesn't seem to expose the original layer's parameters through get_params() anymore.
Should we contact lasagne developers?
Thanks a lot for your help. Your package is really a life saver.
I think this is an issue with the wrapper rather than Lasagne's fault. There are a few implications of this, so it may take a bit of time to fix... In the meantime, the ELU activation (available as "ExpLin") works well without batch normalization.
I can help if you think it can be useful.
Hello, when I use normalize="batch" in the mlp classifier, I find that the weights/biases/predictions are not affected when changing the learning_rate or the weight_decay parameters. Is that expected? Thanks again for all your help