Lasagne/Lasagne: Lightweight library to build and train neural networks in Theano
http://lasagne.readthedocs.org/

Avoid numerical instability in objectives.categorical_crossentropy #567

Open trungnt13 opened 8 years ago

trungnt13 commented 8 years ago

There is a numerical instability in the objectives.categorical_crossentropy function which causes vanishing gradients right after the first training batch. I suggest adding $\epsilon$ to prevent the probabilities from getting close to 0.0 or 1.0 (original idea from the Keras implementation). It is just a simple tweak, but it significantly stabilizes training without any performance trade-off.

import theano.tensor

def categorical_crossentropy(predictions, targets):
    # clip the predictions away from 0 and 1 so that log() never sees an exact 0
    _EPSILON = 10e-8
    predictions = theano.tensor.clip(predictions, _EPSILON, 1.0 - _EPSILON)
    return theano.tensor.nnet.categorical_crossentropy(predictions, targets)
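
To see the failure mode concretely, here is a minimal snippet (separate from the fix above, not part of Lasagne) that feeds a fully saturated softmax output into Theano's cross-entropy: without clipping the loss is inf and the gradient with respect to the predictions is non-finite, while the clipped version stays finite.

import numpy as np
import theano
import theano.tensor as T

preds = T.matrix('preds')
targs = T.matrix('targs')

# plain cross-entropy vs. cross-entropy on clipped predictions
unclipped = T.nnet.categorical_crossentropy(preds, targs).mean()
clipped = T.nnet.categorical_crossentropy(
    T.clip(preds, 1e-7, 1.0 - 1e-7), targs).mean()

f = theano.function(
    [preds, targs],
    [unclipped, T.grad(unclipped, preds), clipped, T.grad(clipped, preds)],
    allow_input_downcast=True)

p = np.array([[0.0, 1.0]], dtype='float32')  # saturated softmax output
t = np.array([[1.0, 0.0]], dtype='float32')  # target is the zero-probability class
print(f(p, t))  # unclipped: inf loss and -inf gradient; clipped: both finite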

This is the code I used for the experiment (just replace dataset.load_mnist() with your own version of the MNIST dataset):

from __future__ import print_function, division

import os
os.environ["THEANO_FLAGS"] = "mode=FAST_RUN,device=cpu,floatX=float32"
import theano
from theano import tensor

import numpy as np
import scipy as sp

import dnntoolkit
reload(dnntoolkit)
import lasagne
reload(lasagne)

# ======================================================================
# Simulate data
# ======================================================================
mnist = dnntoolkit.dataset.load_mnist()
print(mnist)
X_train = mnist['X_train'].value.reshape(-1, 784)
y_train = mnist['y_train'].value
X_test = mnist['X_test'].value.reshape(-1, 784)
y_test = mnist['y_test'].value

X1, y1 = X_train[:512], y_train[:512]
X2, y2 = X_train[512:1024], y_train[512:1024]

y = tensor.matrix(name='y')
# ======================================================================
# Keras
# ======================================================================
import keras
import keras.models
import keras.layers.core
import keras.layers.noise
import keras.optimizers
import keras.objectives

m = keras.models.Sequential()
m.add(keras.layers.core.Dense(1024, input_shape=(784,), activation='relu'))
m.add(keras.layers.core.Dense(10, activation='softmax'))
k_pred = m.get_output(train=True)
k_params = [m.layers[0].W, m.layers[0].b, m.layers[1].W, m.layers[1].b]
k_cost_train = keras.objectives.categorical_crossentropy(y, k_pred)
k_grad = tensor.grad(k_cost_train, k_params)

k_grad = theano.function(
    inputs=[m.get_input(train=True), y],
    outputs=k_grad,
    allow_input_downcast=True)
print('Built keras function!')

# ======================================================================
# Test model
# ======================================================================
l_in = lasagne.layers.InputLayer(name='input', shape=(None, 784))
l = lasagne.layers.DenseLayer(l_in, num_units=1024,
    nonlinearity=lasagne.nonlinearities.rectify)
net = lasagne.layers.DenseLayer(l, num_units=10,
    nonlinearity=lasagne.nonlinearities.softmax)

lasagne.layers.set_all_param_values(net, m.get_weights())

input_var = [l_in.input_var]
y_pred = lasagne.layers.get_output(net)

cost_train = lasagne.objectives.categorical_crossentropy(y_pred, y).mean()

params = lasagne.layers.get_all_params(net, trainable=True)
grad = tensor.grad(cost_train, params)

f_grad = theano.function(
    inputs=input_var + [y],
    outputs=grad,
    allow_input_downcast=True
)
print('Built lasagne function!')

# ======================================================================
# Main test
# ======================================================================
print(k_grad(X1, y1))
print(f_grad(X1, y1))

benanne commented 8 years ago

If we decide to add this I think epsilon should be an optional keyword argument, so that it can be treated as a hyperparameter if desired. Also, it should probably default to 0, else we break backward compatibility (i.e. if people were to upgrade to a version of Lasagne where a nonzero epsilon is added by default, the objective function in their code would implicitly change).
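
Something along these lines, presumably (just a sketch of the proposal, not existing Lasagne code; with the default epsilon=0.0 the clip is skipped entirely, so current behaviour is unchanged):

import theano.tensor as T

def categorical_crossentropy(predictions, targets, epsilon=0.0):
    # a positive epsilon clips the predictions away from 0 and 1 before the log;
    # the default of 0.0 leaves the graph exactly as it is today
    if epsilon > 0:
        predictions = T.clip(predictions, epsilon, 1.0 - epsilon)
    return T.nnet.categorical_crossentropy(predictions, targets)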

trungnt13 commented 8 years ago

Yes, I think it should be added soon, together with documentation, so that at least everyone is aware that numerical instability can happen there.

It was very difficult to debug the source of the NaN gradient until I created two similar models with similar weights in both Lasagne and Keras. The only difference is their objective function. You can check the test: run it manually for several epochs; the Keras model with the stable objective has no vanishing gradients. You can find the link to the MNIST dataset here: https://s3.amazonaws.com/ai-datasets/mnist.hdf
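
In case anyone wants to reproduce this without dnntoolkit, the file can presumably be read directly with h5py (a sketch, assuming it exposes the same X_train/y_train/X_test/y_test keys the script above accesses):

import h5py

with h5py.File('mnist.hdf', 'r') as f:
    X_train = f['X_train'][...].reshape(-1, 784)
    y_train = f['y_train'][...]
    X_test = f['X_test'][...].reshape(-1, 784)
    y_test = f['y_test'][...]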

As for the compatibility issue, shouldn't you guys have more effective version control?

benanne commented 8 years ago

I don't see how this relates to version control. We just can't change part of the interface in a backwards-incompatible way.

Clipping predictions is a well-known trick, perhaps we should just rely on the user doing this beforehand? Does this need to be part of the cross entropy function at all? Maybe it's more convenient that way though, and it should be a common use case after all.
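
For reference, clipping beforehand on the user side would look roughly like this (a sketch with a throwaway model; the 1e-7 constant is an arbitrary choice, not a Lasagne default):

import theano.tensor as T
import lasagne

# a tiny stand-in model, just to make the snippet self-contained
input_var = T.matrix('inputs')
target_var = T.matrix('targets')  # one-hot targets
l_in = lasagne.layers.InputLayer((None, 784), input_var=input_var)
network = lasagne.layers.DenseLayer(
    l_in, num_units=10, nonlinearity=lasagne.nonlinearities.softmax)

prediction = lasagne.layers.get_output(network)
# clip explicitly in user code, before building the objective
prediction = T.clip(prediction, 1e-7, 1.0 - 1e-7)
loss = lasagne.objectives.categorical_crossentropy(prediction, target_var).mean()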

trungnt13 commented 8 years ago

Backward-compatibility is important but should not be the highest priority. My point is that Lasagne should have a better way of organizing commits (e.g. using tags) into small and big releases. There has been only one initial release, and you have since merged an enormous amount of changes and new features into the library.

I did not mean that you should make a new release just because of this numerical instability. However, big API changes should be possible if they are an improvement, and old users can keep their projects working by using an old release.

Yes, numerical instability is a well-known issue as well, but it is hard to know when it will strike. From my naive view:

then it is good for common use. Some float64 users may suffer from the change, in which case a default of epsilon=0.0 would solve the issue.

benanne commented 8 years ago

Backward-compatibility is important but should not be the highest priority.

I strongly disagree. The problem with changing the behaviour of existing code is that you cannot provide deprecation warnings. You have to make the change and then hope that everyone is aware that their existing code will suddenly behave slightly differently. This is not feasible in practice.

There has been only one initial release, and you have since merged an enormous amount of changes and new features into the library.

So far we've expected most people would use the latest from GitHub anyway. Our continuous integration setup ensures that this is always stable enough for general use. But perhaps you're right, maybe we should consider releasing more often. The problem with that is that a lot of new features need some time to mature, so it is useful to have a time window in which changes (including API changes and backwards incompatible changes like the one you proposed here) can happen without any repercussions.

I did not mean that you should make a new release just because of this numerical instability.

Nor did I think that was what you meant :)

then it is good for common use. Some float64 users may suffer from the change, in which case a default of epsilon=0.0 would solve the issue.

Agree, if we implement this and set the default to 0.0 I have no problem with the change. We just need to think about whether this is the right thing to do (all things considered, it probably is).

ebenolson commented 8 years ago

I don't think compatibility is a strong argument against this. If it was a real bug that created completely incorrect results, we wouldn't/shouldn't worry about breaking compatibility by fixing it. Theano also has compatibility-breaking changes fairly often, so as long as we are recommending people use Theano master I don't see much point in us worrying about compatibility.

As I've said before (and I think @trungnt13 is implying), it would be nice if __version__ incorporated the commit, so that people could easily check what code they are running. But if someone really cares about exact reproducibility they probably ought to fix a version of Theano + Lasagne at the start of their experiments and never upgrade.
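
In the meantime, a rough way to check which commit you are actually running (a sketch; it assumes Lasagne was installed from a git checkout, e.g. with pip install --editable .):

import os
import subprocess
import lasagne

print(lasagne.__version__)
# locate the checkout that contains the installed package and ask git for its HEAD
repo = os.path.dirname(os.path.dirname(lasagne.__file__))
print(subprocess.check_output(['git', '-C', repo, 'rev-parse', 'HEAD']).strip())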

I'm not sure whether this particular change is necessary, but I did run into the same issue (and used the same fix) a few days ago myself. I doubt adding epsilon=0.0 will help many people though. Are there any drawbacks to nonzero epsilon besides compatibility?

f0k commented 8 years ago

I'm on Sander's side of not breaking backwards compatibility in a way that can't even be communicated to the user. Adding epsilon=0 with some documentation would be nice, though.

But if someone really cares about exact reproducibility they probably ought to fix a version of Theano + Lasagne at the start of their experiments and never upgrade.

What if I build on some previous experiments from half a year ago and want to use some of the more recent Lasagne features? Do I have to craft my own version of Lasagne for that by carefully merging in some of the changes and omitting others? Keeping things backwards-compatible (or break loudly if that's not possible) makes it a lot easier to upgrade Lasagne. We're doing a good job so far.

Theano also has compatibility-breaking changes fairly often

I was under the impression that they also cared about not breaking the interface. Do you have some example?

I doubt adding epsilon=0.0 will help many people though. Are there any drawbacks to nonzero epsilon besides compatibility?

The only other drawback is that somebody might be clipping the predictions already, or enforcing predictions within (0,1) in some other way. But I'd advocate for "explicit is better than implicit" -- you should clip the predictions consciously, not unknowingly because somebody decided it would be good for you. It would be great to set up a simple "NaN guide" to help users encountering NaNs quickly find possible causes and countermeasures, though (#298).

It was very difficult to debug the source of the NaN gradient until I created two similar models with similar weights in both Lasagne and Keras. The only difference is their objective function.

Sorry for the trouble -- but I think this is because Keras didn't document it, not because Lasagne didn't implement it. You probably weren't aware that the predictions are clipped before you compared the implementations in detail, and you probably wouldn't have mentioned it in a research paper describing your experiments either, would you?

There has been only one initial release, and you have since merged an enormous amount of changes and new features into the library.

Yes, that's something we're guilty of. We're not in a "release early, release often" mode, but we are careful not to introduce any changes in a release that would have to be taken back later. In addition, all core developers use the bleeding-edge version of Lasagne and Theano, so there's not a lot of pressure to do a release (it's the same problem for Theano, I assume). Sorry!

nouiz commented 8 years ago

This eps won't work with float16. Can you make it conditional and use a larger eps if the dtype is float16? That way people trying it won't run into this problem.
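
One possible way to handle that (just a sketch, not existing Lasagne code) would be to derive the clipping constant from the machine epsilon of the prediction dtype, so that 1.0 - eps is still distinguishable from 1.0 in float16:

import numpy as np
import theano.tensor as T

def categorical_crossentropy(predictions, targets, epsilon=None):
    if epsilon is None:
        # machine epsilon: ~1e-3 for float16, ~1e-7 for float32, ~2e-16 for float64
        epsilon = np.finfo(predictions.dtype).eps
    predictions = T.clip(predictions, epsilon, 1.0 - epsilon)
    return T.nnet.categorical_crossentropy(predictions, targets)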
