bioinf-jku / SNNs

Tutorials and implementations for "Self-normalizing networks"

Does SELU alone have a positive impact on accuracy? #5

Closed. jczaja closed this issue 6 years ago

jczaja commented 6 years ago

Hi,

In the MNIST and CIFAR-10 tutorials, SELU is used together with alpha dropout, and the result of the experiments is that the SNN outperforms the ReLU- and ELU-based models. MNIST models (LeNet) can reach quite good accuracy without dropout or batch norm, so my question is whether, according to your observations, SELU alone (no dropout and no batch norm) increases accuracy. What I mean is: I have a basic CNN working on MNIST (convolutions, ReLU, fully connected layers, softmax). Assuming that weight initialization and input normalization are done correctly, can I expect increased accuracy if I just swap ReLU for SELU?

gklambauer commented 6 years ago

Hello jczaja,

No, just adding SELUs (with the corresponding normalization) will in general not improve the accuracy of your CNNs -- in fact, we were even a bit surprised that SELUs also work well for CNNs (not only for FNNs). Especially if you have a "working model" developed with ReLUs, it cannot be expected to work as well or even better with SELUs.

There are multiple reasons why this is the case; for one, convolutional and max-pooling layers could have effects that cannot be countered by SELUs alone.

That being said, there were quite a number of successes where we just exchanged the activation function (together with the initialization and dropout) and ended up with improved networks, e.g. SqueezeNet, the CIFAR-10 example in this repository, and some unnamed/in-house CNNs for biological data. This means that your strategy is definitely a possible way to go...
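To make that exchange concrete, here is a minimal sketch in the tf.keras API (not the repository's code; the layer sizes and the dropout rate are placeholders chosen only for illustration). It swaps ReLU / He initialization / standard dropout for SELU / LeCun-normal initialization / alpha dropout while leaving everything else unchanged:

```python
import tensorflow as tf

def build_cnn(selu: bool = True, num_classes: int = 10) -> tf.keras.Model:
    # SELU variant: LeCun-normal init + AlphaDropout (keeps the activations close
    # to zero mean / unit variance). ReLU variant: He init + plain Dropout.
    if selu:
        act, init, Drop = "selu", "lecun_normal", tf.keras.layers.AlphaDropout
    else:
        act, init, Drop = "relu", "he_normal", tf.keras.layers.Dropout

    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation=act, kernel_initializer=init,
                               input_shape=(32, 32, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation=act, kernel_initializer=init),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation=act, kernel_initializer=init),
        Drop(0.05),  # illustrative rate; alpha dropout rates are typically small
        tf.keras.layers.Dense(num_classes, activation="softmax",
                              kernel_initializer=init),
    ])
```

The only differences between the two variants are the activation, the initializer, and the dropout layer; the architecture and training setup stay identical, which is the spirit of the exchange described above.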

Regards, Günter

jczaja commented 6 years ago

Hi gklambauer,

You mentioned that: "Convolutional and max-pooling layers could have effects that cannot be countered by SELUs alone".

I understand that max-pooling changes the mean and variance of the signal and may counter the normalization that SELU performs, since SELU normalizes the signal iteratively. But what are the problems with convolutions in terms of the normalization performed by SELU?
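As a small numerical check of this point (a NumPy sketch; treating each group of four independent standard-normal values as one 2x2 pooling window is an idealized assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((100_000, 4))   # each row: one 2x2 pooling window, zero mean, unit variance
pooled = x.max(axis=1)                  # max-pooling over the window

print(f"before pooling: mean = {x.mean():+.2f}, var = {x.var():.2f}")
print(f"after  pooling: mean = {pooled.mean():+.2f}, var = {pooled.var():.2f}")
# Roughly: the mean jumps from 0 to about +1.0 and the variance drops to about 0.5,
# so the pooled signal is no longer at the zero-mean / unit-variance fixed point
# that SELU drives the activations towards.
```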

gklambauer commented 6 years ago

Conv layers could be problematic for the central limit theorem since only a few inputs are summed. However, those sums are "averaged" across positions of the image/feature map, which could be beneficial again.
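A rough NumPy illustration of the fan-in effect (the rectified-Gaussian inputs and unit weights are assumptions made purely for illustration): a 3x3 conv over a single channel sums only 9 terms, while one over 64 channels sums 576, and the larger sum is much closer to Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness_of_summed_inputs(fan_in: int, n_samples: int = 20_000) -> float:
    # Skewed (rectified-Gaussian) inputs, roughly what a rectifying nonlinearity produces.
    x = np.maximum(rng.standard_normal((n_samples, fan_in)), 0.0)
    z = x.sum(axis=1)                      # conv-like sum over the receptive field
    z = (z - z.mean()) / z.std()
    return float((z ** 3).mean())          # sample skewness; 0 for a perfect Gaussian

print("fan_in =   9 (3x3 conv, 1 input channel):   skewness ~", round(skewness_of_summed_inputs(9), 2))
print("fan_in = 576 (3x3 conv, 64 input channels): skewness ~", round(skewness_of_summed_inputs(576), 2))
# The 9-term sum stays visibly skewed (roughly 0.5), while the 576-term sum is
# much closer to Gaussian -- the central-limit-theorem concern mentioned above.
```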

jczaja commented 6 years ago

I have a question about the normalization of the input. Is it really needed? I understand that making the input zero-mean is important, but is Variance(x) = 1 also needed? Don't Theorems 2 and 3 already give some bounds on the input, assuming the input is zero-mean and the weights are initialized properly?

gklambauer commented 6 years ago

Yes, as you stated, in theory, after a couple of fully-connected layers, the variance goes to one anyway. However, empirically, we found that scaling inputs to unit variance helps the network to learn faster. If you are thinking about ConvNets... there we typically use a global mean and variance for input normalization (as in our example https://github.com/bioinf-jku/SNNs/blob/master/SelfNormalizingNetworks_CNN_CIFAR10.ipynb).
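For concreteness, a minimal sketch of such a global normalization (variable names are illustrative; this is not the notebook's exact code):

```python
import numpy as np

def normalize_globally(train_images: np.ndarray, test_images: np.ndarray):
    # One scalar mean and one scalar standard deviation over the whole training
    # set (all pixels and channels pooled), applied to both training and test
    # images so that the inputs are roughly zero-mean with unit variance.
    train = train_images.astype(np.float32)
    test = test_images.astype(np.float32)
    mu = train.mean()
    sigma = train.std()
    return (train - mu) / sigma, (test - mu) / sigma
```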

jczaja commented 6 years ago

Thanks very much for your work on SNNs and for the very detailed answers to my questions.

Regards, Jacek