On-going benchmark of activations pull requests

ducha-aiki commented 8 years ago

Hi,

I have started an batchnorm/activations/architectures evaluation on ImageNet 2012 with image side = 128. The reason for 128 is that training is much faster (48 hours on GTX980) than default setup and not change overall picture.

The reason for ImageNet is that CIFAR10/MNIST experiments are not representable for big datasets. I.e. vlReLU better than ReLU on CIFAR10, but worse at ImageNet. BatchNorm hurts on CIFAR10, but helps on ImageNet, etc.

BatchNorm evaluation: https://github.com/ducha-aiki/caffenet-benchmark Activations evaluation: https://github.com/ducha-aiki/caffenet-benchmark/blob/master/Activations.md Architectures evaluation: https://github.com/ducha-aiki/caffenet-benchmark/blob/master/Architectures.md

I am not sure, if it is relevant as Issue or Wiki, but think that community could benefit from it.

P.S. Requests "what to evaluate next" are welcome. PRs with your tests are welcomed even more :) P.P.S. Current on-going training:

Maxout non-linearity with same complexity as for ReLUs - sqrt(2) narrower layers
ELU
PReLU
RReLU
BatchNorm with EltwiseAffine Layer, when fix then bug in it.

pliskowski commented 8 years ago

Just curious, do you have any idea why accuracy and loss curves have these steps? It seems to me that only the change in learning rate causes a huge leap in accuracy and there is hardly any learning between points where learning rate is stable.

ducha-aiki commented 8 years ago

@pliskowski Yes, steps are because of learning rate change. As for your second guess, it is not correct. First, learning is going even if it is slow. If you look at training logs, loss decreases. Second, there is a huge temptation to do steps, day not at each 100K, but at 100K, 120K and 140K. However, it will hurt your performance a lot - for several accuracy percent. It could be not a big deal for some practice tasks, but for ImageNet and, say, Kaggle, it makes all the difference in the world :)

pliskowski commented 8 years ago

@ducha-aiki I can see that there is learning but it is very slow compared to those leaps. Is there any explanation why decreasing learning rate causes such a significant boost in performance? I just cannot imagine how the classifier improves its performance over a few iterations just because of reducing the learning rate.

ducha-aiki commented 8 years ago

@pliskowski Imagine, you need to get a=1.2345. You start with a=0 and is allowed to add or subtract 1*learning_rate. If you begin to decrease learning rate too early, you never got even a=1. After you have got a = 1 with learning_rate=1, it is extremely hard to get a=1.23, until you decrease learning rate.

ducha-aiki commented 8 years ago

Added ELU, on-going PReLU and RReLU

ducha-aiki commented 8 years ago

Added BN+ReLUx(Dropout=0; 0.2; 0.5), 0.2 rules.

On-going:

(still :), but will finish ~ tomorrow ) PReLU, RReLU, maxout

ducha-aiki commented 8 years ago

Finished PReLU, RReLU, maxout test. Maxout results shows, that you need to wait until finish training to get results, not judge by first 100K iters

In progress:

data augmentation by 10% input dropout
SPP on caffenet (single scale training)

ducha-aiki commented 8 years ago

Added:

data augmentation by 10% input dropout (very bad).

In progress:

SPP on caffenet (single scale training)
ThinResNet. 100-layer deep resudual net with CaffeNet speed.

ducha-aiki commented 8 years ago

Added:

SPP on caffenet (single scale training)

In progress:

ThinResNet. 100-layer deep resudual net with CaffeNet speed.
Maxout + BN

ducha-aiki commented 8 years ago

Added nice tables with results

ducha-aiki commented 8 years ago

Added:

ThinResNet. 100-layer deep resudual net with CaffeNet speed. Maxout + BN

In progress:

multiscale training

ducha-aiki commented 8 years ago

Added stochastic pooling

ducha-aiki commented 8 years ago

Added one more attempt to train MSRA ResNet.

ducha-aiki commented 8 years ago

Added:

AvgPool
SGD vs. Nesterov
various lr_policies
scale augmentation
3x3 + 3x3 vs 5x5 vs 5x1 + 1x5 vs 3x1 + 1x3 + 3x1 +1x3 (by @dereyly )
BN (before ReLU) + EA

bhack commented 8 years ago

Do you plan to implement Net2Net to extend some of this archs without too much retraining?

ducha-aiki commented 8 years ago

@bhack yes, but not for testing - with Net2Net it would be infair comparison, I guess. For me, the outcome of comparison are not only final accuracies, but also training graphs, which can give you some insights.

And when cudnn3 will be supported by TensorFlow, I am afraid that migrate there.

bhack commented 8 years ago

@ducha-aiki Yes I meant for fast evaluation on improved accuracy. Then for formal testing you could always retrain the new experimented extended arch with scratch initialization.

You choice it is understandable.

bhack commented 8 years ago

@ducha-aiki "And when..." https://github.com/tensorflow/tensorflow/commit/22ebf0a94fd42af2d78b7964e836c92673ddfa31

ducha-aiki commented 8 years ago

@bhack thanks! Will try next week. btw, tomorrow new results in colorspaces for caffenet :)

ducha-aiki commented 8 years ago

Added colorspace, poolings, and googlenet-128 for baseline

ducha-aiki commented 8 years ago

@bhack looks like still no CUDA 7.5 Support :(

BVLC / caffe

On-going benchmark of activations pull requests #3437