Currently the stochastic optimizers produce highly variable training loss values across 10 trials on synthetic tasks (e.g. charset). Also, tuning on max_epochs gives much better performance for benchmark_stoch, which suggests that more epochs are needed to tune the hyper-parameters properly.
To check whether any of the following helps:
- using a linear (inverse-time) learning rate schedule alpha/(1+decay*t) instead of the exponential one alpha/(1+t)^decay, so that the optimizer does not get stuck (see the schedule sketch after this list)
- always using a learning rate decay, but including 0 among the tunable values of the decay rate, so that "no decay" is just one point in the search space
- performing the hyper-parameter tuning over more epochs, but potentially on a smaller subset of the data to keep the cost comparable (see the tuning sketch below)
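A minimal sketch comparing the two schedules, assuming illustrative values for alpha and decay (these numbers and variable names are not taken from the codebase):

```python
import numpy as np

# Illustrative values only, not taken from the codebase.
alpha, decay = 0.1, 0.1
t = np.arange(1000)

# Proposed linear (inverse-time) schedule: alpha / (1 + decay * t)
lr_linear = alpha / (1.0 + decay * t)

# Current schedule: alpha / (1 + t) ** decay
lr_power = alpha / (1.0 + t) ** decay

# Print the step size at a few iterations to compare how fast each decays.
for step in (0, 10, 100, 999):
    print(step, lr_linear[step], lr_power[step])
```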
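And a rough sketch of the tuning setup covering the last two points: decay = 0 is always part of the search space, and each trial runs more epochs on a random subset of the data. The names tune, sample_params, train_fn, tuning_epochs and subsample are hypothetical placeholders, not existing functions in the repo; only max_epochs comes from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_params():
    # decay = 0 is always a candidate, so "no decay" is just one point
    # of the tunable decay rate rather than a separate code path.
    return {
        "alpha": 10.0 ** rng.uniform(-4, -1),
        "decay": float(rng.choice([0.0, 1e-4, 1e-3, 1e-2, 1e-1])),
    }

def tune(train_fn, X, y, n_trials=10, tuning_epochs=200, subsample=0.2):
    """Random search: more epochs per trial, but on a smaller random subset."""
    n_sub = max(1, int(subsample * len(X)))
    idx = rng.choice(len(X), size=n_sub, replace=False)
    X_sub, y_sub = X[idx], y[idx]

    best_params, best_loss = None, np.inf
    for _ in range(n_trials):
        params = sample_params()
        # train_fn is assumed to train the model and return the final training loss.
        loss = train_fn(X_sub, y_sub, max_epochs=tuning_epochs, **params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params
```

The point of the subsampling is that the extra epochs stay affordable, so the tuned decay rate is selected under a budget closer to the one the final benchmark actually uses.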