Currently the stochastic optimizers produce highly variable training loss values across 10 trials on synthetic tasks (e.g. charset). Also, tuning on max_epochs gives much better performance for benchmark_stoch, which suggests that more epochs are needed to tune the hyper-parameters properly.
To check whether any of the following helps:
- using a linear (inverse-time) learning rate schedule alpha/(1+decay*t) instead of the exponential one alpha/(1+t)^decay, so that the optimizer does not get stuck (see the schedule sketch after this list)
- always using a learning rate decay, but including 0 among the tunable values of the decay rate, so that "no decay" is just one point in the search space
- performing the hyper-parameter tuning over more epochs, but potentially on a smaller subset of the data to keep the cost comparable (see the tuning sketch below)
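A minimal sketch comparing the two schedules, assuming illustrative values for alpha and decay (these numbers and variable names are not taken from the codebase):

```python
import numpy as np

# Illustrative values only, not taken from the codebase.
alpha, decay = 0.1, 0.1
t = np.arange(1000)

# Proposed linear (inverse-time) schedule: alpha / (1 + decay * t)
lr_linear = alpha / (1.0 + decay * t)

# Current schedule: alpha / (1 + t) ** decay
lr_power = alpha / (1.0 + t) ** decay

# Print the step size at a few iterations to compare how fast each decays.
for step in (0, 10, 100, 999):
    print(step, lr_linear[step], lr_power[step])
```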
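And a rough sketch of the tuning setup covering the last two points: decay = 0 is always part of the search space, and each trial runs more epochs on a random subset of the data. The names tune, sample_params, train_fn, tuning_epochs and subsample are hypothetical placeholders, not existing functions in the repo; only max_epochs comes from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_params():
    # decay = 0 is always a candidate, so "no decay" is just one point
    # of the tunable decay rate rather than a separate code path.
    return {
        "alpha": 10.0 ** rng.uniform(-4, -1),
        "decay": float(rng.choice([0.0, 1e-4, 1e-3, 1e-2, 1e-1])),
    }

def tune(train_fn, X, y, n_trials=10, tuning_epochs=200, subsample=0.2):
    """Random search: more epochs per trial, but on a smaller random subset."""
    n_sub = max(1, int(subsample * len(X)))
    idx = rng.choice(len(X), size=n_sub, replace=False)
    X_sub, y_sub = X[idx], y[idx]

    best_params, best_loss = None, np.inf
    for _ in range(n_trials):
        params = sample_params()
        # train_fn is assumed to train the model and return the final training loss.
        loss = train_fn(X_sub, y_sub, max_epochs=tuning_epochs, **params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params
```

The point of the subsampling is that the extra epochs stay affordable, so the tuned decay rate is selected under a budget closer to the one the final benchmark actually uses.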