SMART-Lab / smartlearner

SMART-Learner is a machine learning library built with researchers in mind.
BSD 3-Clause "New" or "Revised" License

Experiments, Benchmark and Hyperoptimizer #54

Open ASalvail opened 9 years ago

ASalvail commented 9 years ago

I've come to a point where I'm doing something that I thought would be fairly simple, but that turns out to be a lot more complicated than expected with our current code. I need your thoughts on the matter.

I have several models I want to benchmark against one another, so I have several benchmarks I want a few models to compete on. To do that, I need to:

As it is, the library is set up to facilitate training a single model on a single dataset with a single set of hyperparameters. I want something higher-level. We could even think about integrating the long-awaited Spearmint.
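
Here's a rough sketch of the shape that higher level could take. Every name in it (`Experiment`, `Benchmark`, `train_fn`) is made up for illustration; nothing here exists in the current codebase:

```python
# Hypothetical sketch: an Experiment runs every model on every benchmark,
# instead of the current one-model/one-dataset/one-hyperparam-set flow.

class Benchmark:
    """Pairs a dataset with the hyperparameter combinations to try on it."""
    def __init__(self, name, dataset, hyperparam_grid):
        self.name = name
        self.dataset = dataset
        self.hyperparam_grid = hyperparam_grid  # list of dicts


class Experiment:
    """Runs every model on every benchmark and gathers the results."""
    def __init__(self, models, benchmarks):
        self.models = models
        self.benchmarks = benchmarks

    def run(self, train_fn):
        # train_fn(model, dataset, hyperparams) -> score is supplied by
        # the user; the Experiment only does the orchestration.
        results = []
        for benchmark in self.benchmarks:
            for model in self.models:
                for hyperparams in benchmark.hyperparam_grid:
                    score = train_fn(model, benchmark.dataset, hyperparams)
                    results.append((benchmark.name, model, hyperparams, score))
        return results
```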

Some ideas, all mixed together:

For now, a grid/random search in the Hyperoptimizer would be great. The Benchmark and Experiment would need more definite roles and types of data to collect. This brings me to the how.
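
To make the grid/random search part concrete, here's a minimal, self-contained sketch of both strategies. The function names are hypothetical; nothing like this exists in SL yet:

```python
import itertools
import random

def grid_search(space):
    """Yield every combination of a hyperparameter space.

    `space` maps a hyperparameter name to the list of values to try,
    e.g. {'lr': [0.1, 0.01], 'batch_size': [32, 64]}.
    """
    names = sorted(space)
    for values in itertools.product(*(space[n] for n in names)):
        yield dict(zip(names, values))

def random_search(space, n_draws, rng=random):
    """Draw `n_draws` random combinations from the same kind of space."""
    names = sorted(space)
    for _ in range(n_draws):
        yield {n: rng.choice(space[n]) for n in names}
```

The Hyperoptimizer could then just iterate over whichever generator it was given, which also leaves the door open for a Spearmint-backed generator later.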

We will need a way to configure an experiment. The more problems I work on, the more I'm tempted to write a general class to set and access those parameters. However, making it general enough would be a pain.
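
Something like this is what I have in mind for that parameter class; a sketch only, with invented names:

```python
class Hyperparams:
    """A small, general container for experiment parameters.

    Parameters are set once, read anywhere, and unknown names fail
    loudly instead of silently defaulting.
    """
    def __init__(self, **defaults):
        self._values = dict(defaults)

    def __getattr__(self, name):
        try:
            return self._values[name]
        except KeyError:
            raise AttributeError("unknown parameter: {}".format(name))

    def override(self, **changes):
        """Return a copy with some parameters changed (useful per-run)."""
        merged = dict(self._values)
        merged.update(changes)
        return Hyperparams(**merged)

# Usage:
#   base = Hyperparams(lr=0.01, batch_size=64)
#   run = base.override(lr=0.1)
#   run.lr          -> 0.1
#   run.momentum    -> AttributeError, not a silent default
```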

We also need a standard way to collect data. I can easily see an Experiment telling a Benchmark what data to collect, which would then cascade all the way down to the Trainer and its Tasks. I don't want to hardcode everything every time, even though for now that's the easiest solution.
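
Here's roughly how I picture the cascade. Again, every class and function name here is hypothetical; the point is only that the metric names flow downward and nothing is hardcoded below the Experiment:

```python
class Task:
    """Computes and records one named quantity during training."""
    def __init__(self, name, compute_fn):
        self.name = name
        self.compute_fn = compute_fn
        self.history = []

    def on_epoch_end(self, model, dataset):
        self.history.append(self.compute_fn(model, dataset))


class Trainer:
    """Runs training and triggers the tasks it was handed."""
    def __init__(self, tasks):
        self.tasks = tasks

    def train(self, model, dataset, n_epochs):
        for _ in range(n_epochs):
            # ... one epoch of actual training would happen here ...
            for task in self.tasks:
                task.on_epoch_end(model, dataset)


def run_benchmark(model, dataset, metrics, n_epochs=10):
    """The Benchmark builds Tasks from the metric spec it received
    from the Experiment, so nothing is hardcoded at this level.

    `metrics` maps a metric name to a compute function,
    e.g. {'valid_error': my_error_fn}.
    """
    tasks = [Task(name, fn) for name, fn in metrics.items()]
    Trainer(tasks).train(model, dataset, n_epochs)
    return {task.name: task.history for task in tasks}
```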

If done right, I think those classes could make SL a much greater help for scientific reproducibility, and they would considerably speed up getting experiments done.

ASalvail commented 9 years ago

This replaces issue #44.