Crossvalidation - Githubissues

The method should look a bit like this:

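import numpy as np
from data.feedforward_data import Data  # our data class, see model_selection

def generate_test_train_split(data, num_splits):
    # think of something
    # return a list of tuples of sets, e.g.:
    # [(train_batch_ids_1, test_batch_ids_1), (train_batch_ids_2, test_batch_ids_2), ...]
    raise NotImplementedError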

def evaluate_on_test_set(forecaster, data):
    # The test set can only be processed in multiple batches:
    # call forecaster.score(X, y) on each of them and average the results.
    raise NotImplementedError

NUM_INNER_ITERATIONS = 5  # placeholder value, not decided yet

def estimate_score(Model, args, kwargs, data):
    test_scores = []
    splits = generate_test_train_split(data, NUM_INNER_ITERATIONS)
    for split in splits:
        forecaster = Model(*args, **kwargs)
        # if it's a tensorflow model, also create and initialize a session, see tests.py
        # TODO: data.train_test_split(split) doesn't work at the moment;
        # ignore it for now, so that we always train on the same data
        forecaster.fit(data)
        acc = evaluate_on_test_set(forecaster, data)
        test_scores.append(acc)
    return sum(test_scores) / len(test_scores)  # average over all splits

def model_selection(Models, params):
    # Models: list of classes, params: list of (args, kwargs) tuples
    data = Data()  # Use our data class, don't write your own. At the moment it's data.feedforward_data.Data
    best_loss = np.inf
    best_model = None
    for Model in Models:
        for args, kwargs in params:
            # treated as a loss here (lower is better);
            # flip the comparison if score() is higher-is-better
            loss = estimate_score(Model, args, kwargs, data)
            if loss < best_loss:
                best_loss = loss
                best_model = (Model, args, kwargs)
    print("Best model", best_model, "with loss", best_loss)
    return best_model
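
For evaluate_on_test_set, a minimal sketch of the batch averaging could look like the block below. It assumes the test batches are available as an iterable of (X, y) pairs, which is not how the current data class exposes them; the test_batches parameter is hypothetical and replaces the data argument used above.

```python
def evaluate_on_test_set(forecaster, test_batches):
    """Average forecaster.score(X, y) over all test batches.

    test_batches is a hypothetical iterable of (X, y) pairs; the real
    data class may expose its batches differently.
    """
    scores = [forecaster.score(X, y) for X, y in test_batches]
    if not scores:
        raise ValueError("no test batches to evaluate on")
    # Unweighted mean: every batch counts the same, whatever its size.
    return sum(scores) / len(scores)
```

If the batches differ a lot in size, weighting each score by the number of rows in its batch is probably the better average.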

It can also look different if you use something from sklearn, but this should illustrate how to evaluate our models on our data class. I wrote estimate_score and not cross_validate because we have so much data that full cross-validation might take too much time. You can do something similar instead: always use only subsets of the data, train multiple times, and average the scores. The difference from cross-validation is that a single training run would not use the whole data set; not even the union of your train and test sets would cover it.
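
The subset idea could be sketched roughly as follows. This is only one possible way to do it: the function takes a plain collection of batch ids instead of the data object, and train_fraction, test_fraction and seed are made-up parameters with arbitrary defaults.

```python
import random

def generate_test_train_split(batch_ids, num_splits,
                              train_fraction=0.2, test_fraction=0.05, seed=0):
    """Return num_splits tuples (train_batch_ids, test_batch_ids).

    Each split is a fresh random subsample, so train and test together
    cover only a part of all batch ids. Within one split the two sets
    are disjoint.
    """
    batch_ids = sorted(batch_ids)
    n_train = max(1, int(len(batch_ids) * train_fraction))
    n_test = max(1, int(len(batch_ids) * test_fraction))
    if n_train + n_test > len(batch_ids):
        raise ValueError("not enough batches for the requested fractions")
    rng = random.Random(seed)
    splits = []
    for _ in range(num_splits):
        # Draw train and test ids in one sample so they cannot overlap.
        sample = rng.sample(batch_ids, n_train + n_test)
        splits.append((set(sample[:n_train]), set(sample[n_train:])))
    return splits
```

sklearn's ShuffleSplit does something similar when train_size and test_size are both set and sum to less than 1.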

How can you do the train/test splitting? It isn't implemented in the current data class, so think of a way you want to do it and tell @MuhammadTaha. One idea would be to add a method train_test_split(train_batch_ids, test_batch_ids). In that case, you have to do the actual splitting yourself (take care that disjoint sets of store_ids are used for the multiple trainings of the same model). @MuhammadTaha, this is important for you: the constructor of the data class should generate batch_ids, and maybe even all batches, so that the split method can work like this. As long as that is not implemented, you can train on Data(toy=True), but keep in mind that we need to add train_test_split later, so write your code so that it will be easy to add.
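
To make the train_test_split(train_batch_ids, test_batch_ids) idea more concrete, here is a rough sketch. SplittableData is a hypothetical stand-in, not our actual data class; the constructor arguments and the train_batches/test_batches helpers are assumptions for illustration only.

```python
class SplittableData:
    """Hypothetical stand-in for the data class, not the real one."""

    def __init__(self, rows, batch_size):
        # Generate all batches up front and key them by batch id,
        # so that a later split only has to select ids.
        self.batches = {
            batch_id: rows[start:start + batch_size]
            for batch_id, start in enumerate(range(0, len(rows), batch_size))
        }
        self.batch_ids = list(self.batches)
        # Until a split is requested, everything is training data.
        self.train_batch_ids = list(self.batch_ids)
        self.test_batch_ids = []

    def train_test_split(self, train_batch_ids, test_batch_ids):
        train, test = set(train_batch_ids), set(test_batch_ids)
        if train & test:
            raise ValueError("train and test batch ids overlap")
        unknown = (train | test) - set(self.batch_ids)
        if unknown:
            raise ValueError("unknown batch ids: %s" % sorted(unknown))
        self.train_batch_ids = sorted(train)
        self.test_batch_ids = sorted(test)

    def train_batches(self):
        return [self.batches[i] for i in self.train_batch_ids]

    def test_batches(self):
        return [self.batches[i] for i in self.test_batch_ids]
```

Usage would be something like data.train_test_split([0, 1], [3]) followed by iterating over data.train_batches() during fit.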

We also said that we eventually want to train one model per store. I think it would be a good idea to ignore this here and implement it in the model instead: an "ExpertGroupForecaster" could create multiple other forecasters and redirect each batch to the expert for that batch's store. But this is open to discussion. What do you think, @mshaban @MuhammadTaha?
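
Just to illustrate the ExpertGroupForecaster idea, a very rough sketch is below. It assumes that every batch belongs to exactly one store and that each expert offers fit(X, y) and predict(X); the method names fit_batch and predict_batch as well as the toy MeanForecaster are made up for this example.

```python
class MeanForecaster:
    """Toy expert for the example: always predicts the training mean."""

    def __init__(self):
        self.mean = 0.0

    def fit(self, X, y):
        self.mean = sum(y) / len(y)

    def predict(self, X):
        return [self.mean for _ in X]


class ExpertGroupForecaster:
    """Keeps one expert per store and routes each batch to it."""

    def __init__(self, make_expert):
        # make_expert: callable that returns a fresh, untrained forecaster
        self.make_expert = make_expert
        self.experts = {}

    def fit_batch(self, store_id, X, y):
        # Create the expert lazily the first time a store shows up.
        if store_id not in self.experts:
            self.experts[store_id] = self.make_expert()
        self.experts[store_id].fit(X, y)

    def predict_batch(self, store_id, X):
        return self.experts[store_id].predict(X)


group = ExpertGroupForecaster(MeanForecaster)
group.fit_batch(1, [[0], [0]], [2.0, 4.0])
group.fit_batch(2, [[0]], [10.0])
```

The evaluation code would not need to know about the experts at all, since the group behaves like a single forecaster from the outside.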

MuhammadTaha / Predictive-Analysis

Crossvalidation #28