Splitting into train, val, and test sets and using them

For a fixed fold id, we split all tasks into train, val, and test sets. For model and hyperparameter selection, we trained on train, evaluated on val, and aggregated the results over 3 folds to select the best parameters. The final results are reported on 10 different folds, i.e., the 3 used for selection plus 7 more. We could have trained the best config only on the 7 remaining folds, but because the dataset is relatively small, we instead trained 10 models from scratch on train+val using the best config and evaluated each one on its test set. No model saw its test set until the final evaluation. Note that one cannot take an ensemble of the best models from the train/val splits and apply it to test, because tasks in the test set of one model appear in the train set of another.
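To make the leakage argument concrete, here is a minimal sketch. It does not use the real split function; split_tasks, its 60/20/20 proportions, and the task names are hypothetical stand-ins chosen only for illustration. It shows that the three sets are disjoint within a fold, while the test set of one fold overlaps train+val of another.

```python
import random


def split_tasks(task_ids, fold_id, val_fraction=0.2, test_fraction=0.2):
    """Hypothetical stand-in for the real split: seeded shuffle, then cut."""
    ids = sorted(task_ids)
    random.Random(fold_id).shuffle(ids)  # the fold id acts as the seed
    n_test = int(len(ids) * test_fraction)
    n_val = int(len(ids) * val_fraction)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


# Made-up task ids, only for the illustration.
all_tasks = [f"task{i:03d}" for i in range(100)]
folds = {fold_id: split_tasks(all_tasks, fold_id) for fold_id in range(10)}

# Within a single fold, train, val, and test never share a task.
train0, val0, test0 = folds[0]
within_fold_overlap = (set(train0) | set(val0)) & set(test0)

# Across folds, a model trained on train+val of fold 1 has already seen
# some tasks that belong to the test set of fold 0.
train1, val1, _ = folds[1]
leaked = set(test0) & (set(train1) | set(val1))

print("overlap within fold 0:", len(within_fold_overlap))
print("fold 0 test tasks seen by the fold 1 model:", len(leaked))
```

This is why each of the 10 final models is evaluated only on the test set of its own fold.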

The function that builds the splits is called get_fold: https://phyre.ai/docs/evaluator.html

create_balanced_eval_set has nothing to do with splitting tasks into train and test. It takes a preselected set of task ids (e.g., the train set or the validation set) and a set of actions, and builds a subset of the Cartesian product (task ids × actions) that contains an equal number of positive examples (actions that solve the task) and negative examples. Evaluating a model on a balanced subset gives a more meaningful estimate of progress, because most (task, action) pairs are negative.
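The idea of a balanced subset can be sketched as follows. This is not the actual create_balanced_eval_set code; balanced_pairs, the is_solved oracle, and the toy tasks and actions are all assumptions made up for the example.

```python
import random


def balanced_pairs(task_ids, actions, is_solved, size, seed=0):
    """Sample `size` (task_id, action_index) pairs: half positive, half negative.

    `is_solved(task_id, action)` is a hypothetical oracle that says whether
    the action solves the task (in practice it could be a cached simulation).
    """
    rng = random.Random(seed)
    positives, negatives = [], []
    for task_id in task_ids:
        for action_index, action in enumerate(actions):
            pair = (task_id, action_index)
            if is_solved(task_id, action):
                positives.append(pair)
            else:
                negatives.append(pair)
    half = size // 2
    if len(positives) < half or len(negatives) < half:
        raise ValueError("not enough positive or negative pairs for this size")
    return rng.sample(positives, half) + rng.sample(negatives, half)


# Toy setup: 4 tasks, 100 one-dimensional actions, and only actions
# with value >= 0.95 count as solutions, so positives are rare.
toy_tasks = ["a", "b", "c", "d"]
toy_actions = [i / 100 for i in range(100)]


def toy_is_solved(task_id, action):
    return action >= 0.95


eval_set = balanced_pairs(toy_tasks, toy_actions, toy_is_solved, size=20)
num_positive = sum(
    toy_is_solved(task_id, toy_actions[index]) for task_id, index in eval_set
)
print(len(eval_set), "pairs,", num_positive, "positive")
```

In the toy setup only 5% of all pairs are positive, so a model that always predicts "not solved" would score 95% accuracy on the full product but only 50% on the balanced subset.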

facebookresearch / phyre
