facebookresearch / phyre

PHYRE is a benchmark for physical reasoning.
https://phyre.ai
Apache License 2.0
431 stars 64 forks

The splitting to train val test sets and using them #21

Closed: alberbohar closed this issue 4 years ago

alberbohar commented 4 years ago

Hi, I'm a bit confused about how the validation and train sets are used. First, the paper says: "we use these tuned hyperparameter and train agents on the union of the training and validation sets". So, are you training on the validation set? Second, in the code at https://github.com/facebookresearch/phyre/blob/master/agents/neural_agent.py#L36 there is a function "create_balanced_eval_set", but it seems to prepare data for the training procedure. I'm trying to understand the boundaries, if any, between the train and val sets. Thanks.

akhti commented 4 years ago

For a fixed fold id, we split all tasks into train, val, and test sets. For model and hyperparameter selection, we trained on train, evaluated on val, and aggregated the results over 3 folds to select the best parameters. The final results are reported on 10 different folds, i.e., the 3 existing ones plus 7 more. We could have trained the best config on only the 7 remaining folds. But as the dataset is relatively small, we trained 10 models from scratch on train+val using the best config and evaluated them on test. None of these models saw its test set before the final evaluation. Note that one cannot take an ensemble of the best models from the train/val splits and apply it to test, because the test set of one model is training data for another model.
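The protocol described above can be sketched roughly as follows. This is only an illustration, not the PHYRE code: `toy_fold` is a hypothetical stand-in for the real fold-splitting function, its 60/20/20 ratio is an assumption, and `train_and_score` is a placeholder that returns a made-up score so the sketch runs.

```python
import random


def toy_fold(task_ids, seed):
    """Hypothetical fold function: seeded shuffle, then an assumed 60/20/20 split."""
    ids = sorted(task_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 3 // 5
    n_val = len(ids) // 5
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def train_and_score(config, fit_tasks, eval_tasks):
    """Placeholder for 'train a fresh agent on fit_tasks, score it on eval_tasks'."""
    # The evaluation tasks must never leak into the tasks used for fitting.
    assert not set(fit_tasks) & set(eval_tasks)
    # Made-up score so the sketch is runnable; a real agent would go here.
    return -abs(config - 0.3)


def select_config(task_ids, configs, tuning_seeds=(0, 1, 2)):
    """Step 1: train on train, score on val, average over 3 folds."""
    def mean_val_score(config):
        scores = []
        for seed in tuning_seeds:
            train, val, _test = toy_fold(task_ids, seed)
            scores.append(train_and_score(config, train, val))
        return sum(scores) / len(scores)

    return max(configs, key=mean_val_score)


def final_evaluation(task_ids, best_config, seeds=range(10)):
    """Step 2: for each of 10 folds, retrain from scratch on train+val, score on test."""
    results = []
    for seed in seeds:
        train, val, test = toy_fold(task_ids, seed)
        results.append(train_and_score(best_config, train + val, test))
    return results


tasks = [f"task_{i:02d}" for i in range(25)]
best = select_config(tasks, configs=[0.1, 0.3, 0.5])
test_scores = final_evaluation(tasks, best)
```

Within a single fold the test tasks are never used for selection or training. Across folds the splits differ, which is why models trained on different folds cannot be combined into an ensemble and scored on any one fold's test set.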

The function that builds the splits is called get_fold: https://phyre.ai/docs/evaluator.html

create_balanced_eval_set has nothing to do with splitting tasks into train and test. It takes a preselected set of task ids (e.g., the train set or the validation set) and a set of actions, and builds a subset of the Cartesian product (tasks × actions) that contains equal numbers of positive examples (actions that solve the task) and negative examples. Evaluating the model on a balanced subset gives a more meaningful estimate of progress, because most (task, action) pairs are negative.
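To make the idea of a balanced subset concrete, here is a minimal sketch. It is not the actual create_balanced_eval_set implementation: `balanced_subset`, `is_solved`, and `n_per_class` are hypothetical names, and the toy solver rule below is invented purely for illustration.

```python
import random


def balanced_subset(task_ids, actions, is_solved, n_per_class, seed=0):
    """Sample equal numbers of solving and non-solving (task, action) pairs."""
    positives, negatives = [], []
    # Enumerate the full Cartesian product of tasks and actions.
    for task in task_ids:
        for action in actions:
            bucket = positives if is_solved(task, action) else negatives
            bucket.append((task, action))
    # Never ask for more examples than the rarer class can supply.
    n = min(n_per_class, len(positives), len(negatives))
    rng = random.Random(seed)
    pairs = rng.sample(positives, n) + rng.sample(negatives, n)
    labels = [True] * n + [False] * n
    return pairs, labels


def toy_is_solved(task, action):
    """Invented rule: only 1 in 10 actions solves a task, so negatives dominate."""
    return action % 10 == 0


toy_tasks = ["t0", "t1", "t2"]
toy_actions = list(range(20))
pairs, labels = balanced_subset(toy_tasks, toy_actions, toy_is_solved, n_per_class=4)
```

In the toy setup only 6 of the 60 pairs are positive, so a model that always predicts "not solved" would score 90% accuracy on the full product but only 50% on the balanced subset.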