facebookresearch / phyre

PHYRE is a benchmark for physical reasoning.
https://phyre.ai
Apache License 2.0

Cross-template folds never contain template 6 in test or dev template sets (10 folds) #40

Closed: augustinharter closed this issue 3 years ago.

augustinharter commented 3 years ago

Hi,

We have a question about the baseline results for the cross-template setting (PHYRE paper, Table 1, PHYRE-B, DQN, Cross).

Do we understand correctly that each cross-template fold has 16 training templates, and that the remaining 9 templates (dev and test) (link) are used to measure AUCCESS?
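
One way to check the per-fold split sizes is to group a fold's task ids by template. The sketch below assumes PHYRE-style task ids of the form 'TTTTT:NNN', where the part before the colon is the template id, and takes the three task lists of a single fold as plain Python lists. The helper names and the toy task ids are hypothetical, not part of the PHYRE codebase.

```python
from typing import Dict, Iterable, Set


def template_of(task_id: str) -> str:
    """Return the template part of a task id such as '00006:017' (assumed format)."""
    return task_id.split(":", 1)[0]


def templates_per_split(train: Iterable[str], dev: Iterable[str],
                        test: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each split name to the set of templates its tasks come from."""
    return {
        "train": {template_of(t) for t in train},
        "dev": {template_of(t) for t in dev},
        "test": {template_of(t) for t in test},
    }


def check_cross_template(splits: Dict[str, Set[str]]) -> None:
    """In a cross-template fold, no template should appear in more than one split."""
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = splits[a] & splits[b]
            if overlap:
                raise ValueError(f"templates shared by {a} and {b}: {sorted(overlap)}")


# Toy fold with made-up task ids: 3 train templates, 1 dev, 1 test.
train = ["00000:000", "00000:001", "00001:000", "00002:000"]
dev = ["00003:000"]
test = ["00004:000", "00004:001"]
splits = templates_per_split(train, dev, test)
check_cross_template(splits)
print({name: len(ts) for name, ts in splits.items()})  # {'train': 3, 'dev': 1, 'test': 1}
```

On a real cross-template fold, the printed counts would show directly whether the split is 16 train templates versus 9 held-out ones.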

We found that template 6 never appears among the test templates in any of the 10 cross-template folds provided in the baseline code (link to 'fold method'); as a result, template 6 does not contribute to the final cross-template AUCCESS. Is this intentional, or did it just happen that way?
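
The observation can be reproduced by counting, for each template, how many folds hold it out. This is a minimal sketch assuming each fold is a (train, dev, test) triple of task ids in the 'TTTTT:NNN' format. With PHYRE installed, the folds could presumably be loaded with something like [phyre.get_fold('ball_cross_template', seed) for seed in range(10)], but treat that call and the eval-setup name as assumptions to verify against the PHYRE documentation. The helper names and toy data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

# One fold: (train, dev, test) lists of task ids.
Fold = Tuple[Sequence[str], Sequence[str], Sequence[str]]


def template_of(task_id: str) -> str:
    """Template part of a task id, assuming the 'TTTTT:NNN' format."""
    return task_id.split(":", 1)[0]


def held_out_counts(folds: Iterable[Fold], include_dev: bool = False) -> Counter:
    """Count, per template, the number of folds in which it is held out."""
    counts: Counter = Counter()
    for _train, dev, test in folds:
        held_out = {template_of(t) for t in test}
        if include_dev:
            held_out |= {template_of(t) for t in dev}
        counts.update(held_out)  # each template counted at most once per fold
    return counts


def never_held_out(folds: Iterable[Fold], all_templates: Iterable[str],
                   include_dev: bool = False) -> List[str]:
    """Sorted list of templates that are never held out in any fold."""
    counts = held_out_counts(folds, include_dev)
    return sorted(t for t in all_templates if counts[t] == 0)


# Toy example: 4 templates, 2 folds; template '00002' is always in train.
folds = [
    (["00000:000", "00002:000"], ["00001:000"], ["00003:000"]),
    (["00002:000", "00003:000"], ["00000:000"], ["00001:000"]),
]
templates = ["00000", "00001", "00002", "00003"]
print(never_held_out(folds, templates))                    # ['00000', '00002']
print(never_held_out(folds, templates, include_dev=True))  # ['00002']
```

The include_dev flag distinguishes "never in test" from "never in test or dev", which is the stronger condition described in the issue title.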

akhti commented 3 years ago

Hi @augustinharter ,

Do we understand correctly that each cross-template fold has 16 training templates, and that the remaining 9 templates (dev and test) (link) are used to measure AUCCESS?

Table 1 reports the final results. For those, we trained on train+dev using the hyperparameters found in preliminary experiments and evaluated on test. In the preliminary experiments, we trained all models on train and evaluated on dev.
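
The two-stage protocol described above can be summarized as a small helper that picks which task lists are used for fitting and which for evaluation. The function name and signature are hypothetical and not taken from the PHYRE baselines; the sketch only illustrates the roles of train, dev and test in each stage.

```python
from typing import List, Sequence, Tuple


def fit_and_eval_sets(train: Sequence[str], dev: Sequence[str], test: Sequence[str],
                      final: bool) -> Tuple[List[str], List[str]]:
    """Return (tasks to train on, tasks to evaluate on) for one fold.

    Preliminary runs: fit on train, evaluate on dev (hyperparameter search).
    Final runs: fit on train + dev with the chosen hyperparameters, evaluate on test.
    """
    if final:
        return list(train) + list(dev), list(test)
    return list(train), list(dev)


fit, ev = fit_and_eval_sets(["a"], ["b"], ["c"], final=True)
print(fit, ev)  # ['a', 'b'] ['c']
```

Under this protocol, only the test templates of each fold are evaluated for the numbers in Table 1.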

We found that template 6 never appears among the test templates in any of the 10 cross-template folds provided in the baseline code (link to 'fold method'); as a result, template 6 does not contribute to the final cross-template AUCCESS. Is this intentional, or did it just happen that way?

That's a very interesting observation! It was not planned that way: we use random allocation, and apparently some tasks got luckier than others.
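
A rough calculation suggests this outcome is not unusual under random allocation. The sketch below assumes, purely for illustration, that each of the 10 folds independently draws 9 held-out (dev and test) templates uniformly at random from 25, using the split sizes from the question; PHYRE's actual fold generation may differ. Under that assumption, a given template is never held out with probability (16/25)^10 ≈ 0.0115. Inclusion-exclusion then puts the chance that at least one template is never held out at roughly 0.26.

```python
from math import comb


def p_set_never_held_out(n_templates: int, n_held_out: int, n_folds: int, j: int) -> float:
    """P(a fixed set of j templates is never held out in any fold).

    Assumes each fold independently picks n_held_out templates uniformly at random.
    """
    per_fold = comb(n_templates - j, n_held_out) / comb(n_templates, n_held_out)
    return per_fold ** n_folds


def p_any_never_held_out(n_templates: int, n_held_out: int, n_folds: int) -> float:
    """P(at least one template is never held out), by inclusion-exclusion."""
    return sum(
        (-1) ** (j + 1) * comb(n_templates, j)
        * p_set_never_held_out(n_templates, n_held_out, n_folds, j)
        for j in range(1, n_templates + 1)
    )


# Split sizes taken from the question: 25 templates, 9 held out per fold, 10 folds.
p_one = p_set_never_held_out(25, 9, 10, 1)
p_any = p_any_never_held_out(25, 9, 10)
print(f"P(given template never held out) = {p_one:.4f}")
print(f"expected number never held out   = {25 * p_one:.3f}")
print(f"P(some template never held out)  = {p_any:.3f}")
```

With more folds or a larger held-out fraction per fold, these probabilities shrink quickly. A deterministic scheme that guarantees every template is held out at least once would avoid the issue altogether.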