I read your paper with great interest and would like to reproduce the results on the ENZYMES dataset. First of all, thank you for the great paper and code.

However, I cannot reproduce the reported performance, and the DiffPool implementation in pytorch-geometric does not reproduce it either. (Depending on the random seed, the performance varies severely.) Could you share details of the experimental environment and the data?
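For anyone hitting the same seed-to-seed variance, here is a minimal sketch of pinning the random seeds in a PyTorch / PyTorch Geometric run, assuming only the standard `random`, `numpy`, and `torch` APIs. It narrows, but does not eliminate, the run-to-run spread:

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 0) -> None:
    """Pin the common RNGs used in a PyTorch training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds CPU (and recent CUDA) RNGs
    torch.cuda.manual_seed_all(seed)  # explicit, for older torch versions
    # Force deterministic cuDNN kernels; this can slow training down.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_seed(42)
```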
Hi, thanks for the question. We only reported the 10-fold validation performance: the epoch with the best average validation performance over the 10 different splits. This is mainly because:

1) There is no standard split for these graph datasets. When averaging across different splits, the variance is high and the ranking of GNN performance changes (similar to, but more severe than, what https://arxiv.org/abs/1811.05868 reports). Validation-loss comparison, by contrast, is quite stable even across different algorithms.

2) The advantage of early stopping for picking the best validation epoch for testing seems very limited, again because of the high performance variance of all GNNs. Choosing the best validation epoch on a single split is often a result of overfitting, but the best epoch for the average over 10 splits has much lower variance, and the results are consistent across all baselines and our methods. Hence we picked the best epoch over the average of the 10 splits, similar to https://github.com/weihua916/powerful-gnns.

On the ENZYMES dataset, for example, picking the best epoch for the average of the 10 splits is only about 1% higher than the average performance at a fixed epoch (say, 500), and this advantage is the same for all methods, including the baselines. I'll add the 10-fold cross-validation with a test set as well.
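A minimal sketch of the selection rule described above, assuming per-epoch validation (and, optionally, test) accuracies have been recorded for each of the 10 folds; the array names are illustrative, not from the repo:

```python
import numpy as np

# val_acc[f, e]:  validation accuracy of fold f at epoch e
# test_acc[f, e]: test accuracy of fold f at epoch e
# Both have shape (num_folds=10, num_epochs).


def select_best_epoch(val_acc: np.ndarray) -> int:
    """Pick the single epoch whose validation accuracy, averaged
    over all folds, is highest (not per-fold early stopping)."""
    return int(val_acc.mean(axis=0).argmax())


def report(val_acc: np.ndarray, test_acc: np.ndarray) -> None:
    best = select_best_epoch(val_acc)
    print(f"best epoch (avg over folds): {best}")
    print(f"val  acc: {val_acc[:, best].mean():.4f} "
          f"+/- {val_acc[:, best].std():.4f}")
    # Test accuracy is read off at the *same* epoch chosen on validation.
    print(f"test acc: {test_acc[:, best].mean():.4f} "
          f"+/- {test_acc[:, best].std():.4f}")
```

Per-fold early stopping would instead take `val_acc.argmax(axis=1)` separately for each fold, which is the overfitting-prone variant argued against above.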