RexYing / diffpool


About Dataset and Performance #17

Closed gtshs2 closed 5 years ago

gtshs2 commented 5 years ago

I read your paper with great interest and want to reproduce its performance on the ENZYMES dataset. First of all, thank you for the great paper and code.

However, I cannot reproduce the reported performance, and the DiffPool implementation in pytorch-geometric does not reproduce the paper's numbers either. (Depending on the random seed, the variation in performance is severe.) I am curious about the experimental setup and how the data was handled.

  1. Did you split the data into train/validation, or into train/validation/test? (The benchmark_task_val in your code does not split off a test set.)
  2. What are the train/validation/test split percentages?
  3. Is the performance reported in the paper on the validation set or the test set? Did you report the average 10-fold validation performance without an explicit test set?
RexYing commented 5 years ago

Hi, thanks for the question. We reported only the 10-fold validation performance: the epoch with the best average validation performance over the 10 different splits. This is mainly because:

1) There is no standard split for these graph datasets, and across different splits the variance is high and the ranking of GNN performance changes (similar to, but more severe than, what is described in https://arxiv.org/abs/1811.05868). Validation-loss comparison, however, is quite stable even across different algorithms.

2) The advantage of early stopping (picking the best validation epoch for testing) seems very limited, also due to the high performance variance of all GNNs. Although choosing the best validation epoch for a single split is often a result of overfitting, the best epoch averaged over 10 splits has much lower variance, and the results are consistent across all baselines and our methods.

Hence we picked the best epoch over the average of the 10 splits, similar to https://github.com/weihua916/powerful-gnns.
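To make the selection protocol concrete, here is a minimal sketch (not the repo's actual code; the array names, shapes, and random stand-in data are hypothetical) of picking the single epoch whose 10-fold average validation accuracy is best:

```python
import numpy as np

# Hypothetical training logs: val_acc[f, e] = validation accuracy of
# fold f at epoch e during 10-fold cross-validation.
num_folds, num_epochs = 10, 1000
val_acc = np.random.rand(num_folds, num_epochs)  # stand-in for real logs

# Average validation accuracy across the 10 folds at each epoch.
mean_acc_per_epoch = val_acc.mean(axis=0)  # shape: (num_epochs,)

# Report the single epoch whose 10-fold average is best, rather than
# the best epoch of each individual fold (which overfits one split).
best_epoch = int(mean_acc_per_epoch.argmax())
print(f"best epoch: {best_epoch}, "
      f"reported 10-fold val accuracy: {mean_acc_per_epoch[best_epoch]:.4f}")
```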

For the ENZYMES dataset, for example, picking the best epoch over the average of the 10 splits is only about 1% higher than the average performance at a fixed epoch (say 500), and this advantage is the same for all methods, including baselines. But I'll add 10-fold cross-validation with a test set as well.
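For the test-set variant mentioned above, here is a hedged sketch of one common way to run 10-fold cross-validation with a held-out test fold (again with hypothetical names and stand-in data, not code from this repo): per split, select the epoch on validation accuracy, then report the test accuracy at that epoch.

```python
import numpy as np

# Hypothetical logs from a protocol where each of the 10 splits has a
# disjoint validation and test portion: accuracy of fold f at epoch e.
num_folds, num_epochs = 10, 1000
rng = np.random.default_rng(0)  # stand-in data for illustration
val_acc = rng.random((num_folds, num_epochs))
test_acc = rng.random((num_folds, num_epochs))

# Per fold: pick the epoch with the best *validation* accuracy,
# then report the *test* accuracy at that same epoch.
best_epochs = val_acc.argmax(axis=1)                    # shape: (num_folds,)
fold_test = test_acc[np.arange(num_folds), best_epochs]

print(f"10-fold test accuracy: {fold_test.mean():.4f} +/- {fold_test.std():.4f}")
```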