ICT-GIMLab / SeHGNN


Inconsistency between SeHGNN paper results and actual running results #5

Closed · GooLiang closed this issue 1 year ago

GooLiang commented 1 year ago

Hello author, I ran the code in hgb to reproduce the accuracy reported in the paper, but my results seem to be inconsistent with it. For example, when I run the SeHGNN code without the variant on the ACM dataset, the resulting f1-score is around 94.6, which does not match the macro-f1 of 94 and micro-f1 of 93.98 reported in the paper. What is the reason for this?

GooLiang commented 1 year ago

In addition, when I fix seed=1, the results of multiple runs are inconsistent. The problem persists even if I remove the --amp parameter.

Yangxc13 commented 1 year ago

Thanks for your attention.

For the first question, as mentioned near the bottom of ./hgb/Readme.md, the HGB benchmark only makes 50% of the test-set labels public. Therefore, for offline evaluation, the reported results are based on only half of the test data. To reproduce the results in our paper, please submit the output *.txt file to the HGB website for online evaluation.

Specifically, in the data loading code [link], test_nid contains the ids of half of the test data and test_nid_full contains the ids of all test data. The output files are generated with all test data here [link].
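
To illustrate the difference, here is a minimal sketch of the two evaluation modes. The variable names test_nid and test_nid_full follow the description above, but the helpers themselves are assumptions for illustration, not code from this repository, and the exact submission format expected by HGB may differ:

```python
# Minimal sketch, assuming numpy arrays of predicted and public labels.
# Not part of the SeHGNN repository; for illustration only.
import numpy as np
from sklearn.metrics import f1_score


def offline_scores(pred, labels, test_nid):
    """F1 scores on the half of the test nodes whose labels HGB releases."""
    macro = f1_score(labels[test_nid], pred[test_nid], average='macro')
    micro = f1_score(labels[test_nid], pred[test_nid], average='micro')
    return macro, micro


def dump_online_submission(pred, test_nid_full, path='preds.txt'):
    """Write predictions for all test nodes, to be submitted to the HGB website."""
    np.savetxt(path, pred[test_nid_full], fmt='%d')
```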

For the second question, please provide more information. I ran the following command twice and got the same results both times.

python main.py --stage 200 --dataset ACM --act leaky-relu --n-layers-1 2 --n-layers-2 3 --num-hops 4 --num-label-hops 3 --label-feats --hidden 512 --embed-size 512 --amp --seeds 1
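
For reference, making runs reproducible with a fixed seed usually means seeding every randomness source. The sketch below shows what a --seeds option typically does; it is an assumption for illustration and may differ from the actual code in this repository:

```python
# Sketch of seeding all randomness sources for reproducible runs.
import random

import numpy as np
import torch
import dgl


def set_seed(seed: int = 1):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    dgl.seed(seed)  # available in recent dgl versions
    # Trade some speed for deterministic cuDNN behavior.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```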
GooLiang commented 1 year ago

Thanks for your reply. The first problem has been solved, but I'm still a bit confused: in the HGB code, only the ids of half of the test data are used for online evaluation, while the SeHGNN code uses the ids of all test data. What is the difference between the two?

For the second question, I ran the following command twice but got different results:

python main.py --stage 20 --dataset ACM --act leaky-relu --n-layers-1 2 --n-layers-2 3 --num-hops 4 --num-label-hops 3 --label-feats --hidden 512 --embed-size 512 --amp --seeds 1 --gpu 1

[Screenshots of the two runs showing different results]
Yangxc13 commented 1 year ago

Online evaluation -> all test nodes | Offline evaluation -> the half of the test nodes whose labels are public

If we strictly follow HGB's rules, we should report results on all test nodes through online evaluation. That is what we do in the Experiment section of our paper.

However, HGB's evaluation website only accepts three submissions per day, which cannot satisfy our needs when trying different algorithms or conducting comparison experiments. For example, when verifying Finding 2 on the DBLP dataset in the Motivation section of our paper, we tried 32 combinations of metapaths and layer numbers. To mitigate the influence of noise, each experiment was repeated 20 times with different train-val data partitions. It would have taken more than one month to collect all of those results through online evaluation.
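
To give an idea of that experimental loop, here is a rough sketch of repeating a run over different train-val partitions and averaging the offline scores. It is an assumption for illustration, not code from the repository; run_experiment stands in for the actual training-and-evaluation routine:

```python
# Sketch: repeat an experiment over different train-val splits and
# report the mean and standard deviation of the offline score.
import numpy as np


def repeated_offline_eval(run_experiment, train_val_nid, n_repeats=20, val_ratio=0.2):
    scores = []
    for rep in range(n_repeats):
        rng = np.random.default_rng(rep)
        perm = rng.permutation(train_val_nid)
        n_val = int(len(perm) * val_ratio)
        val_nid, train_nid = perm[:n_val], perm[n_val:]
        scores.append(run_experiment(train_nid, val_nid))  # returns an offline f1 score
    return float(np.mean(scores)), float(np.std(scores))
```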

Offline evaluation is actually a compromise, since HGB does not make all test labels public. HGB releases the labels of only half of the test nodes, and we call evaluation on this half "offline evaluation". It is a strong enough proxy (using 50% of the results to represent 100%) for results on all test nodes: models that perform well under offline evaluation will also perform well under online evaluation, with only slight differences in the final scores.

Actually, for most of the development of SeHGNN, we evaluated each model iteration through offline evaluation. We did not use online evaluation until we considered the current version well-developed and it was time to check results on all test nodes. For readers, we also recommend offline evaluation for fast iteration when developing your own models.


I still cannot reproduce the bug in the second question. Maybe you can try re-cloning this repository into an empty folder. I suggest we discuss this problem further by e-mail.

Yangxc13 commented 1 year ago

After discussion with @GooLiang, we finally found that problem 2 (being unable to reproduce results with the same seed) was caused by using an older version of dgl (version 0.4.3). After updating dgl to version 0.8, the problem was solved.

We recommend that researchers use the latest version of dgl. Thanks to @GooLiang for his work!
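
As a simple safeguard (not part of the repository), you could check the installed dgl version before running experiments:

```python
# Hypothetical pre-flight check: fail early if the installed dgl is older
# than 0.8, since fixed-seed runs were not reproducible with dgl 0.4.3.
import dgl
from packaging import version

if version.parse(dgl.__version__) < version.parse("0.8.0"):
    raise RuntimeError(
        f"dgl {dgl.__version__} detected; please upgrade to >= 0.8 "
        "so that runs with a fixed seed are reproducible."
    )
```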