dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

[Bug] Unable to reproduce the results of some heterograph examples #908

Closed: futurely closed this issue 2 years ago

futurely commented 5 years ago

🐛 Bug

Running the RGCN example multiple times gives results that are not always consistent across runs and do not always match the reported numbers.

| Dataset | Reported Accuracy | Accuracies of Reruns | Losses of Reruns |
|---------|-------------------|----------------------|------------------|
| AIFB | 97.22% (DGL), 95.83% (paper) | 0.9722 / 0.9722 / 0.9722 | 0.8165 / 0.7983 / 0.7981 |
| MUTAG | 73.53% (DGL), 73.23% (paper) | 0.6912 / 0.7353 / 0.7206 | 0.5819 / 0.5592 / 0.5683 |
| BGS | 93.10% (DGL), 83.10% (paper) | 0.9310 / 0.8966 / 0.8966 | 0.3893 / 0.4152 / 0.3973 |
| AM | 91.41% (DGL), 89.29% (paper) | 0.9091 / 0.9091 / 0.9091 | 1.6796 / 1.6790 / 1.6916 |

The results of running HAN multiple times are consistent but are not the same as the reported results.

| | Micro F1 | Macro F1 | Test Loss |
|---|----------|----------|-----------|
| Paper | 89.22 | 89.40 | |
| DGL | 88.99 | 89.02 | |
| Softmax regression (own dataset) | 89.66 | 89.62 | |
| DGL (own dataset) | 91.51 | 91.66 | |
| DGL rerun | 87.91 | 87.91 | 0.3889 |
| DGL (own dataset) rerun | 91.94 | 92.03 | 0.2351 |

To Reproduce

Steps to reproduce the behavior:

https://github.com/dmlc/dgl/tree/master/examples/pytorch/rgcn-hetero#entity-classification

  1. python3 entity_classify.py -d aifb --testing --gpu 0
  2. python3 entity_classify.py -d mutag --l2norm 5e-4 --n-bases 30 --testing --gpu 0
  3. python3 entity_classify.py -d bgs --l2norm 5e-4 --n-bases 40 --testing --gpu 0
  4. python3 entity_classify.py -d am --l2norm 5e-4 --n-bases 40 --testing --gpu 0

https://github.com/dmlc/dgl/tree/master/examples/pytorch/han

  1. python main.py
  2. python main.py --hetero

Expected behavior

Reproducible experimental results across different runtime environments.

Environment

Additional context

A script is needed to print the above environment information automatically.
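Something like the sketch below could serve as a starting point (not an official DGL utility; which fields to report is my assumption):

```python
import platform
import sys

def report_environment():
    """Print the library versions most relevant to reproducing results."""
    print("Python:", sys.version.split()[0])
    print("OS:", platform.platform())
    try:
        import torch
        print("PyTorch:", torch.__version__)
        print("CUDA available:", torch.cuda.is_available())
        if torch.cuda.is_available():
            print("CUDA version:", torch.version.cuda)
            print("GPU:", torch.cuda.get_device_name(0))
    except ImportError:
        print("PyTorch: not installed")
    try:
        import dgl
        print("DGL:", dgl.__version__)
    except ImportError:
        print("DGL: not installed")

if __name__ == "__main__":
    report_environment()
```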

jermainewang commented 5 years ago

Hi @futurely, the results do vary across runs, as the author also notes. Here are the results of ten runs:

[image: results of ten runs]

I will update the readme to clarify the results.

futurely commented 5 years ago

HAN reruns produce identical results in the same environment once the random seed is set, so the differing results across environments must be caused by something else.

RGCN also needs a fixed random seed to get repeatable results.

A reproducible environment can be obtained with Docker.
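For reference, a typical seed-fixing routine looks something like the following (a sketch of common PyTorch practice, not DGL-specific code; as discussed below, it does not make concurrent CUDA kernels deterministic):

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    """Fix the random number generators a PyTorch training script draws from.

    This makes CPU runs repeatable, but CUDA atomic operations can still
    introduce run-to-run differences. Newer DGL releases also expose
    dgl.seed() for DGL's own generators.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```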

jermainewang commented 5 years ago

With more runs, the averaged outcome becomes more and more stable. Deterministic behavior is useful for debugging but is not necessary for model performance, and fixing the random seed cannot solve everything, especially when the system uses concurrency that affects numerical outcomes. That being said, I think reporting the averaged result from multiple runs is fine (and well acknowledged), and reporting the standard deviation or min/max range is recommended when the variance is large.
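For example, aggregating the metric over repeated runs might look like this (the accuracy values below are hypothetical):

```python
import statistics

# Hypothetical test accuracies from ten independent runs.
accuracies = [0.9722, 0.9722, 0.9444, 0.9722, 0.9167,
              0.9722, 0.9444, 0.9722, 0.9722, 0.9444]

mean = statistics.mean(accuracies)
std = statistics.stdev(accuracies)
print(f"accuracy: {mean:.4f} +/- {std:.4f} "
      f"(min {min(accuracies):.4f}, max {max(accuracies):.4f})")
```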

Edit: @mufeili would you please take a look at the HAN result?

futurely commented 5 years ago

Very little GNN research repeats randomized experiments multiple times to compare both the averages and the standard deviation ranges.

A good example is *Keep It Simple: Graph Autoencoders Without Graph Convolutional Networks*, which uses metrics "averaged over 100 runs with different random train/validation/test splits" to show that a linear autoencoder is competitive with multi-layer GCN autoencoders.
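A sketch of that protocol (with a hypothetical `train_and_eval` callback standing in for the actual training routine):

```python
import numpy as np

def random_split_eval(num_nodes, train_and_eval, runs=100,
                      train_frac=0.8, seed=0):
    """Average a metric over `runs` random train/test splits.

    `train_and_eval(train_idx, test_idx)` is a hypothetical callback that
    trains a model on one split and returns its test metric.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(runs):
        perm = rng.permutation(num_nodes)
        cut = int(train_frac * num_nodes)
        scores.append(train_and_eval(perm[:cut], perm[cut:]))
    return float(np.mean(scores)), float(np.std(scores))
```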

yzh119 commented 5 years ago

@futurely, dgl uses atomic operations in its CUDA kernels, and we cannot guarantee determinism even if all random seeds are fixed. (PyTorch has similar issues for several operators: https://pytorch.org/docs/stable/notes/randomness.html.)

Although I don't think it's a good habit for ML researchers to report the best metric under a fixed random seed rather than the average metric over multiple runs with different seeds, I understand why they do so. Yes, we will try to remove the atomic operations in dgl 0.5 and guarantee determinism.

In my experience, the non-determinism affects the result very little when the dataset is relatively large. If a GNN model's performance on small datasets (I'm not singling out cora/citeseer/pubmed, but they actually are small) differs substantially just because of the ordering randomness in atomic operations (0.001 + 0.1 + 0.01 versus 0.01 + 0.001 + 0.1), researchers should either turn to a larger, less fragile dataset or report the average result over multiple runs so the numbers are more convincing. If a paper claims its model outperforms a baseline by 0.x under a fixed random seed, who knows whether that is random noise or substantial progress from the model itself.
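The parenthetical above is the key point: floating-point addition is not associative, so the order in which atomic updates land changes the low-order bits of a reduction. A quick illustration with the classic 0.1/0.2/0.3 values:

```python
# Floating-point addition is not associative, so summation order matters.
print((0.1 + 0.2) + 0.3)  # 0.6000000000000001
print(0.1 + (0.2 + 0.3))  # 0.6
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```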

futurely commented 5 years ago

There are too many factors that make model performance hard to reproduce and compare. It is necessary to benchmark the representative algorithms with the same framework, datasets (including preprocessing), runtime environment, and hyperparameters. The hyperparameters for each algorithm should not simply be the defaults from the original papers or implementations but should be thoroughly (auto-)tuned.

PyG has a benchmark suite covering a few typical tasks on some small datasets.

Google benchmarked classic object detection algorithms with production-level implementations.

yzh119 commented 5 years ago

@futurely I agree with all your points. What I mean is that for small datasets, researchers should report their models' average performance across multiple runs with different random seeds, or the result does not mean much.

futurely commented 5 years ago

I also agree with you.

My point is that a benchmark suite implementing these best practices should be added to DGL or a related repo. The suite could be run frequently to show the latest improvements in model quality and speed, which would help attract more researchers to implement algorithms with DGL and contribute back.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

github-actions[bot] commented 2 years ago

This issue is closed due to lack of activity. Feel free to reopen it if you still have questions.

962673247 commented 11 months ago

> @futurely, dgl uses atomic operations in CUDA kernels, and we cannot guarantee determinism even if all random seeds are fixed. (PyTorch has similar issues for several operators: https://pytorch.org/docs/stable/notes/randomness.html.)
>
> Although I don't think it's a good habit for ML researchers to report the best metric under a fixed random seed rather than the average metric over multiple runs with different seeds, I understand why they do so. Yes, we will try to remove the atomic operations in dgl 0.5 and guarantee determinism.
>
> In my experience, the non-determinism affects the result very little when the dataset is relatively large. If a GNN model's performance on small datasets (I'm not singling out cora/citeseer/pubmed, but they actually are small) differs substantially just because of the randomness in atomic operations (0.001 + 0.1 + 0.01 or 0.01 + 0.001 + 0.1?), researchers should turn to a larger, less fragile dataset or report the average result over multiple runs so the numbers are more convincing. If a paper claims its model outperforms a baseline by 0.x under a fixed random seed, who knows whether that is random noise or substantial progress from the model itself.

It is now November 2023. Is there now a solution to the problem of being unable to reproduce DGL's results?