Closed: @futurely closed this issue 2 years ago
Hi @futurely , the results do vary across different runs, as noted by the author as well. Here are the results of ten runs:
I will update the readme to clarify the results.
With more runs, the averaged outcome becomes increasingly stable. Deterministic behavior is useful for debugging but not necessary for model performance. A fixed random seed cannot solve everything, especially when the system has concurrency that affects numerical outcomes. That said, I think reporting the averaged result of multiple runs is fine (and well acknowledged), and reporting the standard deviation or min/max range is recommended if the variance is large.
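A minimal sketch of this reporting style, assuming a hypothetical `run_experiment(seed)` that stands in for one full train/evaluate cycle (the dummy scores here are for illustration only):

```python
import statistics

def run_experiment(seed: int) -> float:
    # Hypothetical stand-in: a real version would seed the RNGs,
    # train the model, and return the test accuracy for this run.
    # Here it just returns a deterministic dummy score.
    return 0.80 + (seed % 5) * 0.001

# Run with several different seeds and report mean, std, and range.
scores = [run_experiment(seed) for seed in range(10)]
mean = statistics.mean(scores)
std = statistics.stdev(scores)
print(f"accuracy: {mean:.4f} +/- {std:.4f} "
      f"(min {min(scores):.4f}, max {max(scores):.4f})")
```

Reporting the min/max range alongside the mean makes it immediately visible when a single lucky seed is responsible for a headline number.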
Edit: @mufeili would you please take a look at the HAN result?
Very few GNN studies repeat experiments with different random seeds to compare both the average values and the standard deviation ranges.
A good example is Keep It Simple: Graph Autoencoders Without Graph Convolutional Networks, which uses metrics “averaged over 100 runs with different random train/validation/test splits” to show that a linear autoencoder is competitive with multi-layer GCN autoencoders.
@futurely , dgl uses atomic operations in CUDA kernels, so we cannot guarantee determinism even if we fix all random seeds. (PyTorch has similar issues for several operators: https://pytorch.org/docs/stable/notes/randomness.html).
I don't think it's a good habit for ML researchers to report the best metric under a fixed random seed rather than the average metric over multiple runs with different random seeds, but I understand why they do so. Yes, we will try to remove atomic operations in dgl 0.5 and guarantee determinism.
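For reference, fixing the seeds is straightforward on the Python side; the point above is that seeding alone still does not control the accumulation order inside atomic CUDA kernels. A sketch, assuming a CPU-only, single-threaded setup (the framework calls mentioned in the docstring are the usual additions in a real DGL/PyTorch setup):

```python
import random

def set_seed(seed: int) -> None:
    """Fix Python's RNG. In a full DGL/PyTorch setup you would also call
    numpy.random.seed(seed) and torch.manual_seed(seed), plus DGL's own
    seeding (the exact API depends on the installed versions)."""
    random.seed(seed)

set_seed(0)
first = [random.random() for _ in range(3)]
set_seed(0)
second = [random.random() for _ in range(3)]
# Same seed, same single-threaded code path -> identical draws.
assert first == second
```

This reproducibility guarantee breaks down as soon as the order of floating-point accumulation depends on thread scheduling, which is exactly the atomic-operation case discussed here.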
In my experience, the non-determinism affects the result very little if the dataset is relatively large. If the performance of a GNN model on small datasets (I'm not recommending cora/citeseer/pubmed, but they do fall into this category) differs substantially just because of the random summation order in atomic operations (0.001 + 0.1 + 0.01 vs. 0.01 + 0.001 + 0.1), researchers had better turn to a larger dataset (one less fragile) or report the average result of multiple runs so that the results are more convincing. If a paper claims its model outperforms a baseline by 0.* with a fixed random seed, who knows whether that is random noise or substantial progress made by the model itself.
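The summation-order effect mentioned above is just floating-point non-associativity, which can be seen in pure Python (using the classic 0.1/0.2/0.3 example, where the difference shows up directly):

```python
# Floating-point addition is not associative, so the order in which
# atomic operations accumulate values can change the low-order bits.
left_to_right = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right_to_left = 0.1 + (0.2 + 0.3)   # 0.6
print(left_to_right == right_to_left)  # False
```

On its own the discrepancy is tiny, but after millions of accumulations inside a training run it can nudge a metric by enough to matter on a small, noisy dataset.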
There are too many factors that make model performance hard to reproduce and compare. It is necessary to benchmark the representative algorithms with the same framework, datasets (including preprocessing), runtime environment, and hyperparameters. The hyperparameters for each algorithm should not simply be the defaults from the original papers or implementations but should be thoroughly (auto-)tuned.
PyG has a benchmark suite on a few typical tasks with some small datasets.
Google benchmarked classic object detection algorithms with production-level implementations.
@futurely I agree with all your points. What I mean is: for small datasets, researchers should report their models' average performance across multiple runs with different random seeds; otherwise the result does not mean much.
I also agree with you.
My point is that a benchmark suite implementing best practices should be added to DGL or a related repo. The suite could be run frequently to show the latest improvements in model quality and speed. It would help attract more researchers to implement algorithms with DGL and contribute back.
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
This issue is closed due to lack of activity. Feel free to reopen it if you still have questions.
It is now November 2023. Do you have a solution to the problem of being unable to reproduce DGL's results?
🐛 Bug
The results of running RGCN multiple times are not always consistent and are not always the same as the reported results.
The results of running HAN multiple times are consistent but are not the same as the reported results.
To Reproduce
Steps to reproduce the behavior:
https://github.com/dmlc/dgl/tree/master/examples/pytorch/rgcn-hetero#entity-classification
https://github.com/dmlc/dgl/tree/master/examples/pytorch/han
Expected behavior
Reproducible experimental results across different runtime environments.
Environment
How you installed DGL (conda, pip, source): pip
Additional context
Need a script to print the above environment information automatically.