Open CfromBU opened 2 months ago
dgl v2.1.0 can run normally, but dgl v2.2.0 returns this error.
GraphSAGE + ogbn-products + num_samplers = 2
dgl v2.3 + torch v2.1: works well.
dgl v2.3 + torch v2.2: not tested.
dgl v2.3 + torch v2.3: crashed with this issue.
dgl v2.4 + torch v2.4: works well.
🐛 Bug
Running dgl/tools/launch.py reports failures when num_samplers = 1 and raises _frozen_importlib._DeadlockError when num_samplers > 1.
To Reproduce
Steps to reproduce the behavior:
1. When num_samplers = 1, run:
python3 dgl/tools/launch.py \
  --workspace dgl/examples/distributed/graphsage/ \
  --num_trainers 2 \
  --num_samplers 1 \
  --num_servers 1 \
  --part_config data/ogbn-products.json \
  --ip_config ip_config.txt \
  --num_omp_threads 16 \
  "python node_classification.py --graph_name ogbn-products --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
output: Traceback (most recent call last):
2. When num_samplers > 1:
output:
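The cases in this report differ only in the value passed to --num_samplers, so it may help to generate the commands from one template and run them back to back. The following is a minimal sketch, not part of DGL: `build_launch_cmd` and `TRAIN_CMD` are hypothetical names, and every flag and path is copied from the command in step 1. It only prints the commands and does not execute them.

```python
import shlex

# Training command passed to launch.py as a single quoted argument
# (copied verbatim from step 1 of this report).
TRAIN_CMD = (
    "python node_classification.py --graph_name ogbn-products "
    "--ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
)


def build_launch_cmd(num_samplers):
    """Return the launch.py argv list; only --num_samplers varies."""
    return [
        "python3", "dgl/tools/launch.py",
        "--workspace", "dgl/examples/distributed/graphsage/",
        "--num_trainers", "2",
        "--num_samplers", str(num_samplers),
        "--num_servers", "1",
        "--part_config", "data/ogbn-products.json",
        "--ip_config", "ip_config.txt",
        "--num_omp_threads", "16",
        TRAIN_CMD,
    ]


if __name__ == "__main__":
    # 0 = reported as working, 1 = failures, 2 = _DeadlockError
    for n in (0, 1, 2):
        print(shlex.join(build_launch_cmd(n)))
```

Each printed line can be pasted into a shell from the directory that contains dgl/ and ip_config.txt; `shlex.join` (Python 3.8+) keeps the training command quoted as one argument.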
Expected behavior
If num_samplers = 0, launch.py runs normally, as follows:
Environment
How you installed DGL (conda, pip, source): pip
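Since the comments suggest the failure depends on the exact dgl and torch combination, it may be useful to attach the installed versions to each report. The snippet below is a small sketch using only the standard library; `collect_versions` is a hypothetical helper, and it assumes the packages are installed under the distribution names `dgl` and `torch` (a source build registered under another name would show up as "not installed").

```python
import platform
from importlib import metadata


def collect_versions(packages=("dgl", "torch")):
    """Map each package name to its installed version string."""
    info = {"python": platform.python_version()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Assumption: a missing distribution means the package
            # is not installed under this name.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Running this in the same environment that executes launch.py avoids reporting versions from a different conda or pip environment.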