dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

Code running for distributed graph training #7543

Open · onepiecewiley opened this issue 1 month ago

onepiecewiley commented 1 month ago

I want to run the distributed GraphSAGE code in the `examples/distributed` directory, but I don't have physical machines, so I used VMware to build three virtual machines as nodes for distributed training. I followed the README to deploy the environment, set up NFS, etc., but running the launch command on node0 (the main node) fails:

```
(fordgl) wiley@wiley-virtual-machine:/home/ubuntu/workspace$ python /home/ubuntu/workspace/dgl/tools/launch.py --workspace /home/ubuntu/workspace/dgl/examples/distributed/graphsage/ --num_trainers 1 --num_samplers 0 --num_servers 1 --part_config data/reddit.json --ip_config ip_config.txt "python3 node_classification.py --graph_name reddit --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
The number of OMP threads per trainer is set to 2
/home/ubuntu/workspace/dgl/tools/launch.py:148: DeprecationWarning: setDaemon() is deprecated, set the daemon attribute instead
  thread.setDaemon(True)
Traceback (most recent call last):
  File "node_classification.py", line 5, in <module>
    import dgl
ModuleNotFoundError: No module named 'dgl'
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 192.168.85.128 'cd /home/ubuntu/workspace/dgl/examples/distributed/graphsage/; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=3 DGL_CONF_PATH=data/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_SERVER_ID=0; python3 node_classification.py --graph_name reddit --ip_config ip_config.txt --num_epochs 30 --batch_size 1000)'' returned non-zero exit status 1.
[... the same traceback and "Called process error" repeat for 192.168.85.130 (DGL_SERVER_ID=1) and 192.168.85.131 (DGL_SERVER_ID=2) ...]
/usr/bin/python3: Error while finding module specification for 'torch.distributed.run' (ModuleNotFoundError: No module named 'torch')
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 192.168.85.128 'cd /home/ubuntu/workspace/dgl/examples/distributed/graphsage/; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=3 DGL_CONF_PATH=data/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=2 DGL_GROUP_ID=0; python3 -m torch.distributed.run --nproc_per_node=1 --nnodes=3 --node_rank=0 --master_addr=192.168.85.128 --master_port=1234 node_classification.py --graph_name reddit --ip_config ip_config.txt --num_epochs 30 --batch_size 1000)'' returned non-zero exit status 1.
[... the same torch error and "Called process error" repeat for 192.168.85.130 (--node_rank=1) and 192.168.85.131 (--node_rank=2) ...]
cleanup process runs
Task failed
```

From a preliminary investigation, the errors say the `dgl` and `torch` packages cannot be found. After the launch command runs, the remote processes use the Python interpreter in `/usr/bin` instead of the conda environment named `fordgl` that I created (which has all the required packages installed). I set the environment variables, but it didn't help: every time the error is reported, the `/usr/bin` interpreter is used rather than the one in the `fordgl` environment. How can I solve this problem?
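
One way to confirm this diagnosis is to run the same kind of non-interactive SSH command that `launch.py` uses and check which interpreter resolves. A minimal sketch, reusing the node IP from the log above; the expected outputs are assumptions about a stock Ubuntu setup where the conda initialization block sits below the interactive-shell guard in `~/.bashrc`:

```bash
# launch.py starts the remote server/trainer processes over non-interactive SSH.
# On a stock Ubuntu ~/.bashrc, the "not running interactively -> return" guard
# at the top runs before the conda initialization block, so the remote PATH
# never gains the fordgl env and plain `python3` resolves to /usr/bin/python3.

# What launch.py's remote processes see (expected: /usr/bin/python3, import fails):
ssh -o StrictHostKeyChecking=no 192.168.85.128 'which python3; python3 -c "import dgl"'

# What a login shell sees for comparison (expected: the conda python, if conda init ran):
ssh -t 192.168.85.128 'bash -lc "which python3"'
```

If the first command prints `/usr/bin/python3` while the second prints a path inside the conda installation, the problem is the shell environment on the remote nodes, not DGL itself.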

onepiecewiley commented 1 month ago

Does the launch.py script use the Python interpreter in /usr/bin by default? How can I solve this?

Rhett-Ying commented 1 month ago

The conda env is not used when launching distributed training. If you want to use a conda env, you could try specifying the conda env's Python explicitly, like `python launch.py … "conda_python node_classification.py xxx"`, where `conda_python` is the full path to the Python binary inside your conda env. I am not sure if it works.
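
Concretely, the suggestion amounts to replacing the bare `python3` inside the quoted training command with the absolute path of the conda env's interpreter, so the SSH-spawned processes import `dgl` and `torch` from that env. A hedged sketch of the original launch command rewritten this way; the env path `/home/wiley/anaconda3/envs/fordgl/bin/python` is an assumption (the real path is whatever `conda run -n fordgl which python` prints, and it must be the same on every node):

```bash
# Point the training command at the fordgl env's interpreter explicitly,
# instead of relying on `python3` resolving correctly over non-interactive SSH.
python /home/ubuntu/workspace/dgl/tools/launch.py \
    --workspace /home/ubuntu/workspace/dgl/examples/distributed/graphsage/ \
    --num_trainers 1 --num_samplers 0 --num_servers 1 \
    --part_config data/reddit.json \
    --ip_config ip_config.txt \
    "/home/wiley/anaconda3/envs/fordgl/bin/python node_classification.py --graph_name reddit --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
```

An alternative workaround sometimes suggested for Ubuntu is to move the conda initialization block above the interactive-shell early-return in `~/.bashrc` on each node, so non-interactive SSH sessions also get the env on PATH; the explicit interpreter path above avoids touching shell configuration at all.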

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you.