Closed tyccc22 closed 6 months ago
@Rhett-Ying do we have DistDGL examples?
Please refer to the non-distributed versions of the GAT/GCN models, such as https://github.com/dmlc/dgl/tree/master/examples/pytorch/gat, to make sure your model is runnable. The model code should be the same in both DistDGL and the non-distributed version.
A better option for running various models with distributed training/inference is GraphStorm, which offers high-level APIs.
Thanks for your advice. Since the "Gloo connectFullMesh failed with..." error is not resolved, I am trying to train some models from https://github.com/dmlc/dgl/tree/master/examples/pytorch/ on 2 machines.
Also, I would like to ask about dataset partitioning. When dividing the dataset with https://github.com/dmlc/dgl/tree/master/examples/pytorch/graphsage/dist/partition_graph.py, the memory size required is several times the size of the dataset. Are there any corresponding optimisations for memory, or are other tools provided?
"Are there any corresponding optimisations for memory, or are other tools provided?"
Unfortunately, there is not much optimization available for the partition stage. dgl.distributed.partition_graph() is the most convenient API available for now. However, we also support partitioning a graph with a distributed pipeline if you have multiple machines with limited CPU RAM. Please refer to here for more details. This partition pipeline requires some additional preprocessing.
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
Hi, I am closing this issue assuming you are happy about our response. Feel free to follow up and reopen the issue if you have more questions with regard to our response.
🐛 Bug
When I run dgl\examples\pytorch\graphsage\dist\train_dist.py on GPUs as described in README.md, it works fine, but after changing the network layers of the model, the following problem occurs:
To Reproduce
Steps to reproduce the behavior:
Change the network layer in dgl\examples\pytorch\graphsage\dist\train_dist.py in the following way
def run(args, device, data):
    ...
    # Define model and optimizer

/home/tyc/anaconda3/envs/gnn/bin/python3 ~/workspace/graphsage/launch.py \
    --workspace ~/workspace/graphsage/ \
    --num_trainers 1 \
    --num_samplers 0 \
    --num_servers 1 \
    --part_config data2-ogb-product/ogb-product.json \
    --ip_config ip_config.txt \
    "/home/tyc/anaconda3/envs/gnn/bin/python3 gat-2-change_model.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 10000 --num_gpus 1 --backend nccl"
Traceback (most recent call last):
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 413, in <module>
    main(args)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 358, in main
    run(args, device, data)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 215, in run
    batch_pred = model(blocks, batch_inputs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 45, in forward
    h = layer(g, h)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/dgl/nn/pytorch/conv/graphconv.py", line 405, in forward
    with graph.local_scope():
AttributeError: 'list' object has no attribute 'local_scope'
Client[3] in group[0] is exiting...
Traceback (most recent call last):
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 413, in <module>
    main(args)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 358, in main
    run(args, device, data)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 215, in run
    batch_pred = model(blocks, batch_inputs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 45, in forward
    h = layer(g, h)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/dgl/nn/pytorch/conv/graphconv.py", line 405, in forward
    with graph.local_scope():
AttributeError: 'list' object has no attribute 'local_scope'
GCN
class DistGCN(nn.Module):
    def __init__(self, in_size, hid_size, out_size):
        super().__init__()
        self.layers = nn.ModuleList()
        # two-layer GCN

def run(args, device, data):
    ...
    # Define model and optimizer

/home/tyc/anaconda3/envs/gnn/bin/python3 ~/workspace/graphsage/launch.py \
    --workspace ~/workspace/graphsage/ \
    --num_trainers 1 \
    --num_samplers 0 \
    --num_servers 1 \
    --part_config data2-ogb-product/ogb-product.json \
    --ip_config ip_config.txt \
    "/home/tyc/anaconda3/envs/gnn/bin/python3 gcn-dist-change_model.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 10000 --num_gpus 1 --backend nccl"
Traceback (most recent call last):
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 413, in <module>
    main(args)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 358, in main
    run(args, device, data)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 215, in run
    batch_pred = model(blocks, batch_inputs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 45, in forward
    h = layer(g, h)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/dgl/nn/pytorch/conv/graphconv.py", line 405, in forward
    with graph.local_scope():
AttributeError: 'list' object has no attribute 'local_scope'
Client[3] in group[0] is exiting...
Traceback (most recent call last):
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 413, in <module>
    main(args)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 358, in main
    run(args, device, data)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 215, in run
    batch_pred = model(blocks, batch_inputs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 45, in forward
    h = layer(g, h)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/dgl/nn/pytorch/conv/graphconv.py", line 405, in forward
    with graph.local_scope():
AttributeError: 'list' object has no attribute 'local_scope'
Client[0] in group[0] is exiting...
AttributeError: 'list' object has no attribute 'local_scope'
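The AttributeError is raised at `h = layer(g, h)` with a `list` in place of a graph. One plausible reading, not confirmed anywhere in this thread, is that the modified forward() hands the whole list of sampled blocks to every layer instead of one block per layer. The sketch below reproduces that pattern without DGL or PyTorch. `FakeBlock`, `FakeConv`, `TwoLayerModel`, `forward_buggy` and `forward_fixed` are hypothetical stand-ins invented for illustration; they are not part of DGL or of the reporter's script.

```python
from contextlib import contextmanager


class FakeBlock:
    """Hypothetical stand-in for one sampled block; not a real DGL class."""

    def __init__(self, name):
        self.name = name

    @contextmanager
    def local_scope(self):
        # Graph-like objects expose local_scope(); a plain Python list does not.
        yield self


class FakeConv:
    """Hypothetical stand-in for a graph convolution layer."""

    def __call__(self, graph, h):
        # Mirrors the failing line in the traceback: `with graph.local_scope():`
        with graph.local_scope():
            return h + [graph.name]  # record which block this layer consumed


class TwoLayerModel:
    def __init__(self):
        self.layers = [FakeConv(), FakeConv()]

    def forward_buggy(self, blocks, h):
        for layer in self.layers:
            h = layer(blocks, h)  # assumption: the whole list is passed to each layer
        return h

    def forward_fixed(self, blocks, h):
        for layer, block in zip(self.layers, blocks):
            h = layer(block, h)  # each layer receives its own block
        return h


blocks = [FakeBlock("block0"), FakeBlock("block1")]
model = TwoLayerModel()

try:
    model.forward_buggy(blocks, [])
    buggy_error = None
except AttributeError as err:
    buggy_error = str(err)

fixed_output = model.forward_fixed(blocks, [])
print(buggy_error)
print(fixed_output)
```

If the real model follows the same shape, pairing layers with blocks via `zip(self.layers, blocks)` in forward() would be the corresponding change. Whether that matches the actual code at line 45 of gcn-dist-change_model.py cannot be verified from the truncated snippet above.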
ModuleNotFoundError: No module named 'numpy'
ModuleNotFoundError: No module named 'dgl'
[E ProcessGroupGloo.cpp:138] Gloo connectFullMesh failed with [/opt/conda/conda-bld/pytorch_1695392035629/work/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error