tyccc22 commented 7 months ago

🐛 Bug

When I run dgl\examples\pytorch\graphsage\dist\train_dist.py on GPUs as described in README.md, it works fine. After I change the network layers of the model, the following error occurs:

AttributeError: 'list' object has no attribute 'local_scope'

To Reproduce

Steps to reproduce the behavior:

The model trains correctly with the following command. The code in the workspace is copied from dgl\examples\pytorch\graphsage\dist\ .

/home/tyc/anaconda3/envs/gnn/bin/python3 ~/workspace/graphsage/launch.py \
--workspace ~/workspace/graphsage/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data2-ogb-product/ogb-product.json \
--ip_config ip_config.txt \
"/home/tyc/anaconda3/envs/gnn/bin/python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --num_gpus 1 --backend nccl"

Change the network layers in dgl\examples\pytorch\graphsage\dist\train_dist.py as follows:


# GAT
class DistGAT(nn.Module):
    def __init__(
        self, in_feats, n_hidden, n_classes, heads
        # n_layers, activation, dropout
    ):
        super().__init__()
        self.gat_layers = nn.ModuleList()
        # two-layer GAT
        self.gat_layers.append(
            dglnn.GATConv(
                in_feats,
                n_hidden,
                heads[0],
                feat_drop=0.6,
                attn_drop=0.6,
                activation=F.elu,
            )
        )
        self.gat_layers.append(
            dglnn.GATConv(
                in_feats * heads[0],
                n_classes,
                heads[1],
                feat_drop=0.6,
                attn_drop=0.6,
                activation=None,
            )
        )

    def forward(self, g, inputs):
        h = inputs
        for i, layer in enumerate(self.gat_layers):
            h = layer(g, h)
            if i == 1:  # last layer
                h = h.mean(1)
            else:  # other layer(s)
                h = h.flatten(1)
        return h

def run(args, device, data):
    ...
    # Define model and optimizer
    model = DistGAT(
        in_feats,
        args.num_hidden,
        n_classes,
        heads=[8, 1]
    )
    #     args.num_layers,
    #     F.relu,
    #     args.dropout,
    # )
    ...
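
A side note on the GAT snippet above, separate from the `local_scope` error: `GATConv` returns one output vector per head, so after `h.flatten(1)` the first layer yields `n_hidden * heads[0]` features per node, while the second layer above is declared with `in_feats * heads[0]`. The sketch below only does the shape arithmetic in plain Python. The helper name `gat_layer_widths` and the example sizes are made up for illustration and are not taken from the issue.

```python
def gat_layer_widths(in_feats, n_hidden, n_classes, heads):
    """Return (input_width, output_width) for each layer of a two-layer GAT.

    Assumes each layer outputs (num_nodes, num_heads, out_feats), that hidden
    layers are flattened over the head dimension, and that the last layer is
    averaged over heads, as in the forward() shown above.
    """
    widths = []
    width = in_feats
    sizes = [n_hidden, n_classes]
    for i, (out_feats, num_heads) in enumerate(zip(sizes, heads)):
        is_last = i == len(sizes) - 1
        # Hidden layers concatenate heads; the last layer averages them.
        out_width = out_feats if is_last else out_feats * num_heads
        widths.append((width, out_width))
        width = out_width
    return widths


# Hypothetical sizes, chosen only for illustration.
widths = gat_layer_widths(in_feats=100, n_hidden=16, n_classes=47, heads=[8, 1])
print(widths)
```

With these example sizes the second layer needs an input width of 128 (`n_hidden * heads[0]`), not 800 (`in_feats * heads[0]`), so a shape mismatch would probably surface once the `local_scope` error is out of the way.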

Execute:

/home/tyc/anaconda3/envs/gnn/bin/python3 ~/workspace/graphsage/launch.py \
--workspace ~/workspace/graphsage/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data2-ogb-product/ogb-product.json \
--ip_config ip_config.txt \
"/home/tyc/anaconda3/envs/gnn/bin/python3 gat-2-change_model.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 10000 --num_gpus 1 --backend nccl"

The cluster starts as expected, and then the following error occurs:

Traceback (most recent call last):
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 413, in <module>
    main(args)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 358, in main
    run(args, device, data)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 215, in run
    batch_pred = model(blocks, batch_inputs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 45, in forward
    h = layer(g, h)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/dgl/nn/pytorch/conv/graphconv.py", line 405, in forward
    with graph.local_scope():
AttributeError: 'list' object has no attribute 'local_scope'
Client[3] in group[0] is exiting...

Traceback (most recent call last):
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 413, in <module>
    main(args)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 358, in main
    run(args, device, data)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 215, in run
    batch_pred = model(blocks, batch_inputs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 45, in forward
    h = layer(g, h)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/dgl/nn/pytorch/conv/graphconv.py", line 405, in forward
    with graph.local_scope():
AttributeError: 'list' object has no attribute 'local_scope'

GCN is probably closer to GraphSAGE. Making the same kind of change for GCN produces the same error:

# GCN
class DistGCN(nn.Module):
    def __init__(self, in_size, hid_size, out_size):
        super().__init__()
        self.layers = nn.ModuleList()
        # two-layer GCN
        self.layers.append(
            dglnn.GraphConv(in_size, hid_size, activation=F.relu)
        )
        self.layers.append(dglnn.GraphConv(hid_size, out_size))
        self.dropout = nn.Dropout(0.5)

    def forward(self, g, features):
        h = features
        for i, layer in enumerate(self.layers):
            if i != 0:
                h = self.dropout(h)
            h = layer(g, h)
        return h

def run(args, device, data):
    ...
    # Define model and optimizer
    # model = GCN(
    #     in_feats,
    #     args.num_hidden,
    #     n_classes,
    #     args.num_layers,
    #     F.relu,
    #     args.dropout,
    # )
    model = DistGCN(in_feats, 16, n_classes).to(device)
    ...

Execute:

/home/tyc/anaconda3/envs/gnn/bin/python3 ~/workspace/graphsage/launch.py \
--workspace ~/workspace/graphsage/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data2-ogb-product/ogb-product.json \
--ip_config ip_config.txt \
"/home/tyc/anaconda3/envs/gnn/bin/python3 gcn-dist-change_model.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 10000 --num_gpus 1 --backend nccl"

The output is:

Traceback (most recent call last):
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 413, in <module>
    main(args)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 358, in main
    run(args, device, data)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 215, in run
    batch_pred = model(blocks, batch_inputs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 45, in forward
    h = layer(g, h)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/dgl/nn/pytorch/conv/graphconv.py", line 405, in forward
    with graph.local_scope():
AttributeError: 'list' object has no attribute 'local_scope'
Client[3] in group[0] is exiting...

Traceback (most recent call last):
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 413, in <module>
    main(args)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 358, in main
    run(args, device, data)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 215, in run
    batch_pred = model(blocks, batch_inputs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 45, in forward
    h = layer(g, h)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/dgl/nn/pytorch/conv/graphconv.py", line 405, in forward
    with graph.local_scope():
AttributeError: 'list' object has no attribute 'local_scope'
Client[0] in group[0] is exiting...



## Expected behavior

Distributed training should also work for other models, e.g. GAT, GCN, and GIN.

## Environment

 - DGL Version (e.g., 1.0): DGL 2.1.0
 - Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): pytorch 2.1.0
 - OS (e.g., Linux): ubuntu 20.04
 - How you installed DGL (`conda`, `pip`, source): conda
 - Build command you used (if compiling from source): 
 - Python version: Python 3.9.18
 - CUDA/cuDNN version (if applicable): cuda_12.1.0_530.30.02_linux
 - GPU models and configuration (e.g. V100): The graphics card on one machine is a GeForce RTX 2060 SUPER and the graphics card on the other machine is a GeForce GTX 1660 SUPER.
 - Any other relevant information:  I train the above models on a local cluster consisting of two computers and have not migrated them to the cloud yet. There is a different graphics card on each of these two computers.

## Additional context

  After reviewing the documentation on docs.dgl.ai, I am still unclear on how to resolve the following error:

AttributeError: 'list' object has no attribute 'local_scope'

  The code in the dgl/examples/pytorch/graphsage/dist directory is quite enlightening, and I am interested in extending it to other models. Any guidance you could offer would be greatly appreciated.
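
A likely reading of the traceback: the training loop calls `model(blocks, batch_inputs)`, where `blocks` is a list, and the modified `forward` hands that whole list to every layer, which then calls `graph.local_scope()` on it. The sketch below reproduces the mechanism with plain-Python stand-ins so that it runs without DGL. `FakeBlock`, `FakeLayer`, `BrokenModel`, and `FixedModel` are hypothetical names, not DGL classes, and the per-layer pairing is an assumption about how the list is meant to be consumed, not something confirmed in this thread.

```python
from contextlib import contextmanager


class FakeBlock:
    """Stand-in for one sampled block (hypothetical, not a DGL class)."""

    def __init__(self, name):
        self.name = name

    @contextmanager
    def local_scope(self):
        yield self


class FakeLayer:
    """Stand-in for a graph conv layer that expects a single graph object."""

    def __call__(self, graph, h):
        with graph.local_scope():  # raises AttributeError if `graph` is a list
            return h + [graph.name]


class BrokenModel:
    def __init__(self, n_layers):
        self.layers = [FakeLayer() for _ in range(n_layers)]

    def forward(self, g, h):
        for layer in self.layers:
            h = layer(g, h)  # `g` is the whole list of blocks here
        return h


class FixedModel(BrokenModel):
    def forward(self, blocks, h):
        # Pair each layer with its own block instead of passing the list.
        for layer, block in zip(self.layers, blocks):
            h = layer(block, h)
        return h


blocks = [FakeBlock("block0"), FakeBlock("block1")]

try:
    BrokenModel(2).forward(blocks, [])
    error = None
except AttributeError as exc:
    error = str(exc)
print(error)

visited = FixedModel(2).forward(blocks, [])
print(visited)
```

Applied to the models above, this would mean iterating with something like `zip(self.gat_layers, blocks)` in `forward`, and keeping the number of layers equal to the number of blocks the sampler produces.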


  The command that runs the training has a few more parameters and paths than the command in README.md because of the following problems:
1. Probably because I installed DGL in a conda virtual environment, if I don't give the full path to python3, I get

ModuleNotFoundError: No module named 'numpy'

or

ModuleNotFoundError: No module named 'dgl'

2. If I use the default "--backend" value, gloo, I get

[E ProcessGroupGloo.cpp:138] Gloo connectFullMesh failed with [/opt/conda/conda-bld/pytorch_1695392035629/work/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error


  I have no idea how to solve this.
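
For the first point (`ModuleNotFoundError`), one way to confirm which interpreter a launched process actually uses is to print `sys.executable` and check whether the packages can be located from it. This is a minimal diagnostic sketch using only the standard library. The helper name `missing_modules` is made up here, and it only checks whether a module can be found, without importing it.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("interpreter:", sys.executable)
print("missing:", missing_modules(["numpy", "torch", "dgl"]))
```

If the printed interpreter is not the one inside the conda environment, that would be consistent with needing the full python3 path in the launch command.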

  Once again, thank you for your exceptional work!

BarclayII commented 7 months ago

@Rhett-Ying do we have DistDGL examples?

Rhett-Ying commented 7 months ago

Please refer to the non-distributed versions of the GAT/GCN models, such as https://github.com/dmlc/dgl/tree/master/examples/pytorch/gat, to make sure the model is runnable. The model code should be the same in DistDGL and in non-distributed training.

A better option for running various models with distributed training/inference is GraphStorm, which offers high-level APIs.

tyccc22 commented 7 months ago

Thanks for your advice. Since the "Gloo connectFullMesh failed with..." error is not resolved, I am trying to train some models from https://github.com/dmlc/dgl/tree/master/examples/pytorch/ on 2 machines.
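
The Gloo `connectFullMesh` failure is not diagnosed in this thread. One thing that may be worth ruling out on a two-machine setup is a hostname that resolves to a loopback address, which the other machine cannot reach. The check below uses only the standard library. `is_loopback` is a hypothetical helper, and this is a guess at a possible cause, not a confirmed fix.

```python
import ipaddress
import socket


def is_loopback(address):
    """Return True if the given IP address string is a loopback address."""
    return ipaddress.ip_address(address).is_loopback


hostname = socket.gethostname()
try:
    resolved = socket.gethostbyname(hostname)
except OSError:
    resolved = None

if resolved is None:
    print(f"{hostname}: could not be resolved")
else:
    kind = "loopback" if is_loopback(resolved) else "non-loopback"
    print(f"{hostname} -> {resolved} ({kind})")
```

If a machine reports a loopback address, PyTorch's `GLOO_SOCKET_IFNAME` environment variable can be used to choose the network interface Gloo uses.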

Also, I would like to ask about dataset partitioning. When dividing the dataset with https://github.com/dmlc/dgl/tree/master/examples/pytorch/graphsage/dist/partition_graph.py, the memory size required is several times the size of the dataset. Are there any corresponding optimisations for memory, or are other tools provided?

Rhett-Ying commented 7 months ago

> Are there any corresponding optimisations for memory, or are other tools provided?

Unfortunately, there is not much optimization available for the partition stage. dgl.distributed.partition_graph() is the most convenient API available for now. We also support partitioning a graph with a distributed pipeline if you have multiple machines with small CPU RAM. Please see here for more details. This partition pipeline requires some additional preprocessing.

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you.

frozenbugs commented 6 months ago

Hi, I am closing this issue assuming you are happy about our response. Feel free to follow up and reopen the issue if you have more questions with regard to our response.

dmlc / dgl

AttributeError: 'list' object has no attribute 'local_scope' #7292
