dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

[DistDGL] Program hangs/crashes sometimes due to unknown reason #3881

Closed Rhett-Ying closed 2 years ago

Rhett-Ying commented 2 years ago

🐛 Bug

Recently, we have received reports that the program hangs at the very beginning of distributed training. I found it is more likely to happen when many trainers are booted and a Cython-enabled DGL build is used (Cython is enabled in the packages from dgl.ai). I checked the call stacks of the clients and servers, and all of them appear to be waiting for messages from the opposite peer. It looks as if some messages are lost: though messages have been sent out from the clients, they never arrive at the server.

For now, I have a workaround (WAR) for this: https://github.com/dmlc/dgl/pull/3867. It fixes the hang issue (at least in my tests so far), though the root cause has not been found yet. The workaround sends messages blockingly, which may slow down message sending, so some performance degradation could happen.

I will continue working on this until the root cause is found.
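
To make "send blockingly" concrete, here is a conceptual sketch of the difference. The pipe.write(msg, callback=...) API is hypothetical and only stands in for the real TensorPipe transport; this is not DGL's RPC code:

import threading

def send_async(pipe, msg, on_done):
    # Fire-and-forget: return as soon as the message is queued; on_done fires
    # later when the transport has actually written it out.
    pipe.write(msg, callback=on_done)

def send_blocking(pipe, msg):
    # The workaround's behaviour: wait for the transport's completion callback
    # before returning, trading throughput for not losing track of messages.
    done = threading.Event()
    pipe.write(msg, callback=lambda *args: done.set())
    done.wait()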

To Reproduce

//tools/launch.py --num_trainers 16 .... //examples/pytorch/graphsage/dist/train_dist.py

Expected behavior

Environment

Additional context

Rhett-Ying commented 2 years ago

I find the app hangs even with sending TensorPipe messages blockingly (https://github.com/dmlc/dgl/pull/3867) and with model.join() (https://github.com/dmlc/dgl/pull/3870).

self-built DGL, synced on March 30, 2022
//example/pytorch/graphsage/dist/train_dist.py
ogb-products
num_parts=2
num_batches is equal across trainers
  1. --num_trainers 8 --num_samplers 0 --num_servers 1 --batch_size 256 OK
  2. --num_trainers 4 --num_samplers 4 --num_servers 4 --batch_size 256 HANG; the hang point differs in each run (epoch 0, epoch 2, seemingly always in the first few epochs), but it always hangs.
  3. --num_trainers 4 --num_samplers 0 --num_servers 4 --batch_size 256 OK
  4. --num_trainers 1 --num_samplers 4 --num_servers 1 --batch_size 256 HANG sometimes

According to the tests above, num_samplers > 0 seems to be to blame.
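
For reference, the per-process layout these flags drive looks roughly like this sketch of a train_dist.py-style script (based on the distributed GraphSAGE example; the feature/mask key names and the exact dataloader API differ across DGL versions):

import dgl
import torch as th

def run(args):
    # One copy of this runs per trainer (--num_trainers); each trainer forks
    # --num_samplers sampler worker processes, and --num_servers controls the
    # server processes holding each graph partition.
    dgl.distributed.initialize(args.ip_config)
    th.distributed.init_process_group(backend='gloo')
    g = dgl.distributed.DistGraph(args.graph_name, part_config=args.part_config)
    # node_split issues requests to the servers (the kind of traffic that
    # gets stuck when the hang occurs).
    train_nid = dgl.distributed.node_split(g.ndata['train_mask'])
    # ... build the dataloader and model here (API differs across versions)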

Rhett-Ying commented 2 years ago

@zheng-da reported the issue below. In this case, the number of batches (num_batch) differs across trainers.

Part 0 | Epoch 00000 | Batch 000 | Train Acc: 0.0000 | Train Loss (ALL|GNN): 7.6276|7.6276 | Time: 4.3357
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at ../third_party/gloo/gloo/transport/tcp/pair.cc:510] op.preamble.length <= op.nbytes. 24640 vs 4
Traceback (most recent call last):
  File "/home/dzzhen/.local/lib/python3.9/site-packages/torch/distributed/algorithms/join.py", line 274, in __exit__
  File "/home/dzzhen/m5-gnn/python/m5gnn/model/rgcn_node_base.py", line 179, in fit
    loss.backward()
  File "/home/dzzhen/.local/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
    join_hook.main_hook()
  File "/home/dzzhen/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 193, in main_hook
    ddp._match_all_reduce_for_bwd_pass()
  File "/home/dzzhen/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1070, in _match_all_reduce_for_bwd_pass
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/dzzhen/.local/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:589] Read error [10.2.23.254]:1439: Connection reset by peer
    Variable._execution_engine.run_backward(
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:589] Read error [10.2.23.254]:49550: Connection reset by peer
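
The mismatched all-reduce above is the failure mode that PyTorch's join() context manager is meant to absorb when trainers see different numbers of batches. A minimal sketch of that pattern (illustrative only, assuming a standard DistributedDataParallel model; the batch fields are placeholders, and this is not the exact change in https://github.com/dmlc/dgl/pull/3870):

from torch.nn.parallel import DistributedDataParallel as DDP

def train_one_epoch(model: DDP, dataloader, loss_fn, optimizer):
    # join() lets trainers that exhaust their batches early shadow the
    # collective calls of trainers still iterating, instead of letting the
    # gloo all-reduce go out of sync as in the error above.
    with model.join():
        for batch in dataloader:
            loss = loss_fn(model(batch.inputs), batch.labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
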
zheng-da commented 2 years ago

I get exactly the same error. This problem needs to be fixed.

Rhett-Ying commented 2 years ago

I have not reproduced it myself. I just copied it from your comments.

ilil96 commented 2 years ago

For now, I have a workaround (WAR) for this: #3867. It fixes the hang issue (at least in my tests so far), though the root cause has not been found yet. The workaround sends messages blockingly, which may slow down message sending, so some performance degradation could happen.

I will continue working on this until the root cause is found.

Doesn't this mean that the latest nightly build should not have this problem, albeit with some slowdown? I'm running distributed training with DGL-0.9a220408, but it still hangs when I set both num_trainers and num_samplers greater than 0.

Rhett-Ying commented 2 years ago

No, the hang issue is not fixed yet. It still hangs sometimes.

VoVAllen commented 2 years ago

@ilil96 What's your Python version? Some users report that changing from Python 3.8 to 3.9 solves the problem.

ilil96 commented 2 years ago

@VoVAllen I'm using 3.7. I will try upgrading it to 3.9.

ilil96 commented 2 years ago

@VoVAllen Upgrading Python does not work for me.

Rhett-Ying commented 2 years ago

@ilil96 Could you try appending --num_omp_threads 1 to launch.py? What numbers did you set for num_trainers, num_samplers, and num_servers? And how many CPU cores does each of your machines have?

ilil96 commented 2 years ago

@Rhett-Ying I'm testing the code on two 8-core machines, each equipped with two GPUs, with num_trainers=2, num_samplers=2, num_servers=1. I tried appending --num_omp_threads 1. It runs longer than before, but it still hangs at some point.

Ravsehajsinghpuri commented 2 years ago

Hi, I have been working on a project for quite a while and am facing the same set of issues people have raised above and in other issue threads. The system hangs randomly, and the issue seems to be a network communication bug. I tried version v0.8.0post2 but the problem still persists. Is there any workaround to fix this temporarily?

zheng-da commented 2 years ago

Can you move back to DGL 0.7? Do you require any specific features in DGL 0.8?

Ravsehajsinghpuri commented 2 years ago

@zheng-da Thanks for the suggestion; it worked for me, although I don't know why there is an issue in the recent versions. I hope the communication bug is resolved in future versions.

zheng-da commented 2 years ago

We are looking at the problem and will try to fix it as soon as possible.

Rhett-Ying commented 2 years ago

@Ravsehajsinghpuri Could you share the complete launch command you're using?

Ravsehajsinghpuri commented 2 years ago

/home/diml_2022/anaconda3/envs/diml_2022/bin/python /home/diml_2022/dgl-master/tools/launch.py \
    --workspace /home/diml_2022/workspace_4 \
    --num_trainers 1 \
    --num_samplers 1 \
    --num_servers 1 \
    --num_omp_threads 1 \
    --part_config /home/diml_2022/workspace_4/partitioned_graphs_1M/partitioned_graph_1M.json \
    --ip_config /home/diml_2022/workspace_4/ip_config.txt \
    "/home/diml_2022/anaconda3/envs/diml_2022/bin/python training_scripts/train_dist.py --graph_name 'partitioned_graph_1M' --ip_config ~/workspace_4/ip_config.txt --part_config ~/workspace_4/partitioned_graphs_1M/partitioned_graph_1M.json --batch_size_eval 1000 --num_epochs 2 --eval_every 2"

Rhett-Ying commented 2 years ago

@Ravsehajsinghpuri Could you try with --num_samplers 0 to see if it still hangs?

BarclayII commented 2 years ago

@ilil96 Which example are you running? Also, is it related to the dataset you are using (i.e. does it hang with OGB-products as well, or does it only hang with your dataset?)

Rhett-Ying commented 2 years ago

@ilil96 @Ravsehajsinghpuri @zheng-da Could you try my PR (patch it onto the master branch and build from source; instructions are here: https://docs.dgl.ai/install/index.html#install-from-source) to see if it works for your cases? All you need to do is enable the socket RPC backend in your train_dist.py via:

dgl.distributed.initialize(args.ip_config, net_type='socket')

Rhett-Ying commented 2 years ago

The socket net_type is enabled in the latest nightly build.

DelightRun commented 2 years ago

I'm also facing the hang problem during GraphSAGE training, but with some differences from @Rhett-Ying's:

  1. The hang occurs even with num_samplers = 0.
  2. dgl.initialize doesn't cause a hang, but remote data access does, such as 'CountNonZeroRequest' and 'PullRequest' (see the sketch after this list).
  3. The hang usually occurs during mask split calls and batch data fetching.
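
To make item 2 concrete, the kind of remote access involved looks roughly like this (a sketch; the 'features' key name depends on the dataset, and the request names are internal to DGL's distributed RPC layer):

import dgl
import torch as th

def examples_of_remote_access(g: dgl.distributed.DistGraph, input_nodes: th.Tensor):
    # Splitting the boolean training mask across trainers queries the servers
    # (the "mask split" in item 3).
    train_nid = dgl.distributed.node_split(g.ndata['train_mask'])
    # Indexing distributed node features pulls values from whichever servers
    # hold those partitions (the "batch data fetching" in item 3).
    feats = g.ndata['features'][input_nodes]
    return train_nid, feats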

I used GDB with the Python debugging extensions to trace the call stack; it seems the program hangs at TPReceiver->Recv (screenshot attached).

Any ideas?

Rhett-Ying commented 2 years ago

@DelightRun Which DGL version are you using? Could you try the latest nightly build and enable the socket backend via dgl.distributed.initialize(args.ip_config, net_type='socket')?

DelightRun commented 2 years ago

@Rhett-Ying I've tried version 0.9a20220506 from https://data.dgl.ai/wheels-test/repo.html and the program still gets stuck. However, after switching to my manually built version (built from the master branch), the training program runs smoothly. Isn't the nightly version on https://data.dgl.ai/wheels-test/repo.html built from the master branch?

Rhett-Ying commented 2 years ago

@DelightRun When running with the nightly build or your own build, did you enable the socket backend?

The nightly build is built from the master branch, but slight differences may exist between the nightly build and the one you built from source.

DelightRun commented 2 years ago

@Rhett-Ying It turns out it's not a problem with the backends but with the build. When using my manually built version, both backends (tensorpipe and socket) work. But when using the official nightly build, the program always hangs with both backends. I'm not sure what the precise problem is; maybe my environment just isn't compatible with the official release.

Rhett-Ying commented 2 years ago

@DelightRun This is not expected. Could you share your environment (such as the machine and the latest DGL commit you built from)? What nightly DGL version did you use? And is it possible to share a minimal repro, or are you just running the examples in DGL?

jermainewang commented 2 years ago

We have identified the root cause: Cython does not release the Python GIL when invoking a C API, which can cause a deadlock in some corner cases in distributed training. A patch has been submitted as a PR and will be merged soon.
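
For context, the usual fix for this class of bug is to release the GIL around blocking C calls in the Cython bindings. A generic Cython sketch (illustrative only, with a hypothetical header and function name; this is not DGL's actual FFI code):

cdef extern from "some_c_api.h":
    int blocking_c_call() nogil   # hypothetical blocking C function

def call_from_python():
    cdef int ret
    with nogil:                   # release the GIL so other Python threads,
        ret = blocking_c_call()   # e.g. the RPC receiver, can make progress
    return ret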

DelightRun commented 2 years ago

@jermainewang Awesome! But why doesn't the issue occur when using the manually compiled library? Is this GIL problem related to the Cython version?

Rhett-Ying commented 2 years ago

@DelightRun Yes. If you run python3 setup.py install under //dgl/python/, Cython is actually picked up. Otherwise, ctypes is used, which releases the GIL automatically when calling C functions.
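
To illustrate the last point, standard ctypes calls drop the GIL around the foreign call, so other Python threads keep running while the C code blocks. A small standalone sketch, unrelated to DGL:

import ctypes
import ctypes.util
import threading
import time

libc = ctypes.CDLL(ctypes.util.find_library("c"))
finished = {}

def background():
    finished["bg"] = time.monotonic()

start = time.monotonic()
t = threading.Thread(target=background)
t.start()
libc.sleep(2)   # blocking C call; ctypes releases the GIL around it
t.join()

# The background thread finished long before the 2-second C call returned,
# which would not happen if the GIL were held for the whole call.
print(finished["bg"] - start < 1.0)   # True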