Closed: Rhett-Ying closed this issue 2 years ago.
I find the app hangs even when sending TensorPipe messages blockingly
(https://github.com/dmlc/dgl/pull/3867) and with model.join()
(https://github.com/dmlc/dgl/pull/3870).
Setup: self-built DGL, synced on March 30, 2022
Script: //example/pytorch/graphsage/dist/train_dist.py
Dataset: ogb-products, num_parts=2, num_batches are equal across trainers

--num_trainers 8 --num_samplers 0 --num_servers 1 --batch_size 256: OK
--num_trainers 4 --num_samplers 4 --num_servers 4 --batch_size 256: HANG. The hang point differs in each run (epoch_0, epoch_2, seemingly always in the first epochs), but it always hangs.
--num_trainers 4 --num_samplers 0 --num_servers 4 --batch_size 256: OK
--num_trainers 1 --num_samplers 4 --num_servers 1 --batch_size 256: HANG sometimes

According to these tests, num_samplers > 0 seems to be to blame.
@zheng-da reported the issue below. In this case, num_batch differs across trainers.
Part 0 | Epoch 00000 | Batch 000 | Train Acc: 0.0000 | Train Loss (ALL|GNN): 7.6276|7.6276 | Time: 4.3357
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at ../third_party/gloo/gloo/transport/tcp/pair.cc:510] op.preamble.length <= op.nbytes. 24640 vs 4
Traceback (most recent call last):
  File "/home/dzzhen/.local/lib/python3.9/site-packages/torch/distributed/algorithms/join.py", line 274, in __exit__
    join_hook.main_hook()
  File "/home/dzzhen/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 193, in main_hook
    ddp._match_all_reduce_for_bwd_pass()
  File "/home/dzzhen/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1070, in _match_all_reduce_for_bwd_pass
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:589] Read error [10.2.23.254]:1439: Connection reset by peer

Traceback (most recent call last):
  File "/home/dzzhen/m5-gnn/python/m5gnn/model/rgcn_node_base.py", line 179, in fit
    loss.backward()
  File "/home/dzzhen/.local/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/dzzhen/.local/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:589] Read error [10.2.23.254]:49550: Connection reset by peer
I get exactly the same error. This problem needs to be fixed.
I have not reproduced it myself; I just copied it from your comments.
For now, I have a workaround (WAR) for this: #3867. It fixes the hang issue (at least in my several tests), though the root cause has not been found yet. This WAR sends messages blockingly, which may slow down message sending, so some performance degradation could occur.
I will continue working on this until the root cause is found.
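As a generic illustration (not the actual DGL/TensorPipe code), the blocking-send workaround amounts to waiting until each message is consumed before the send returns. The queue-based sketch below uses hypothetical names:

```python
import queue
import threading

# Hypothetical sketch of a blocking send: the sender waits until the
# receiver has fully consumed each message before continuing.
outbox = queue.Queue(maxsize=1)  # tiny buffer so back-pressure is visible
received = []

def receiver():
    while True:
        msg = outbox.get()
        if msg is None:  # sentinel: shut down
            outbox.task_done()
            break
        received.append(msg)
        outbox.task_done()

worker = threading.Thread(target=receiver)
worker.start()

def send_blocking(msg):
    outbox.put(msg)  # blocks if the one-slot buffer is full
    outbox.join()    # returns only after the receiver calls task_done()

for i in range(5):
    send_blocking(i)
outbox.put(None)
worker.join()
print(received)  # → [0, 1, 2, 3, 4]
```

The trade-off mirrors the one described above: the sender can no longer race ahead of the receiver, which avoids lost or reordered messages at the cost of send throughput.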
Doesn't that mean the latest nightly build does not have this problem, albeit with some slowdown? I'm running distributed training with DGL-0.9a220408, but it still hangs when I set both num_trainers and num_samplers greater than 0.
No, the hang issue is not fixed yet. It still hangs sometimes.
@ilil96 What's your Python version? Some users report that changing from Python 3.8 to 3.9 solves the problem.
@VoVAllen I'm using 3.7. I will try upgrading it to 3.9.
@VoVAllen Upgrading python does not work for me.
@ilil96 Could you try appending --num_omp_threads 1 to your launch.py command?
What numbers did you set for num_trainers, num_samplers, and num_servers? And how many CPU cores does each of your machines have?
@Rhett-Ying I'm testing the code with two 8-core machines, each equipped with two GPUs. num_trainers=2, num_samplers=2, num_servers=1. I tried appending --num_omp_threads 1. It runs longer than before, but it still hangs at some point.
Hi, I have been working on a project for quite a while and am facing the same set of issues people have raised above and in other issue threads. The system hangs randomly, and the cause seems to be a network communication bug. I tried version v0.8.0post2, but the problem still persists. Is there any workaround to fix this temporarily?
Can you move back to DGL 0.7? Do you require any specific features of DGL 0.8?
@zheng-da Thanks for the suggestion, it worked for me although I don't know why there is an issue in the recent versions. I hope they resolve the communication bug in future versions.
We are looking at the problem and will try to fix it as soon as possible.
@Ravsehajsinghpuri could you share the complete launch command you're using?
/home/diml_2022/anaconda3/envs/diml_2022/bin/python /home/diml_2022/dgl-master/tools/launch.py \
  --workspace /home/diml_2022/workspace_4 \
  --num_trainers 1 \
  --num_samplers 1 \
  --num_servers 1 \
  --num_omp_threads 1 \
  --part_config /home/diml_2022/workspace_4/partitioned_graphs_1M/partitioned_graph_1M.json \
  --ip_config /home/diml_2022/workspace_4/ip_config.txt \
  "/home/diml_2022/anaconda3/envs/diml_2022/bin/python training_scripts/train_dist.py --graph_name 'partitioned_graph_1M' --ip_config ~/workspace_4/ip_config.txt --part_config ~/workspace_4/partitioned_graphs_1M/partitioned_graph_1M.json --batch_size_eval 1000 --num_epochs 2 --eval_every 2"
@Ravsehajsinghpuri Could you try with --num_samplers 0 to see if it still hangs?
@ilil96 Which example are you running? Also, is it related to the dataset you are using (i.e. does it hang with OGB-products as well, or does it only hang with your dataset?)
@ilil96 @Ravsehajsinghpuri @zheng-da Could you try with my PR (patch it onto the master branch and build from source; instructions here: https://docs.dgl.ai/install/index.html#install-from-source) to see if it works for your cases? All you need to do is enable the socket RPC backend in your train_dist.py via:
dgl.distributed.initialize(args.ip_config, net_type='socket')
The socket net_type is enabled in the latest nightly build.
I'm also facing the hang problem during GraphSAGE training, but with some differences from @Rhett-Ying's case: I use num_samplers = 0. dgl.distributed.initialize doesn't cause a hang, but remote data access does, such as 'CountNonZeroRequest' and 'PullRequest'. I used GDB with Python debug support to trace the call stack, and the program seems to hang at TPReceiver->Recv.
Any ideas?
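For reference, the kind of call-stack inspection described above can be done by attaching gdb to the hung trainer process; `py-bt` comes from CPython's gdb extensions (the python-gdb.py helpers shipped with CPython) and needs interpreter debug symbols:

```
$ gdb -p <trainer_pid>
(gdb) thread apply all bt    # native C/C++ stacks of every thread
(gdb) py-bt                  # Python-level stack of the current thread
```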
@DelightRun Which DGL version are you using? Could you try the latest nightly build and enable the socket backend via dgl.distributed.initialize(args.ip_config, net_type='socket')?
@Rhett-Ying I've tried version 0.9a20220506 from https://data.dgl.ai/wheels-test/repo.html and the program still got stuck. However, after switching to my manually built version (from the master branch), the training program runs smoothly. Isn't the nightly version on https://data.dgl.ai/wheels-test/repo.html built from the master branch?
@DelightRun When running with the nightly build or your own build, did you enable the socket backend?
The nightly build is built from the master branch, but a slight difference may exist between the nightly build and the one you built from source.
@Rhett-Ying It turns out it's not a problem with the backends but with the build version. When using my manually built version, both backends (tensorpipe and socket) work. But when using the official nightly build release, the program always hangs with both backends.
I'm not sure what the precise problem is; maybe my environment just isn't compatible with the official release.
@DelightRun This is not expected. Could you share your environment (machine details, and the latest DGL commit you built from)? Also, what nightly DGL version did you use? Is it possible to share a minimal repro, or are you just running the examples in DGL?
We have identified the root cause: Cython does not release the Python GIL when invoking a C API. This can cause a deadlock in some corner cases in distributed training. A patch has been submitted as a PR and will be merged soon.
@jermainewang Awesome! But why doesn't the problem occur with the manually compiled library? Is this GIL problem related to the Cython version?
@DelightRun Yes. If you run python3 setup.py install under //dgl/python/, cython is actually picked up. Otherwise, ctypes is used, which releases the GIL automatically when calling C functions.
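To illustrate that difference with a standalone sketch (not DGL code): ctypes.CDLL releases the GIL around each foreign call, while ctypes.PyDLL keeps it held, so a blocking C call made through PyDLL stalls every other Python thread, the kind of behavior that can escalate into a deadlock in distributed message loops:

```python
# Demonstration (assumes Linux with libc available): compare wall time of
# four concurrent 1-second C sleeps with the GIL released vs. held.
import ctypes
import ctypes.util
import threading
import time

libc_path = ctypes.util.find_library("c") or "libc.so.6"
cdll = ctypes.CDLL(libc_path)    # releases the GIL during foreign calls
pydll = ctypes.PyDLL(libc_path)  # holds the GIL during foreign calls

def timed_parallel_sleeps(sleep_fn):
    """Run four 1-second C sleeps on four threads; return wall time."""
    start = time.monotonic()
    threads = [threading.Thread(target=sleep_fn, args=(1,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start

released = timed_parallel_sleeps(cdll.sleep)  # sleeps overlap: ~1s total
held = timed_parallel_sleeps(pydll.sleep)     # sleeps serialize: ~4s total
print(f"GIL released: {released:.1f}s, GIL held: {held:.1f}s")
```

With the GIL held across a blocking call, every other thread (including ones that would have produced or consumed the awaited message) is frozen, which matches the hang symptom described in this thread.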
🐛 Bug
Recently, we have received reports that the program hangs at the very beginning of distributed training. I found this is more likely when many trainers are booted and a cython-enabled DGL is used (cython is enabled in the packages from dgl.ai). I checked the call stacks of the clients and servers, and all of them seem to be waiting for messages from the opposite peer. It looks weird: some messages seem to be lost. Though messages have been sent out from the clients, they do not arrive at the server.

For now, I have a workaround (WAR) for this: https://github.com/dmlc/dgl/pull/3867. It fixes the hang issue (at least in my several tests), though the root cause has not been found yet. This WAR sends messages blockingly, which may slow down message sending, so some performance degradation could occur.

I will continue working on this until the root cause is found.
To Reproduce
//tools/launch.py --num_trainers 16 .... //examples/pytorch/graphsage/dist/train_dist.py
Expected behavior
Environment
conda
,pip
, source): pipAdditional context