Open · raninbowlalala opened this issue 1 year ago
Here is what seems to be the scenario:
Rank A:
AllGather: count 8
AllReduce: count 3 datatype 7 op 0
AllReduce: count 2 datatype 7 op 0
Rank B:
AllGather: count 8
AllReduce: count 2 datatype 7 op 0
AllReduce: count 3 datatype 7 op 0
So, the allgather is launched first, then Rank A launches allreduce operations in reverse order compared to Rank B.
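To make that concrete, here is a minimal sketch (hypothetical code, not taken from the issue, and simplified to a single communicator) of the launch pattern shown in the log above. With blocking collectives, each rank ends up sitting in an allreduce the other rank has not reached yet:

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

// Hypothetical helper (not from the issue): enqueue the three collectives
// the way the two ranks in the log appear to. Rank 0 plays "Rank A",
// rank 1 plays "Rank B". Datatype 7 is ncclFloat, op 0 is ncclSum.
void enqueue_collectives(int rank, ncclComm_t comm, cudaStream_t stream,
                         float* gatherSend, float* gatherRecv,
                         float* buf3, float* buf2) {
  // AllGather: count 8 (issued first, same order on both ranks)
  ncclAllGather(gatherSend, gatherRecv, 8, ncclFloat, comm, stream);

  if (rank == 0) {
    // Rank A: allreduce(count 3) first, then allreduce(count 2)
    ncclAllReduce(buf3, buf3, 3, ncclFloat, ncclSum, comm, stream);
    ncclAllReduce(buf2, buf2, 2, ncclFloat, ncclSum, comm, stream);
  } else {
    // Rank B: the same two allreduces, but in the opposite order. Each rank
    // now waits in a collective that its peer has not reached yet, which is
    // the classic deadlock pattern for mismatched collective ordering.
    ncclAllReduce(buf2, buf2, 2, ncclFloat, ncclSum, comm, stream);
    ncclAllReduce(buf3, buf3, 3, ncclFloat, ncclSum, comm, stream);
  }
}
```

Guaranteeing that every rank enqueues its collectives in the same order avoids this particular failure mode.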
There could be many reasons for that to hang. One is CUDA_LAUNCH_BLOCKING being passed to mpirun; not sure what the value of that is. If it's set to 1, then it's expected to hang, given each rank would wait for a different operation to complete before launching the next one.

@sjeaugey Thanks for the quick response!
Run gdb -p <pid>, then run thread apply all bt. That will give you the backtrace of each thread of the process.

I use cuDevicePrimaryCtxRetain to create a GPU context, and create 3 non-blocking streams (sharing that context), one for each operation. Could this cause the hang?
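For reference, here is a minimal sketch of the setup that question describes (hypothetical code; a single GPU assumed to be device 0, names are illustrative):

```cpp
#include <cuda.h>
#include <cuda_runtime.h>

int main() {
  cuInit(0);
  CUdevice dev;
  cuDeviceGet(&dev, 0);

  // Retain and use the device's primary context, as described in the question.
  CUcontext primaryCtx;
  cuDevicePrimaryCtxRetain(&primaryCtx, dev);
  cuCtxSetCurrent(primaryCtx);

  // One non-blocking stream per collective. Note that cudaStreamNonBlocking
  // only removes the implicit synchronization with the legacy default stream;
  // it does not by itself make work on the three streams run concurrently.
  cudaStream_t streams[3];
  for (int i = 0; i < 3; ++i)
    cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);

  // ... create one NCCL communicator per collective and launch the allgather
  // and the two allreduces on streams[0..2] ...

  for (int i = 0; i < 3; ++i) cudaStreamDestroy(streams[i]);
  cuDevicePrimaryCtxRelease(dev);
  return 0;
}
```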
The log from node1:
(gdb) thread apply all bt
Thread 21 (Thread 0x7f73d9fff700 (LWP 8248)):
Thread 20 (Thread 0x7f7482749700 (LWP 8247)):
::StdThreadingEnvironment>::RunBlockingTask(llvm::unique_function<void ()>)::{lambda()#3}> > >::_M_run() () from /workdir/UDLC/runtime/tfrt/example/test_all_reduce/pyRuntime.so
Thread 19 (Thread 0x7f73e4ffd700 (LWP 8246)):
Thread 18 (Thread 0x7f73e57fe700 (LWP 8245)):
andle>, tfrt::Argument
Thread 17 (Thread 0x7f73e5fff700 (LWP 8244)):
Thread 16 (Thread 0x7f73f4ffd700 (LWP 8243)):
Thread 15 (Thread 0x7f73f5fff700 (LWP 8242)):
Thread 14 (Thread 0x7f7482f4a700 (LWP 8241)):
Thread 13 (Thread 0x7f73f57fe700 (LWP 8240)):
::StdThreadingEnvironment>::RunBlockingTask(llvm::unique_function<void ()>)::{lambda()#3}> > >::_M_run() () from /workdir/UDLC/runtime/tfrt/example/test_all_reduce/pyRuntime.so
Thread 12 (Thread 0x7f7418be0700 (LWP 8228)):
Thread 11 (Thread 0x7f74193e1700 (LWP 8227)):
Thread 10 (Thread 0x7f7480f46700 (LWP 8226)):
Thread 9 (Thread 0x7f7481747700 (LWP 8225)):
Thread 8 (Thread 0x7f74846e2700 (LWP 8221)):
Thread 7 (Thread 0x7f7484ee3700 (LWP 8220)):
Thread 6 (Thread 0x7f74a1d88700 (LWP 8219)):
Thread 5 (Thread 0x7f74a2746700 (LWP 8218)):
Thread 4 (Thread 0x7f74a2f9a700 (LWP 8217)):
Thread 3 (Thread 0x7f74a379b700 (LWP 8216)):
Thread 2 (Thread 0x7f74a6bb2700 (LWP 8215)):
Thread 1 (Thread 0x7f7503b63740 (LWP 8212)):
x7f75039230e0, callable@entry=
The log from node0:
(gdb) thread apply all bt
Thread 18 (Thread 0x7f0498ffd700 (LWP 135332)):
Thread 17 (Thread 0x7f04997fe700 (LWP 135331)):
Thread 15 (Thread 0x7f053d7fe700 (LWP 135328)):
Thread 14 (Thread 0x7f053cffd700 (LWP 135327)):
, tfrt::gpu::wrapper::CclUniqueIdTag> const&, tfrt::StringAttribute, tfrt::ExecutionContext const&)::{lambda()#1}>(tfrt::gpu::UdlcCclCreate(int, int, tfrt::gpu::wrapper::CclType<std::array<char, 128ul>, tfrt::gpu::wrapper::CclUniqueIdTag> const&, tfrt::StringAttribute, tfrt::ExecutionContext const&)::{lambda()#1}&&)::{lambda()#1})::{lambda()#1}>(void*) () from /workdir/UDLC/runtime/tfrt/example/test_all_reduce/pyRuntime.so
Thread 13 (Thread 0x7f04fc93f700 (LWP 135326)):
ument
Thread 12 (Thread 0x7f0551fff700 (LWP 135308)):
Thread 11 (Thread 0x7f0558b25700 (LWP 135307)):
Thread 10 (Thread 0x7f0559330700 (LWP 135306)):
Thread 9 (Thread 0x7f0559b3b700 (LWP 135305)):
Thread 8 (Thread 0x7f055b2ee700 (LWP 135304)):
Thread 7 (Thread 0x7f055baef700 (LWP 135303)):
Thread 6 (Thread 0x7f057894b700 (LWP 135302)):
Thread 5 (Thread 0x7f0579307700 (LWP 135301)):
Thread 4 (Thread 0x7f0579b5b700 (LWP 135300)):
Thread 3 (Thread 0x7f057a35c700 (LWP 135299)):
Thread 2 (Thread 0x7f057d777700 (LWP 135298)):
Thread 1 (Thread 0x7f05da731740 (LWP 135295)):
Hello, I have 2 nodes with 1 GPU on each node. The NCCL version is 2.14.3, with NCCL_LAUNCH_MODE=GROUP and NCCL_DEBUG_SUBSYS=COLL,NET,P2P. I have 2 allreduce and 1 allgather per GPU, but the 3 communication ops execute out of order, so I create a uniqueId 3 times and call ncclCommInitRank() 6 times to make sure each communication gets the right data. But it sometimes hangs; the hang logs are shown above.
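For reference, a rough sketch of the communicator setup described here (hypothetical code; it assumes MPI is used to exchange the unique IDs, which the issue does not state):

```cpp
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);   // 2 ranks, 1 GPU per node
  cudaSetDevice(0);

  // One unique ID and one communicator per collective
  // (1 allgather + 2 allreduce), so 3 IDs in total.
  ncclUniqueId ids[3];
  if (rank == 0)
    for (int i = 0; i < 3; ++i) ncclGetUniqueId(&ids[i]);
  MPI_Bcast(ids, sizeof(ids), MPI_BYTE, 0, MPI_COMM_WORLD);

  // ncclCommInitRank is called 3 times per rank, 6 times across both ranks.
  ncclComm_t comms[3];
  for (int i = 0; i < 3; ++i)
    ncclCommInitRank(&comms[i], nranks, ids[i], rank);

  // ... launch the allgather on comms[0] and the two allreduces on comms[1]
  // and comms[2], each on its own stream; if the launch order can differ
  // between ranks, the hang analyzed above can still occur ...

  for (int i = 0; i < 3; ++i) ncclCommDestroy(comms[i]);
  MPI_Finalize();
  return 0;
}
```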