Open · raninbowlalala opened this issue 1 year ago
Here is what seems to be the scenario:
Rank A:
AllGather: count 8
AllReduce: count 3 datatype 7 op 0
AllReduce: count 2 datatype 7 op 0
Rank B:
AllGather: count 8
AllReduce: count 2 datatype 7 op 0
AllReduce: count 3 datatype 7 op 0
So, the allgather is launched first, then Rank A launches allreduce operations in reverse order compared to Rank B.
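To make that concrete, here is a minimal sketch (hypothetical code, not taken from the issue, and simplified to a single communicator) of the launch pattern shown in the log above. With blocking collectives, each rank ends up sitting in an allreduce the other rank has not reached yet:

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

// Hypothetical helper (not from the issue): enqueue the three collectives
// the way the two ranks in the log appear to. Rank 0 plays "Rank A",
// rank 1 plays "Rank B". Datatype 7 is ncclFloat, op 0 is ncclSum.
void enqueue_collectives(int rank, ncclComm_t comm, cudaStream_t stream,
                         float* gatherSend, float* gatherRecv,
                         float* buf3, float* buf2) {
  // AllGather: count 8 (issued first, same order on both ranks)
  ncclAllGather(gatherSend, gatherRecv, 8, ncclFloat, comm, stream);

  if (rank == 0) {
    // Rank A: allreduce(count 3) first, then allreduce(count 2)
    ncclAllReduce(buf3, buf3, 3, ncclFloat, ncclSum, comm, stream);
    ncclAllReduce(buf2, buf2, 2, ncclFloat, ncclSum, comm, stream);
  } else {
    // Rank B: the same two allreduces, but in the opposite order. Each rank
    // now waits in a collective that its peer has not reached yet, which is
    // the classic deadlock pattern for mismatched collective ordering.
    ncclAllReduce(buf2, buf2, 2, ncclFloat, ncclSum, comm, stream);
    ncclAllReduce(buf3, buf3, 3, ncclFloat, ncclSum, comm, stream);
  }
}
```

Guaranteeing that every rank enqueues its collectives in the same order avoids this particular failure mode.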
There could be many reasons for that to hang. One is CUDA_LAUNCH_BLOCKING being passed to mpirun; not sure what the value of that is. If it's set to 1, then it's expected to hang, given each rank would wait for a different operation to complete before launching the next one.

@sjeaugey Thanks for the quick response!
Run gdb -p <pid>, then run thread apply all bt. That will give you the backtrace of each thread of the process.

I use cuDevicePrimaryCtxRetain to create a GPU context, and create 3 non-blocking streams (sharing that context), one for each operation. Could this cause the hang?
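For reference, here is a minimal sketch of the setup that question describes (hypothetical code; a single GPU assumed to be device 0, names are illustrative):

```cpp
#include <cuda.h>
#include <cuda_runtime.h>

int main() {
  cuInit(0);
  CUdevice dev;
  cuDeviceGet(&dev, 0);

  // Retain and use the device's primary context, as described in the question.
  CUcontext primaryCtx;
  cuDevicePrimaryCtxRetain(&primaryCtx, dev);
  cuCtxSetCurrent(primaryCtx);

  // One non-blocking stream per collective. Note that cudaStreamNonBlocking
  // only removes the implicit synchronization with the legacy default stream;
  // it does not by itself make work on the three streams run concurrently.
  cudaStream_t streams[3];
  for (int i = 0; i < 3; ++i)
    cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);

  // ... create one NCCL communicator per collective and launch the allgather
  // and the two allreduces on streams[0..2] ...

  for (int i = 0; i < 3; ++i) cudaStreamDestroy(streams[i]);
  cuDevicePrimaryCtxRelease(dev);
  return 0;
}
```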
The log from node1:
(gdb) thread apply all bt
Thread 21 (Thread 0x7f73d9fff700 (LWP 8248)):
Thread 20 (Thread 0x7f7482749700 (LWP 8247)):
::StdThreadingEnvironment>::RunBlockingTask(llvm::unique_function<void ()>)::{lambda()#3}> > >::_M_run() () from /workdir/UDLC/runtime/tfrt/example/test_all_reduce/pyRuntime.so
Thread 19 (Thread 0x7f73e4ffd700 (LWP 8246)):
Thread 18 (Thread 0x7f73e57fe700 (LWP 8245)):
andle>, tfrt::Argument
Thread 17 (Thread 0x7f73e5fff700 (LWP 8244)):
Thread 16 (Thread 0x7f73f4ffd700 (LWP 8243)):
Thread 15 (Thread 0x7f73f5fff700 (LWP 8242)):
Thread 14 (Thread 0x7f7482f4a700 (LWP 8241)):
Thread 13 (Thread 0x7f73f57fe700 (LWP 8240)):
::StdThreadingEnvironment>::RunBlockingTask(llvm::unique_function<void ()>)::{lambda()#3}> > >::_M_run() () from /workdir/UDLC/runtime/tfrt/example/test_all_reduce/pyRuntime.so
Thread 12 (Thread 0x7f7418be0700 (LWP 8228)):
Thread 11 (Thread 0x7f74193e1700 (LWP 8227)):
Thread 10 (Thread 0x7f7480f46700 (LWP 8226)):
Thread 9 (Thread 0x7f7481747700 (LWP 8225)):
Thread 8 (Thread 0x7f74846e2700 (LWP 8221)):
Thread 7 (Thread 0x7f7484ee3700 (LWP 8220)):
Thread 6 (Thread 0x7f74a1d88700 (LWP 8219)):
Thread 5 (Thread 0x7f74a2746700 (LWP 8218)):
Thread 4 (Thread 0x7f74a2f9a700 (LWP 8217)):
Thread 3 (Thread 0x7f74a379b700 (LWP 8216)):
Thread 2 (Thread 0x7f74a6bb2700 (LWP 8215)):
Thread 1 (Thread 0x7f7503b63740 (LWP 8212)):
x7f75039230e0, callable@entry=
The log from node0:
(gdb) thread apply all bt
Thread 18 (Thread 0x7f0498ffd700 (LWP 135332)):
Thread 17 (Thread 0x7f04997fe700 (LWP 135331)):
Thread 15 (Thread 0x7f053d7fe700 (LWP 135328)):
Thread 14 (Thread 0x7f053cffd700 (LWP 135327)):
, tfrt::gpu::wrapper::CclUniqueIdTag> const&, tfrt::StringAttribute, tfrt::ExecutionContext const&)::{lambda()#1}>(tfrt::gpu::UdlcCclCreate(int, int, tfrt::gpu::wrapper::CclType<std::array<char, 128ul>, tfrt::gpu::wrapper::CclUniqueIdTag> const&, tfrt::StringAttribute, tfrt::ExecutionContext const&)::{lambda()#1}&&)::{lambda()#1})::{lambda()#1}>(void*) () from /workdir/UDLC/runtime/tfrt/example/test_all_reduce/pyRuntime.so
Thread 13 (Thread 0x7f04fc93f700 (LWP 135326)):
ument
Thread 12 (Thread 0x7f0551fff700 (LWP 135308)):
Thread 11 (Thread 0x7f0558b25700 (LWP 135307)):
Thread 10 (Thread 0x7f0559330700 (LWP 135306)):
Thread 9 (Thread 0x7f0559b3b700 (LWP 135305)):
Thread 8 (Thread 0x7f055b2ee700 (LWP 135304)):
Thread 7 (Thread 0x7f055baef700 (LWP 135303)):
Thread 6 (Thread 0x7f057894b700 (LWP 135302)):
Thread 5 (Thread 0x7f0579307700 (LWP 135301)):
Thread 4 (Thread 0x7f0579b5b700 (LWP 135300)):
Thread 3 (Thread 0x7f057a35c700 (LWP 135299)):
Thread 2 (Thread 0x7f057d777700 (LWP 135298)):
Thread 1 (Thread 0x7f05da731740 (LWP 135295)):
Hello, I have 2 nodes with 1 GPU on each node. The NCCL version is 2.14.3, with NCCL_LAUNCH_MODE=GROUP and NCCL_DEBUG_SUBSYS=COLL,NET,P2P. I have 2 allreduce and 1 allgather per GPU, but the 3 communication ops execute out of order, so I create a uniqueId 3 times and call ncclCommInitRank() 6 times to make sure each communication gets the right data. But it sometimes hangs; the hang logs are shown above.
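For reference, a rough sketch of the communicator setup described here (hypothetical code; it assumes MPI is used to exchange the unique IDs, which the issue does not state):

```cpp
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);   // 2 ranks, 1 GPU per node
  cudaSetDevice(0);

  // One unique ID and one communicator per collective
  // (1 allgather + 2 allreduce), so 3 IDs in total.
  ncclUniqueId ids[3];
  if (rank == 0)
    for (int i = 0; i < 3; ++i) ncclGetUniqueId(&ids[i]);
  MPI_Bcast(ids, sizeof(ids), MPI_BYTE, 0, MPI_COMM_WORLD);

  // ncclCommInitRank is called 3 times per rank, 6 times across both ranks.
  ncclComm_t comms[3];
  for (int i = 0; i < 3; ++i)
    ncclCommInitRank(&comms[i], nranks, ids[i], rank);

  // ... launch the allgather on comms[0] and the two allreduces on comms[1]
  // and comms[2], each on its own stream; if the launch order can differ
  // between ranks, the hang analyzed above can still occur ...

  for (int i = 0; i < 3; ++i) ncclCommDestroy(comms[i]);
  MPI_Finalize();
  return 0;
}
```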