wujingyue commented 1 week ago

CommunicationTest.SendRecv/UCC runs OK in GitHub CI, which uses V100x4 and A100x4, but fails consistently on H100.

@csarofeen and I managed to reproduce this on viking-prod-231 in partition viking-prod-pjnl.

$ git rev-parse HEAD
61a77e0a64d5bc446ba1c009f04a19204a28eab2

$ _bn && NVIDIA_TF32_OVERRIDE=0 mpirun -np 4 bin/test_multidevice

Other tests pass with CommunicationTest.SendRecv/UCC excluded.

$ _bn && NVIDIA_TF32_OVERRIDE=0 mpirun -np 4 bin/test_multidevice --gtest_filter=-CommunicationTest.SendRecv/UCC
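
For anyone who wants to tell a hang apart from an ordinary failure, here is a rough sketch that runs only the problematic test under a wall-clock limit. The helper names and the 120-second default are made up for illustration, and the positive filter is simply the negation of the one above. `timeout` is the GNU coreutils tool, which exits with status 124 when the limit is hit.

```shell
# Hypothetical helper: run only the suspect test, bounded by a time limit.
# $1 is the limit in seconds (120 is an arbitrary default).
run_sendrecv_ucc() {
  NVIDIA_TF32_OVERRIDE=0 timeout "${1:-120}" \
    mpirun -np 4 bin/test_multidevice --gtest_filter='CommunicationTest.SendRecv/UCC'
}

# Map the exit status to a short verdict; 124 is what `timeout` returns on expiry.
classify_status() {
  case "$1" in
    0)   echo "passed" ;;
    124) echo "hung (timed out)" ;;
    *)   echo "failed (exit $1)" ;;
  esac
}
```

Assumed usage from the build directory: `run_sendrecv_ucc 120; classify_status $?`.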

wujingyue commented 1 week ago

@samnordmann would you mind taking a look?

samnordmann commented 1 week ago

I can reproduce the issue on a viking H100 DGX node, and I can explain what is going on.

What

There is a known incompatibility between the user's stream operations and UCX when it uses NVLink over CUDA IPC, and it can cause hangs. This is what we are seeing here: both UCX and nvFuser post operations on the stream, and this causes a deadlock.

Temporary workaround

We can disable CUDA IPC in UCX by setting UCX_RNDV_THRESH=0 and UCX_TLS=ib,cuda_copy. With these set, the command

mpirun -np 4 -x UCX_RNDV_THRESH=0 -x UCX_TLS=ib,cuda_copy bin/test_multidevice --gtest_filter=-CommunicationTest.SendRecv/UCC

executes smoothly.

With those flags, UCX will use GPU-direct RDMA, so we probably need the node to have a capable NIC. GPU-direct RDMA is stream-less, so there is no deadlock issue.
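
Since this is meant to be temporary, one option is to keep the two settings in a single place so they are easy to drop once the long-term fix lands. The sketch below only prints the command line. The function name and the `USE_UCX_WORKAROUND` toggle are hypothetical; the `-x` options are the ones from the command above.

```shell
# Workaround settings from this comment, kept together for easy removal.
UCX_WORKAROUND_FLAGS="-x UCX_RNDV_THRESH=0 -x UCX_TLS=ib,cuda_copy"

# Print the mpirun command line for the multidevice tests.
# Set USE_UCX_WORKAROUND=0 (hypothetical toggle) to omit the UCX settings.
multidevice_cmd() {
  if [ "${USE_UCX_WORKAROUND:-1}" = "1" ]; then
    echo "mpirun -np 4 $UCX_WORKAROUND_FLAGS bin/test_multidevice $*"
  else
    echo "mpirun -np 4 bin/test_multidevice $*"
  fi
}

# Dry run: show what would be executed, using the same filter as above.
multidevice_cmd --gtest_filter=-CommunicationTest.SendRecv/UCC
```

To actually launch the tests, the printed line can be passed to `eval`.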

Long Term fix

The UCX and UCC teams are working on a solution as part of the POR: https://redmine.mellanox.com/issues/3831841

Backtraces, for the record

Threads involved:

  Id   Target Id                                            Frame
  1    Thread 0x7fd5653fd000 (LWP 197328) "test_multidevic" 0x00007fd4e42a043c in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
  2    Thread 0x7fd4ea7ff000 (LWP 197330) "fuse"            __GI___libc_read (nbytes=271, buf=0x7fd4ea7c8670, fd=4) at ../sysdeps/unix/sysv/linux/read.c:26
  3    Thread 0x7fd4e0361000 (LWP 197332) "cuda00002000009" 0x00007fd5662cfbcf in __GI___poll (fds=0x7fd4dbe01000, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  4    Thread 0x7fd2ceab3000 (LWP 197334) "cuda-EvtHandlr"  0x00007fd5662cfbcf in __GI___poll (fds=0x7fd2ca608000, nfds=11, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
  5    Thread 0x7fd27addd000 (LWP 197336) "async"           0x00007fd5662dce2e in epoll_wait (epfd=70, events=events@entry=0x7fd27ada7760, maxevents=16, timeout=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
* 6    Thread 0x7fd2431dc000 (LWP 197338) "ucc-progress"    __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3,
    futex_word=0x7fd4e0ebee8c) at ./nptl/futex-internal.c:103
  7    Thread 0x7fd23f1db000 (LWP 197340) "test_multidevic" 0x00007fd5662de45f in __libc_accept (fd=179, addr=..., len=0x7fd23f1a57a0) at ../sysdeps/unix/sysv/linux/accept.c:26
  8    Thread 0x7fd23afda000 (LWP 197341) "pt_nccl_watchdg" __futex_abstimed_wait_common64 (private=2134846251, cancel=true, abstime=0x7fd23afa42c0, op=137, expected=0,
    futex_word=0x7fd4e0f11688) at ./nptl/futex-internal.c:57
  9    Thread 0x7fd2369ff000 (LWP 197342) "pt_nccl_heartbt" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7fd2369c95f0, op=137, expected=0, futex_word=0x7fd4e0f116b8)
    at ./nptl/futex-internal.c:57

Backtrace of the main thread:

#0  0x00007fd4e42a043c in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#1  0x00007fd4e3f5368c in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#2  0x00007fd4e429ee48 in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#3  0x00007fd4e400570f in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#4  0x00007fd4e3feb57a in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#5  0x00007fd4e3ff1b4d in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#6  0x00007fd4e4059574 in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#7  0x00007fd56723bcc3 in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#8  0x00007fd56723c410 in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#9  0x00007fd56723c47e in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#10 0x00007fd56723f100 in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#11 0x00007fd567215a4e in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#12 0x00007fd567275a73 in cudaLaunchKernel () from /usr/local/cuda/lib64/libcudart.so.12
#13 0x00007fd56905c535 in void at::native::gpu_kernel_impl_nocast<at::native::FillFunctor<float> >(at::TensorIteratorBase&, at::native::FillFunctor<float> const&) ()
   from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#14 0x00007fd56904b5cb in at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar const&)::{lambda()#1}::operator()() const ()
   from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#15 0x00007fd56904d4f2 in at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar const&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#16 0x00007fd5945e99d5 in at::native::fill_out(at::Tensor&, c10::Scalar const&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#17 0x00007fd56aaba5f1 in at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_Scalar_fill_(at::Tensor&, c10::Scalar const&) ()
   from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#18 0x00007fd594e0872d in at::_ops::fill__Scalar::call(at::Tensor&, c10::Scalar const&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#19 0x00007fd5945e9def in at::native::zero_(at::Tensor&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#20 0x00007fd56aab9309 in at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__zero_(at::Tensor&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#21 0x00007fd597b267cc in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor& (c10::DispatchKeySet, at::Tensor&), &torch::ADInplaceOrView::(anonymous namespace)::zero_>, at::Tensor&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor&> >, at::Tensor& (c10::DispatchKeySet, at::Tensor&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#22 0x00007fd5972d1e24 in torch::autograd::VariableType::(anonymous namespace)::zero_(c10::DispatchKeySet, at::Tensor&) ()
   from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#23 0x00007fd595312253 in at::_ops::zero_::call(at::Tensor&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#24 0x00007fd5948ac9b3 in at::native::zeros_symint(c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>) ()
   from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#25 0x00007fd5956e6fdb in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__zeros>, at::Tensor, c10::guts::typelist::typelist<c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool> > >, at::Tensor (c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>) ()
   from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#26 0x00007fd594dbd7a9 in at::_ops::zeros::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#27 0x00007fd59551f564 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>), &at::(anonymous namespace)::zeros>, at::Tensor, c10::guts::typelist::typelist<c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool> > >, at::Tensor (c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#28 0x00007fd594e2048f in at::_ops::zeros::call(c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>) ()
   from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#29 0x00007fd568728002 in c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#30 0x00005626042121cc in nvfuser::Communicator::barrier (this=0x7fd4e0b2e460, backend=std::optional<nvfuser::CommunicatorBackend> [no contained value])
    at /opt/pytorch/Fuser_local/csrc/multidevice/communicator.cpp:308
#31 0x000056260462593a in nvfuser::MultiDeviceTest::~MultiDeviceTest (this=0x7fd4e0ddba20, __in_chrg=<optimized out>) at /opt/pytorch/Fuser_local/tests/cpp/multidevice.cpp:87
#32 0x000056260464c775 in nvfuser::CommunicationTest::~CommunicationTest (this=0x7fd4e0ddba20, __in_chrg=<optimized out>)
    at /opt/pytorch/Fuser_local/tests/cpp/test_multidevice_communications.cpp:23
#33 0x000056260465488f in nvfuser::CommunicationTest_SendRecv_Test::~CommunicationTest_SendRecv_Test (this=0x7fd4e0ddba20, __in_chrg=<optimized out>)
    at /opt/pytorch/Fuser_local/tests/cpp/test_multidevice_communications.cpp:208
#34 0x00005626046548b8 in nvfuser::CommunicationTest_SendRecv_Test::~CommunicationTest_SendRecv_Test (this=0x7fd4e0ddba20, __in_chrg=<optimized out>)
    at /opt/pytorch/Fuser_local/tests/cpp/test_multidevice_communications.cpp:208
#35 0x000056260478e2e2 in testing::Test::DeleteSelf_ (this=0x7fd4e0ddba20) at /opt/pytorch/Fuser_local/third_party/googletest/googletest/include/gtest/gtest.h:336
#36 0x000056260479e66d in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=0x7fd4e0ddba20,
    method=(void (testing::Test::*)(testing::Test * const)) 0x56260478e2b4 <testing::Test::DeleteSelf_()>, location=0x562604c306df "the test fixture's destructor")
    at /opt/pytorch/Fuser_local/third_party/googletest/googletest/src/gtest.cc:2612
#37 0x0000562604797e05 in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0x7fd4e0ddba20,
    method=(void (testing::Test::*)(testing::Test * const)) 0x56260478e2b4 <testing::Test::DeleteSelf_()>, location=0x562604c306df "the test fixture's destructor")
    at /opt/pytorch/Fuser_local/third_party/googletest/googletest/src/gtest.cc:2648
#38 0x0000562604774121 in testing::TestInfo::Run (this=0x7fd4e3d1c8c0) at /opt/pytorch/Fuser_local/third_party/googletest/googletest/src/gtest.cc:2842
#39 0x0000562604774acd in testing::TestSuite::Run (this=0x7fd4e3c9b9c0) at /opt/pytorch/Fuser_local/third_party/googletest/googletest/src/gtest.cc:3015
#40 0x0000562604784ff4 in testing::internal::UnitTestImpl::RunAllTests (this=0x7fd4e0d4e300) at /opt/pytorch/Fuser_local/third_party/googletest/googletest/src/gtest.cc:5920
#41 0x000056260479f608 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x7fd4e0d4e300,
    method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x562604784bda <testing::internal::UnitTestImpl::RunAllTests()>,
    location=0x562604c30fa0 "auxiliary test code (environments or event listeners)") at /opt/pytorch/Fuser_local/third_party/googletest/googletest/src/gtest.cc:2612
#42 0x0000562604798f09 in testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x7fd4e0d4e300,
    method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x562604784bda <testing::internal::UnitTestImpl::RunAllTests()>,
    location=0x562604c30fa0 "auxiliary test code (environments or event listeners)") at /opt/pytorch/Fuser_local/third_party/googletest/googletest/src/gtest.cc:2648
#43 0x00005626047835f5 in testing::UnitTest::Run (this=0x562605203800 <testing::UnitTest::GetInstance()::instance>)
    at /opt/pytorch/Fuser_local/third_party/googletest/googletest/src/gtest.cc:5484
#44 0x0000562604627dcf in RUN_ALL_TESTS () at /opt/pytorch/Fuser_local/third_party/googletest/googletest/include/gtest/gtest.h:2317
#45 0x000056260462611c in main (argc=1, argv=0x7ffeb8a47258) at /opt/pytorch/Fuser_local/tests/cpp/multidevice.cpp:161

Backtrace of the ucc-progress thread:

#0  __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x7fd4e0ebee8c) at ./nptl/futex-internal.c:103
#1  __GI___futex_abstimed_wait64 (futex_word=futex_word@entry=0x7fd4e0ebee8c, expected=expected@entry=3, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>)
    at ./nptl/futex-internal.c:128
#2  0x00007fd56625224f in __pthread_rwlock_wrlock_full64 (abstime=0x0, clockid=0, rwlock=0x7fd4e0ebee80) at ./nptl/pthread_rwlock_common.c:730
#3  ___pthread_rwlock_wrlock (rwlock=0x7fd4e0ebee80) at ./nptl/pthread_rwlock_wrlock.c:26
#4  0x00007fd4e42a0fd4 in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#5  0x00007fd4e3f14c4e in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#6  0x00007fd4e407340c in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#7  0x00007fd5100533bd in uct_cuda_ipc_iface_init_streams (iface=iface@entry=0x7fd2a1dd0000) at cuda_ipc/cuda_ipc_iface.c:400
#8  0x00007fd510053a2e in uct_cuda_ipc_post_cuda_async_copy (direction=0, comp=0x7fd25cf32d90, rkey=140540909625456, iov=0x7fd2431a61f0, remote_addr=140479922438144, tl_ep=<optimized out>)
    at cuda_ipc/cuda_ipc_ep.c:100
#9  uct_cuda_ipc_ep_put_zcopy (tl_ep=<optimized out>, iov=0x7fd2431a61f0, iovcnt=<optimized out>, remote_addr=140479922438144, rkey=140540909625456, comp=0x7fd25cf32d90)
    at cuda_ipc/cuda_ipc_ep.c:178
#10 0x00007fd55b58f437 in uct_ep_put_zcopy (comp=0x7fd25cf32d90, rkey=<optimized out>, remote_addr=<optimized out>, iovcnt=1, iov=0x7fd2431a61f0, ep=<optimized out>)
    at /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/uct/api/uct.h:2915
#11 ucp_proto_rndv_put_common_send (comp=0x7fd25cf32d90, iov=0x7fd2431a61f0, lpriv=<optimized out>, req=0x7fd25cf32d00) at rndv/rndv_put.c:59
#12 ucp_proto_rndv_put_zcopy_send_func (lane_shift=<synthetic pointer>, next_iter=<synthetic pointer>, lpriv=<optimized out>, req=0x7fd25cf32d00) at rndv/rndv_put.c:363
#13 ucp_proto_multi_progress (dt_mask=1, complete_func=<optimized out>, send_func=<optimized out>, mpriv=0x7fd23aff63a0, req=0x7fd25cf32d00)
    at /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/ucp/proto/proto_multi.inl:177
#14 ucp_proto_multi_zcopy_progress (uct_comp_cb=<optimized out>, complete_func=<optimized out>, send_func=<optimized out>, dt_mask=1, uct_mem_flags=256, init_func=<optimized out>,
    mpriv=0x7fd23aff63a0, req=0x7fd25cf32d00) at /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/ucp/proto/proto_multi.inl:246
#15 ucp_proto_rndv_put_zcopy_send_progress (uct_req=0x7fd25cf32dd8) at rndv/rndv_put.c:373
#16 0x00007fd55b5887eb in ucp_request_try_send (req=0x7fd25cf32d00) at /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/ucp/core/ucp_request.inl:307
#17 ucp_request_send (req=<optimized out>) at /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/ucp/core/ucp_request.inl:330
#18 ucp_proto_rndv_send_start (worker=<optimized out>, op_attr_mask=<optimized out>, rtr=<optimized out>, header_length=<optimized out>, sg_count=<optimized out>, req=<optimized out>)
    at rndv/proto_rndv.c:845
#19 ucp_proto_rndv_send_start (worker=<optimized out>, req=0x7fd25cf32d00, op_attr_mask=<optimized out>, rtr=<optimized out>, header_length=<optimized out>, sg_count=<optimized out>)
    at rndv/proto_rndv.c:820
#20 0x00007fd55b5889a1 in ucp_proto_rndv_handle_rtr (arg=0x7fd275772100, data=0x7fd2c4feeac0, length=<optimized out>, flags=<optimized out>) at rndv/proto_rndv.c:902
#21 0x00007fd566175eb9 in uct_iface_invoke_am (flags=1, length=<optimized out>, data=0x7fd2c4feeac0, id=<optimized out>, iface=0x7fd2c5952200)
    at /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/uct/base/uct_iface.h:942
#22 uct_mm_iface_invoke_am (flags=1, length=<optimized out>, data=0x7fd2c4feeac0, am_id=<optimized out>, iface=0x7fd2c5952200) at sm/mm/base/mm_iface.h:278
#23 uct_mm_iface_process_recv (iface=0x7fd2c5952200) at sm/mm/base/mm_iface.c:321
#24 uct_mm_iface_poll_fifo (iface=0x7fd2c5952200) at sm/mm/base/mm_iface.c:353
#25 uct_mm_iface_progress (tl_iface=0x7fd2c5952200) at sm/mm/base/mm_iface.c:406
#26 0x00007fd55b56564a in ucs_callbackq_dispatch (cbq=<optimized out>) at /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/ucs/datastruct/callbackq.h:215
#27 uct_worker_progress (worker=<optimized out>) at /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/uct/api/uct.h:2787
#28 ucp_worker_progress (worker=0x7fd275772100) at core/ucp_worker.c:2996
#29 0x00007fd4ee742d81 in ucc_tl_ucp_test (task=0x7fd4db8d69c0) at bcast/../tl_ucp_coll.h:399
#30 ucc_tl_ucp_bcast_knomial_progress (coll_task=0x7fd4db8d69c0) at bcast/bcast_knomial.c:39
#31 0x00007fd5674d941e in ucc_pq_mt_progress (pq=0x7fd25cf16440) at core/ucc_progress_queue_mt.c:78
#32 0x00007fd5674d343d in ucc_progress_queue (pq=<optimized out>) at core/ucc_progress_queue.h:48
#33 ucc_context_progress (context=0x7fd2c53c5e80) at core/ucc_context.c:988
#34 0x00007fd56877aff3 in c10d::CommUCC::progress() () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#35 0x00007fd56876b5cd in c10d::Comm::progress_loop() () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#36 0x00007fd5664bc253 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#37 0x00007fd56624bac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#38 0x00007fd5662dd850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
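
In case someone needs to collect the same information on another node, here is a sketch of one way to do it, not necessarily how the dumps above were taken. It attaches gdb to a hung rank in batch mode and prints the thread list plus every thread's backtrace. The helper name is made up, and the PID is whatever the hung process has on your machine.

```shell
# Hypothetical helper: print a gdb command that dumps the thread list and
# all thread backtraces for the process whose PID is given as $1.
gdb_bt_cmd() {
  echo "gdb -p $1 -batch -ex 'info threads' -ex 'thread apply all bt'"
}

# Assumed usage, one dump per hung rank (PIDs supplied by the caller):
#   eval "$(gdb_bt_cmd 197328)" > bt.197328.txt 2>&1
```

Attaching usually requires the same user as the target process, or root, depending on the ptrace settings of the node.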

samnordmann commented 1 week ago

@xwang233 I created an MR here, but I could not request your review. I might have done something wrong; let me know.

https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/merge_requests/13

xwang233 commented 1 week ago

LGTM. Thanks for the reminder. Feel free to cc me internally on MR in the future. 😄

samnordmann commented 1 day ago

POR ticket for the long term fix: https://redmine.mellanox.com/issues/3831841

NVIDIA / Fuser

CommunicationTest.SendRecv/UCC hangs. #3120
