NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Segfault during NCCL Process Group Tear Down #1820

Open samnordmann opened 4 months ago

samnordmann commented 4 months ago

What

For a long time now, @cowanmeg and I have been experiencing a segfault during teardown that we couldn't fix despite many attempts. I suspect the issue comes from either a misuse of process groups in nvFuser or directly from core PyTorch. If needed I can open an issue on the PyTorch repo, but I wanted to gather your opinions first.

@naoyam @xwang233 do you have an idea?

The segfault can typically be reproduced on the upstream/main branch by executing, e.g.,

mpirun -np 6 test_multidevice --gtest_filter=CommunicatorBackend/CommunicationTest.Communication_Gather/0

The backtrace looks like this:

 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000004f1f1 c10::cuda::CUDAKernelLaunchRegistry::has_failed()  ???:0
 2 0x00000000000501dd c10::cuda::c10_cuda_check_implementation()  ???:0
 3 0x00000000000508ca c10::cuda::ExchangeDevice()  ???:0
 4 0x0000000000ef8392 at::cuda::CUDAEvent::~CUDAEvent()  :0
 5 0x0000000000e7920a std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()  :0
 6 0x0000000000f5e809 c10d::ProcessGroupNCCL::WorkNCCL::~WorkNCCL()  ???:0
 7 0x0000000000f5eea9 c10d::ProcessGroupNCCL::~ProcessGroupNCCL()  ???:0
 8 0x0000000000f5f29d c10d::ProcessGroupNCCL::~ProcessGroupNCCL()  ???:0
 9 0x00000000018cf17c std::default_delete<c10d::Backend>::operator()()  /usr/include/c++/11/bits/unique_ptr.h:85
10 0x00000000018cddb6 std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> >::~unique_ptr()  /usr/include/c++/11/bits/unique_ptr.h:361
11 0x00000000018d28ae std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > >::~pair()  /usr/include/c++/11/bits/stl_pair.h:211
12 0x00000000018d28de __gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > > > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > > >()  /usr/include/c++/11/ext/new_allocator.h:168
13 0x00000000018d244f std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > > > > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > > >()  /usr/include/c++/11/bits/alloc_traits.h:535
14 0x00000000018d19b9 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > > > >::_M_destroy_node()  /usr/include/c++/11/bits/stl_tree.h:623
15 0x00000000018d05c7 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > > > >::_M_drop_node()  /usr/include/c++/11/bits/stl_tree.h:631
16 0x00000000018cf47b std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > > > >::_M_erase()  /usr/include/c++/11/bits/stl_tree.h:1891
17 0x00000000018ce0b2 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > > > >::~_Rb_tree()  /usr/include/c++/11/bits/stl_tree.h:984
18 0x00000000018cd548 std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<c10d::Backend, std::default_delete<c10d::Backend> > > > >::~map()  /usr/include/c++/11/bits/stl_map.h:302
19 0x00000000018caa18 nvfuser::Communicator::~Communicator()  /opt/pytorch/Fuser/csrc/multidevice/communicator.cpp:217
20 0x00000000001b641c std::default_delete<nvfuser::Communicator>::operator()()  /usr/include/c++/11/bits/unique_ptr.h:85
21 0x00000000001a972c std::unique_ptr<nvfuser::Communicator, std::default_delete<nvfuser::Communicator> >::~unique_ptr()  /usr/include/c++/11/bits/unique_ptr.h:361
22 0x00000000001e252e nvfuser::MultiDeviceEnvironment::~MultiDeviceEnvironment()  /opt/pytorch/Fuser/test/multidevice.h:19
23 0x00000000001e255a nvfuser::MultiDeviceEnvironment::~MultiDeviceEnvironment()  /opt/pytorch/Fuser/test/multidevice.h:19
24 0x00000000002bddc2 testing::internal::Delete<testing::Environment>()  /opt/pytorch/Fuser/third_party/googletest/googletest/src/gtest-internal-inl.h:351
25 0x00000000002d3949 std::for_each<__gnu_cxx::__normal_iterator<testing::Environment* const*, std::vector<testing::Environment*, std::allocator<testing::Environment*> > >, void (*)(testing::Environment*)>()  /usr/include/c++/11/bits/stl_algo.h:3820
26 0x00000000002cc3ff testing::internal::ForEach<std::vector<testing::Environment*, std::allocator<testing::Environment*> >, void (*)(testing::Environment*)>()  /opt/pytorch/Fuser/third_party/googletest/googletest/src/gtest-internal-inl.h:303
27 0x00000000002aed36 testing::internal::UnitTestImpl::~UnitTestImpl()  /opt/pytorch/Fuser/third_party/googletest/googletest/src/gtest.cc:5546
28 0x00000000002aee8c testing::internal::UnitTestImpl::~UnitTestImpl()  /opt/pytorch/Fuser/third_party/googletest/googletest/src/gtest.cc:5549
29 0x00000000002ae76e testing::UnitTest::~UnitTest()  

There seems to be only "one good way" to use those process groups, and I am not sure what it is.

Takeaways

Some takeaways I want to share here:

destroy vs destruct PG?

If I set the environment variable TORCH_NCCL_ABORT_IN_DESTROY_PG=1, then I hit this warning message:

WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.3

and then the same segfault as before. I am not sure I understand the warning message. I took a look at the Python function destroy_process_group and it seems to require many things to be done there. I don't understand why that is not encapsulated in the C++ class destructor.
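
As far as I understand it, what the warning asks for would look roughly like the sketch below on our side. This is only illustrative: CommunicatorSketch and cleanup() are made-up names, not nvFuser code, and the include path may differ between PyTorch versions.

#include <map>
#include <memory>
#include <string>
#include <torch/csrc/distributed/c10d/Backend.hpp> // include path is an assumption

// Illustrative stand-in for nvfuser::Communicator, which (per the backtrace)
// owns its backends in a std::map<std::string, std::unique_ptr<c10d::Backend>>.
struct CommunicatorSketch {
  std::map<std::string, std::unique_ptr<c10d::Backend>> backends;

  // Explicit "destroy" step: release the process groups at a well-defined
  // point (e.g. at the end of main), while CUDA/NCCL are still alive,
  // instead of relying on the destructor of a global object running during
  // static destruction.
  void cleanup() {
    backends.clear(); // runs ~ProcessGroupNCCL now, not at program exit
  }
};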

watchdog and waitForPendingWorks

If, in the Communicator destructor, I call waitForPendingWorks() on all created ProcessGroupNCCL instances, the program hangs forever.
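
Schematically, the attempt was something like the following. This is a simplified sketch: the free function and its argument are illustrative, not the actual Communicator code, and the include path may vary with the PyTorch version.

#include <map>
#include <memory>
#include <string>
#include <torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp> // include path is an assumption

// Before the map owning the backends is destroyed, wait for any outstanding
// NCCL work on each ProcessGroupNCCL. As noted above, this hangs forever.
void waitForAllNcclWork(
    std::map<std::string, std::unique_ptr<c10d::Backend>>& backends) {
  for (auto& kv : backends) {
    if (auto* nccl_pg = dynamic_cast<c10d::ProcessGroupNCCL*>(kv.second.get())) {
      nccl_pg->waitForPendingWorks(); // blocks until pending WorkNCCL objects complete
    }
  }
}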

previous error on ProcessGroupUCC creation

I don't know if it is related, but @cowanmeg saw a strange segfault in #1266 about this line, which was fixed by replacing c10::make_intrusive<::c10d::ProcessGroupUCC>(store, rank, size, pg_opts); with c10d::ProcessGroupUCC::createProcessGroupUCC(store, rank, size, timeout). Of course, this fix doesn't really make sense... so I'm mentioning it here just in case.

naoyam commented 4 months ago

I'm not familiar with ProcessGroup, but I wonder if there might be a CUDA error before this. It looks like the error happens while deleting a CUDA event, which may not be the original error.

naoyam commented 4 months ago

Try cuda-gdb, as it can immediately detect CUDA errors.

samnordmann commented 4 months ago

cuda-gdb hangs forever.

I also tried with NCCL debug log and I'm getting this:

[0] init.cc:1780 NCCL WARN Cuda failure 'driver shutting down'
[808530490] init.cc:1919 NCCL WARN commReclaim: comm 0x7f4eeb67f000 (rank = 0) in abort, error 1
[32590] NCCL INFO [Service thread] Connection closed by localRank 0
[32590] init.cc:1812 NCCL WARN Cuda failure 'driver shutting down'
[808530490] init.cc:1954 NCCL WARN commReclaim: cleanup comm 0x7f4eeb67f000 rank 0 failed in destroy/abort, error 1
[32590] NCCL INFO comm 0x7f4eeb67f000 rank 0 nranks 2 cudaDev 0 busId 6000 - Abort COMPLETE

samnordmann commented 4 months ago

Now that I am thinking about it, the problem may come from the fact that we initialize the testing environment (and thus the communicator) as a global variable. So we rely on static destruction order for cleanup, which might cause an issue.

To change that, I would need to do

MultiDeviceEnvironment* multidevice_env = static_cast<MultiDeviceEnvironment*>(
    testing::AddGlobalTestEnvironment(new MultiDeviceEnvironment));

before RUN_ALL_TESTS() in gtest/main.cc, and somehow pass the multidevice_env pointer to my unit test fixture MultiDeviceTest. I don't know how to do it properly. I've been trying for the last hour but got really confused with the static/extern specifiers... ^^
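
Roughly the shape I am aiming for, if that helps (just an idea, with simplified stand-ins for the real MultiDeviceEnvironment/MultiDeviceTest; the getEnv() accessor does not exist in the repo):

#include <gtest/gtest.h>

// Simplified sketch. gtest takes ownership of environments registered through
// AddGlobalTestEnvironment and deletes them after RUN_ALL_TESTS(), so the
// Communicator would no longer be destroyed during static destruction.
class MultiDeviceEnvironment : public testing::Environment {
 public:
  // SetUp()/TearDown() would create/destroy the Communicator here.
  static MultiDeviceEnvironment*& getEnv() {
    static MultiDeviceEnvironment* env = nullptr; // set once in main()
    return env;
  }
};

class MultiDeviceTest : public testing::Test {
 protected:
  MultiDeviceEnvironment* multidevice_env_ = MultiDeviceEnvironment::getEnv();
};

int main(int argc, char** argv) {
  testing::InitGoogleTest(&argc, argv);
  auto* env = static_cast<MultiDeviceEnvironment*>(
      testing::AddGlobalTestEnvironment(new MultiDeviceEnvironment));
  MultiDeviceEnvironment::getEnv() = env;
  return RUN_ALL_TESTS();
}

This would mean replacing gtest's stock main.cc with our own, so that the environment's lifetime is controlled by gtest rather than by static destruction order; I'm not sure it solves the teardown problem by itself.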

@xwang233 do you have an idea?

naoyam commented 4 months ago

What's the right order and what might be the current failing order?

xwang233 commented 4 months ago

Now that I am thinking about it, the problem may come from the fact that we initialize the testing environment (and thus the communicator) as a global variable. So we rely on static destruction order for cleanup, which might cause an issue. To change that, I would need to do

MultiDeviceEnvironment* multidevice_env = static_cast<MultiDeviceEnvironment*>(
    testing::AddGlobalTestEnvironment(new MultiDeviceEnvironment));

before RUN_ALL_TESTS() in gtest/main.cc, and somehow pass the multidevice_env pointer to my unit test fixture MultiDeviceTest. I don't know how to do it properly. I've been trying for the last hour but got really confused with the static/extern specifiers... ^^

@xwang233 do you have an idea?

I'm still looking into that and don't have suggestions for now.

samnordmann commented 4 months ago

What's the right order and what might be the current failing order?

I don't know how to figure this out... so I'm just guessing here. If the CUDA driver is shut down before NCCL is done cleaning up, then we will get an error. This is suggested by the fact that NCCL prints:

[0] init.cc:1780 NCCL WARN Cuda failure 'driver shutting down'

naoyam commented 4 months ago

I'd still use cuda-gdb. It should not hang. You just need to let it run and come back a long time later.

cowanmeg commented 4 months ago

cuda-gdb hangs forever.

I also tried with NCCL debug log and I'm getting this:

[0] init.cc:1780 NCCL WARN Cuda failure 'driver shutting down'
[808530490] init.cc:1919 NCCL WARN commReclaim: comm 0x7f4eeb67f000 (rank = 0) in abort, error 1
[32590] NCCL INFO [Service thread] Connection closed by localRank 0
[32590] init.cc:1812 NCCL WARN Cuda failure 'driver shutting down'
[808530490] init.cc:1954 NCCL WARN commReclaim: cleanup comm 0x7f4eeb67f000 rank 0 failed in destroy/abort, error 1
[32590] NCCL INFO comm 0x7f4eeb67f000 rank 0 nranks 2 cudaDev 0 busId 6000 - Abort COMPLETE

Try setting the heartbeat timeout to something lower, like 30 seconds: TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=30

Some things I tried/noticed:

  1. In the hanging runs I only see one Abort COMPLETE line (NCCL INFO comm 0x7fb9818a5000 rank 2 nranks 3 cudaDev 2 busId a000 - Abort COMPLETE), but when teardown is successful I see one Abort COMPLETE line per rank (e.g. rank 2 nranks 6 cudaDev 2 busId a000 - Abort COMPLETE), with nranks equal to the number of processes.
  2. PipelineTest.Pipeline segfaults with UCC.
  3. PipelineTestTwoStage tests are all passing with both UCC and NCCL and teardown is successful when there are 4 devices, but Gather/PipelineTestTwoStages.Communication/26 (UCC test) is hanging with 5+ devices. These tests only need 4 devices maximum, so I'm not sure why one is failing when there is an extra non-participating device.