Open samnordmann opened 4 months ago
I'm not familiar with ProcessGroup, but I wonder if there might be a CUDA error before this. Looks like the error happens at deleting a CUDA event, which may not be the original error.
Try cuda-gdb as it can immediately detect CUDA errors.
cuda-gdb hangs forever.
I also tried with NCCL debug log and I'm getting this:
[0] init.cc:1780 NCCL WARN Cuda failure 'driver shutting down'
[808530490] init.cc:1919 NCCL WARN commReclaim: comm 0x7f4eeb67f000 (rank = 0) in abort, error 1
[32590] NCCL INFO [Service thread] Connection closed by localRank 0
[32590] init.cc:1812 NCCL WARN Cuda failure 'driver shutting down'
[808530490] init.cc:1954 NCCL WARN commReclaim: cleanup comm 0x7f4eeb67f000 rank 0 failed in destroy/abort, error 1
[32590] NCCL INFO comm 0x7f4eeb67f000 rank 0 nranks 2 cudaDev 0 busId 6000 - Abort COMPLETE
Now that I am thinking about it, the problem may come from the fact that we initialize the testing environment (and thus the communicator) as a global variable, so we rely on the static destruction order for cleanup, which might cause an issue.
To change that, I would need to do
MultiDeviceEnvironment* multidevice_env = static_cast<MultiDeviceEnvironment*>(
    testing::AddGlobalTestEnvironment(new MultiDeviceEnvironment));
before RUN_ALL_TESTS() in gtest/main.cc, and somehow pass the pointer multidevice_env to my unit test MultiDeviceTest. I don't know how to do it properly.
I've been trying for the last hour but got really confused with the static/extern specifiers... ^^
@xwang233 do you have an idea ?
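For reference, a minimal sketch of one way to wire this up, assuming MultiDeviceEnvironment lives in a shared header; the file split, the extern global multidevice_env, and the fixture body below are illustrative, not nvFuser's actual code:

```cpp
// multidevice_env.h -- hypothetical shared header
#include <gtest/gtest.h>

class MultiDeviceEnvironment : public testing::Environment {
 public:
  void SetUp() override { /* create the communicator here */ }
  void TearDown() override { /* destroy the communicator here */ }
};

// Declared extern in the header, defined exactly once in main.cc.
extern MultiDeviceEnvironment* multidevice_env;

// main.cc
MultiDeviceEnvironment* multidevice_env = nullptr;

int main(int argc, char** argv) {
  testing::InitGoogleTest(&argc, argv);
  // gtest takes ownership of the environment and runs TearDown() after the
  // last test, i.e. before main() returns and static destruction begins.
  multidevice_env = static_cast<MultiDeviceEnvironment*>(
      testing::AddGlobalTestEnvironment(new MultiDeviceEnvironment));
  return RUN_ALL_TESTS();
}

// In the test file, the fixture just reads the extern pointer.
class MultiDeviceTest : public testing::Test {
 protected:
  void SetUp() override {
    ASSERT_NE(multidevice_env, nullptr);
  }
};
```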
What's the right order and what might be the current failing order?
I'm still looking into that and don't have suggestions for now.
> What's the right order and what might be the current failing order?
I don't know how to figure this out... So I'm just guessing here. If the CUDA resource is shut down before NCCL is done, then we will have an error. This is suggested by the fact that NCCL is printing
[0] init.cc:1780 NCCL WARN Cuda failure 'driver shutting down'
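That hypothesized failure mode is easy to reproduce outside nvFuser. A minimal, self-contained illustration (not nvFuser code) of a global object whose destructor calls into CUDA during static destruction:

```cpp
// Sketch of the suspected failure mode: a global object whose destructor
// makes a CUDA call may be destroyed only during static destruction, after
// the CUDA runtime has already begun shutting down.
#include <cstdio>
#include <cuda_runtime.h>

struct Comm {
  cudaEvent_t event{};
  Comm() { cudaEventCreate(&event); }
  ~Comm() {
    // If this destructor runs after CUDA teardown has started, this call
    // can return cudaErrorCudartUnloading, whose error string is exactly
    // "driver shutting down".
    cudaError_t err = cudaEventDestroy(event);
    if (err != cudaSuccess) {
      std::printf("cudaEventDestroy: %s\n", cudaGetErrorString(err));
    }
  }
};

Comm global_comm;  // destruction order relative to CUDA teardown is unspecified

int main() {
  return 0;
}
```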
I'd still use cuda-gdb. It should not hang. You just need to let it run and come back a long time later.
> cuda-gdb hangs forever.
> I also tried with NCCL debug log and I'm getting this:
> [0] init.cc:1780 NCCL WARN Cuda failure 'driver shutting down'
> [...]
Try setting the heartbeat timeout to something lower, like 30 seconds:
TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=30
Some things I tried/noticed:
- Gather/PipelineTestTwoStages.Communication/26 (a UCC test) hangs with 5+ devices. These tests need at most 4 devices, so I'm not sure why one fails when there is an extra, non-participating device.
For a long time now, @cowanmeg and I have been experiencing a segfault during teardown that we couldn't fix despite many attempts. I suspect the issue has more to do with either a misuse of process groups in nvFuser, or with core PyTorch itself. If needed I can open an issue on the PyTorch repo, but I wanted to gather your opinions first.
@naoyam @xwang233 do you have an idea ?
The segfault can typically be reproduced on the upstream/main branch by executing, e.g.,
The backtrace looks like this:
There seems to be only "one good way" to use those process groups, and I am not sure what it is.
Takeaways
Some takeaways I want to share here:
destroy vs. destruct pg?
If I add the flag
TORCH_NCCL_ABORT_IN_DESTROY_PG=1
then I hit a warning message, followed by the same segfault as before. I am not sure I understand the warning message. I took a look at the Python function destroy_process_group and it seems to require many things to be done there; I don't understand why this is not encapsulated in the C++ class destructor.
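One way to act on this takeaway, sketched below: release every process group in Environment::TearDown(), which gtest runs before main() returns, so the NCCL/UCC destructors execute while the CUDA driver is still alive. The pgs_ member is a hypothetical stand-in for however nvFuser's communicator stores its process groups, and the include path follows PyTorch's c10d headers (it may differ by version):

```cpp
#include <gtest/gtest.h>
#include <torch/csrc/distributed/c10d/ProcessGroup.hpp>
#include <vector>

class MultiDeviceEnvironment : public testing::Environment {
 public:
  void TearDown() override {
    // Dropping the intrusive_ptrs here runs the ProcessGroup destructors
    // deterministically, instead of during static destruction after CUDA
    // has already started shutting down.
    pgs_.clear();
  }

 private:
  // Hypothetical storage for the process groups created during the tests.
  std::vector<c10::intrusive_ptr<c10d::ProcessGroup>> pgs_;
};
```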
watchdog and waitForPendingWorks
If, in the Comm destructor, I call waitForPendingWorks() on all created ProcessGroupNCCL instances, the program hangs forever.
Previous error on ProcessGroupUCC creation
Don't know if it is related, but @cowanmeg saw a strange segfault in #1266 about this line, which was fixed by replacing
c10::make_intrusive<::c10d::ProcessGroupUCC>(store, rank, size, pg_opts);
with
c10d::ProcessGroupUCC::createProcessGroupUCC(store, rank, size, timeout);
Of course, this fix doesn't really make sense... so I'm mentioning it here just in case.