wujingyue opened this issue 6 days ago
cc @samnordmann and @cowanmeg
~~I'll try next to reproduce this with raw process group calls, just to isolate the problem.~~
I added to the same branch a Python reproducer using the process group directly. It doesn't hang, but the timing is as weird as in my last comment: rank 1 doesn't synchronize except on the first iteration.
`$ cat test_barrier.py`:

```python
import logging
import os
import time

import torch.distributed as dist


def main():
    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12345"

    rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
    assert world_size >= 3, "This test requires at least 3 GPUs."

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    for i in range(4):
        time.sleep(rank)
        logging.info(f"Rank {rank}: enter barrier {i}")
        dist.barrier()
        logging.info(f"Rank {rank}: exit barrier {i}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```
```
$ TORCH_CPP_LOG_LEVEL=INFO TORCH_DISTRIBUTED_DEBUG=INFO mpirun -np 3 python test_barrier.py
...
2024-06-29 06:07:41,109 - Rank 0: enter barrier 0
2024-06-29 06:07:42,083 - Rank 1: enter barrier 0
2024-06-29 06:07:43,108 - Rank 2: enter barrier 0
2024-06-29 06:07:43,708 - Rank 2: exit barrier 0
2024-06-29 06:07:43,708 - Rank 1: exit barrier 0
2024-06-29 06:07:43,708 - Rank 0: exit barrier 0
2024-06-29 06:07:43,709 - Rank 0: enter barrier 1
2024-06-29 06:07:44,710 - Rank 1: enter barrier 1
2024-06-29 06:07:44,710 - Rank 1: exit barrier 1
2024-06-29 06:07:45,710 - Rank 2: enter barrier 1
2024-06-29 06:07:45,711 - Rank 0: exit barrier 1
2024-06-29 06:07:45,711 - Rank 0: enter barrier 2
2024-06-29 06:07:45,711 - Rank 2: exit barrier 1
2024-06-29 06:07:45,711 - Rank 1: enter barrier 2
2024-06-29 06:07:45,712 - Rank 1: exit barrier 2
2024-06-29 06:07:46,713 - Rank 1: enter barrier 3
2024-06-29 06:07:46,713 - Rank 1: exit barrier 3
2024-06-29 06:07:47,713 - Rank 2: enter barrier 2
2024-06-29 06:07:47,714 - Rank 2: exit barrier 2
2024-06-29 06:07:47,714 - Rank 0: exit barrier 2
2024-06-29 06:07:47,714 - Rank 0: enter barrier 3
2024-06-29 06:07:49,716 - Rank 2: enter barrier 3
2024-06-29 06:07:49,717 - Rank 2: exit barrier 3
2024-06-29 06:07:49,717 - Rank 0: exit barrier 3
...
```
Update: https://github.com/NVIDIA/Fuser/pull/2504 will fix the hang. However, the weird barrier enter/exit timing mentioned in the OP and in https://github.com/NVIDIA/Fuser/issues/2499#issuecomment-2197771152 is still worrisome: some ranks failed to wait at the barrier.
This is a bit worrisome... I vaguely recall seeing this error with ProcessGroupUCC, but it was under slightly different conditions (world size > test size), which is why so many of the tests are disabled.
tagging @eqy for help understanding https://github.com/NVIDIA/Fuser/issues/2499#issuecomment-2197771152
My guess is that this has something to do with what @Aidyn-A said regarding `barrier()` having to guess what the current device is: it seems the current device is still 0 across all ranks with this run setup.
If I add a hack to force the current device to be the current rank, then things seem to line up:
```python
import logging
import os
import time

import torch
import torch.distributed as dist


def main():
    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12345"

    rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
    assert world_size >= 3, "This test requires at least 3 GPUs."

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)  # force the current device to be this rank's GPU
    for i in range(4):
        time.sleep(rank)
        logging.info(f"Rank {rank}: enter barrier {i}")
        dist.barrier()
        logging.info(f"Rank {rank}: exit barrier {i}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```
Showing two example runs:
```
root@4a34daa55aff:/nvdl/repro/distbarrier# ./run.sh run.py
[W701 20:51:43.751653082 socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:12345 (errno: 99 - Cannot assign requested address).
[W701 20:51:43.819483082 socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:12345 (errno: 99 - Cannot assign requested address).
2024-07-01 20:51:44,705 - Rank 0: enter barrier 0
2024-07-01 20:51:45,856 - Rank 1: enter barrier 0
2024-07-01 20:51:46,858 - Rank 2: enter barrier 0
2024-07-01 20:51:47,091 - Rank 1: exit barrier 0
2024-07-01 20:51:47,091 - Rank 2: exit barrier 0
2024-07-01 20:51:47,091 - Rank 0: exit barrier 0
2024-07-01 20:51:47,092 - Rank 0: enter barrier 1
2024-07-01 20:51:48,092 - Rank 1: enter barrier 1
2024-07-01 20:51:49,093 - Rank 2: enter barrier 1
2024-07-01 20:51:49,093 - Rank 1: exit barrier 1
2024-07-01 20:51:49,093 - Rank 0: exit barrier 1
2024-07-01 20:51:49,093 - Rank 2: exit barrier 1
2024-07-01 20:51:49,093 - Rank 0: enter barrier 2
2024-07-01 20:51:50,094 - Rank 1: enter barrier 2
2024-07-01 20:51:51,095 - Rank 2: enter barrier 2
2024-07-01 20:51:51,096 - Rank 1: exit barrier 2
2024-07-01 20:51:51,096 - Rank 0: exit barrier 2
2024-07-01 20:51:51,096 - Rank 2: exit barrier 2
2024-07-01 20:51:51,096 - Rank 0: enter barrier 3
2024-07-01 20:51:52,097 - Rank 1: enter barrier 3
2024-07-01 20:51:53,098 - Rank 2: enter barrier 3
2024-07-01 20:51:53,098 - Rank 1: exit barrier 3
2024-07-01 20:51:53,098 - Rank 0: exit barrier 3
2024-07-01 20:51:53,098 - Rank 2: exit barrier 3
root@4a34daa55aff:/nvdl/repro/distbarrier# ./run.sh run.py
[W701 20:52:50.480120952 socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:12345 (errno: 99 - Cannot assign requested address).
2024-07-01 20:52:52,054 - Rank 0: enter barrier 0
2024-07-01 20:52:53,247 - Rank 1: enter barrier 0
2024-07-01 20:52:53,389 - Rank 2: enter barrier 0
2024-07-01 20:52:53,618 - Rank 1: exit barrier 0
2024-07-01 20:52:53,618 - Rank 2: exit barrier 0
2024-07-01 20:52:53,619 - Rank 0: exit barrier 0
2024-07-01 20:52:53,619 - Rank 0: enter barrier 1
2024-07-01 20:52:54,620 - Rank 1: enter barrier 1
2024-07-01 20:52:55,621 - Rank 2: enter barrier 1
2024-07-01 20:52:55,621 - Rank 1: exit barrier 1
2024-07-01 20:52:55,621 - Rank 0: exit barrier 1
2024-07-01 20:52:55,621 - Rank 2: exit barrier 1
2024-07-01 20:52:55,621 - Rank 0: enter barrier 2
2024-07-01 20:52:56,622 - Rank 1: enter barrier 2
2024-07-01 20:52:57,623 - Rank 2: enter barrier 2
2024-07-01 20:52:57,623 - Rank 0: exit barrier 2
2024-07-01 20:52:57,624 - Rank 2: exit barrier 2
2024-07-01 20:52:57,624 - Rank 0: enter barrier 3
2024-07-01 20:52:57,624 - Rank 1: exit barrier 2
2024-07-01 20:52:58,625 - Rank 1: enter barrier 3
2024-07-01 20:52:59,626 - Rank 2: enter barrier 3
2024-07-01 20:52:59,626 - Rank 1: exit barrier 3
2024-07-01 20:52:59,626 - Rank 0: exit barrier 3
2024-07-01 20:52:59,626 - Rank 2: exit barrier 3
```
Better update: note that [`barrier`](https://pytorch.org/docs/stable/distributed.html#torch.distributed.barrier) is not device-synchronizing, so logging is free to run ahead and print "exit" before the barrier actually completes.
There is another gotcha: naively adding `torch.cuda.synchronize()` won't help either, because without knowing which device to sync on it will sync on the default device, which is also incorrect. So to actually get the intended behavior, we need to sync on the corresponding device before each print:
```python
import logging
import os
import time

import torch
import torch.distributed as dist


def main():
    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12345"

    rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
    assert world_size >= 3, "This test requires at least 3 GPUs."

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    for i in range(4):
        time.sleep(rank)
        logging.info(f"Rank {rank}: enter barrier {i}")
        dist.barrier()
        torch.cuda.synchronize(rank)  # sync this rank's device before logging "exit"
        logging.info(f"Rank {rank}: exit barrier {i}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```
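As an aside, here is a sketch of an alternative that avoids the device guessing entirely. It assumes a recent PyTorch release (the `device_id` argument to `init_process_group` was added in newer versions, roughly 2.3+); the helper name and the skip fallback are my own, and the distributed path requires `mpirun` plus CUDA devices to actually exercise:

```python
import os

import torch
import torch.distributed as dist


def init_with_bound_device(rank: int, world_size: int) -> None:
    # Passing device_id (assumption: PyTorch >= 2.3) binds the process group
    # to this rank's GPU at init time, so NCCL collectives such as barrier()
    # no longer have to guess which device to use.
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        device_id=torch.device(f"cuda:{rank}"),
    )


if __name__ == "__main__":
    if "OMPI_COMM_WORLD_RANK" in os.environ and torch.cuda.is_available():
        init_with_bound_device(
            int(os.environ["OMPI_COMM_WORLD_RANK"]),
            int(os.environ["OMPI_COMM_WORLD_SIZE"]),
        )
        dist.barrier()  # runs on the device bound at init
        dist.destroy_process_group()
    else:
        # Fallback so the sketch is runnable outside an MPI/GPU environment.
        print("skipped: requires mpirun and CUDA devices")
```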
> Better update: note that barrier is not device-synchronizing, so logging is free to run ahead and print "exit" before the barrier actually completes

I must be missing something. I understood `barrier` is now stream-synchronizing instead of device-synchronizing. Shouldn't it still block the CPU, and therefore the second `logging.info`?
@wujingyue That's a good point: either the comment is written in a very misleading way or there's a bug, as the `allreduce` used in the barrier is done on a side `ncclStream`, so the `streamSynchronize` on the current stream wouldn't actually cause the host to wait for the `allreduce`...

I'll open a patch upstream to "fix" this and see what kind of comments we get.
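For contrast, here is a minimal single-process sketch showing the fully host-blocking behavior one might naively expect from `barrier()`. It uses the gloo backend (no GPUs needed), so the side-stream subtlety above does not apply; the function name, port number, and world size of 1 are illustrative choices, not from the original reproducer:

```python
import os
import time

import torch.distributed as dist


def host_blocking_barrier_demo() -> float:
    """Run a single-process barrier on the gloo backend and return how long
    the host was blocked inside it."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
    start = time.time()
    dist.barrier()  # with gloo, this returns only after the barrier completes on the host
    elapsed = time.time() - start
    dist.destroy_process_group()
    return elapsed


if __name__ == "__main__":
    print(f"barrier blocked the host for {host_blocking_barrier_demo():.3f}s")
```

With NCCL, by contrast, the work is enqueued on an internal stream, so the host-side return time tells you little about when the barrier actually completed on the GPU.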
Thanks, Eddie!
A separate concern: I suspect this log may fail to print the actual device being used. `guessDeviceForRank` uses the bound device if one is available, which is not necessarily `rank % numGpus`. I haven't checked whether it makes a difference in practice, but I think a more reliable approach is to instead print out the result of `guessDeviceForRank`.
@wujingyue I checked that function during debugging, and it looks like it guesses "correctly" for now, in that it matches what you were printing.
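For reference, the naive fallback mapping under discussion is just rank modulo GPU count. This is a hypothetical Python sketch, not the actual C++ helper, which prefers an explicitly bound device and only falls back to a modulo-style mapping:

```python
def guess_device_for_rank(rank: int, num_gpus: int) -> int:
    # Sketch of the naive fallback only; the real guessDeviceForRank in
    # ProcessGroupNCCL returns the bound device when one has been set.
    return rank % num_gpus


print(guess_device_for_rank(0, 8), guess_device_for_rank(9, 8))  # prints "0 1"
```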
I managed to narrow down a non-deterministic barrier hang into the following reliable reproducer.
To run this, check out `wjy/barrier` and run `_bn && python setup.py develop --debug && TORCH_CPP_LOG_LEVEL=INFO mpirun -np 3 bin/test_multidevice --gtest_filter=CommunicationTest.Barrier/NCCL`. Note that two GPUs weren't enough to reproduce this, hence `-np 3`.

I printed out the time that each rank enters and exits the barrier:
Rank 0 hangs and eventually times out.
Observations: rank 0 and rank 2 exit the same barrier at the same time, which is the expected behavior. However, rank 1 runs free and doesn't synchronize except on the first iteration.
FWIW, some warnings were printed at the beginning mentioning a potential hang. However, the guessed mapping seems to be correct, so I doubt that's the reason.
FWIW, raw data for the table above: