NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

NCCL initialization hangs with 4 GPUs, but works with 2 GPUs #216

Open mickaelseznec opened 2 months ago

mickaelseznec commented 2 months ago

Hi 👋 ,

When trying to run any NCCL application, it always hangs when running on more than 2 GPUs (see the attached logs, captured with NCCL_DEBUG=TRACE NCCL_DEBUG_SUBSYS=ALL).

The command is executed inside Docker, on an 8xH100 machine. We've successfully run the CUDA simpleP2P sample, so all GPUs seem to be working. The issue seems to lie in NCCL (we're using 2.19.3).
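For reference, the invocation inside the container looks roughly like this (the message sizes and flags here are illustrative, not our exact command):

NCCL_DEBUG=TRACE NCCL_DEBUG_SUBSYS=ALL ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 4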

Here is the log for 2 GPUs: all_reduce_2_gpus.txt. The run completes successfully and I don't see anything concerning in the logs.

For 4 GPUs (all_reduce_4_gpus.txt), the program hangs indefinitely. The final log line for every GPU is something like:

c2ce1c877ea3:1023:1032 [0] NCCL INFO NVLS Bind mem 0xac0000000 UC handle 0x7f3310ccb240 MC handle 0x7f3310ccaa20 size 1073741824

We've tried increasing shared memory with --shm-size=1g --ulimit memlock=-1, as well as various environment settings like NCCL_SHM_DISABLE=1 or NCCL_ALGO=Tree.
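For completeness, the launch looks roughly like this (the image name is just a placeholder):

docker run --rm --gpus all --shm-size=1g --ulimit memlock=-1 -e NCCL_SHM_DISABLE=1 -e NCCL_ALGO=Tree <our-image> ./build/all_reduce_perf -g 4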

Do you have any idea where to look next?

Thanks a lot :)

sjeaugey commented 2 months ago

Can you try with NCCL_NVLS_ENABLE=0?
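For example, prepended to whatever command currently hangs:

NCCL_NVLS_ENABLE=0 ./build/all_reduce_perf -g 4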

mickaelseznec commented 2 months ago

Thanks a lot @sjeaugey, the example is working now!

Any insight into the probable cause of NVLS not working? Looking at the docs, it seems that NCCL shouldn't use NVLS when it isn't available (and I also thought that setting NCCL_ALGO=Tree would disable NVLS).

sjeaugey commented 2 months ago

OK, thanks for confirming. I'm actually not sure why the NVLS Bind calls would hang; that part is outside NCCL's scope, as those calls go to CUDA.

sjeaugey commented 2 months ago

Actually, it could be because the fabric manager service isn't running. Note that if you restart it, you may need to reset all GPUs to make NVLS functional again; rebooting is usually the easiest option.
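Roughly, on a systemd-based host it would look like this (the exact service name can vary with the driver install):

systemctl status nvidia-fabricmanager
sudo systemctl restart nvidia-fabricmanager
sudo nvidia-smi --gpu-reset   # requires all GPU processes to be stopped; a reboot achieves the same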