Open mickaelseznec opened 2 months ago
Can you try with NCCL_NVLS_ENABLE=0?
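For anyone else hitting this, the suggestion above amounts to disabling NVLS (NVLink SHARP) for the run. A minimal sketch, assuming the nccl-tests all_reduce_perf binary (substitute your own application):

```shell
# Disable NVLS collectives for this run only; NCCL falls back to
# its other algorithms (Ring/Tree).
NCCL_NVLS_ENABLE=0 ./all_reduce_perf -b 8 -e 128M -f 2 -g 8
```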
Thanks a lot @sjeaugey, the example is working now!
Any insight into the probable cause of NVLS not working? Looking at the docs, it seems that NCCL doesn't use NVLS when it's not available (and I also thought that setting NCCL_ALGO=Tree would disable NVLS as well).
OK, thanks for confirming. I'm actually not sure why the NVLS Bind calls would hang; it's outside of our scope, as those calls go to CUDA.
Actually it could be because the fabricmanager service isn't running. Note that if you restart it, you may need to reset all GPUs to make NVLS functional again. Rebooting is usually the easiest option.
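A quick way to check this on the host (not inside the container), assuming a systemd-based system where the service is named nvidia-fabricmanager:

```shell
# Check whether Fabric Manager is running on the host.
systemctl status nvidia-fabricmanager

# Check the fabric state the driver reports for each GPU;
# it should read "Completed" on a healthy NVSwitch system.
nvidia-smi -q | grep -A 2 "Fabric"
```

If the service had to be (re)started, rebooting is the surest way to get NVLS working again, as noted above.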
Hi 👋 ,
When trying to run any NCCL application, it always hangs when running on more than 2 GPUs (see attached logs, captured with NCCL_DEBUG=TRACE NCCL_DEBUG_SUBSYS=ALL). The command is executed within Docker, on an 8xH100 machine. We've successfully run simpleP2P, so all GPUs seem to be working. The issue seems to lie in NCCL (we're using 2.19.3).
Here is the log for 2 GPUs: all_reduce_2_gpus.txt. It completes successfully and I don't see anything concerning in the logs.
For 4 GPUs (all_reduce_4_gpus.txt), the program hangs indefinitely. The final log line for all GPUs is something like
We've tried increasing the shmem size with --shm-size=1g --ulimit memlock=-1, and various env settings like NCCL_SHM_DISABLE=1 or NCCL_ALGO=Tree. Do you have any idea where to look next?
Thanks a lot :)
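For reference, a launch sketch combining the options mentioned above (the image name and test binary are placeholders; adjust to your setup):

```shell
# Run the 4-GPU all-reduce test inside Docker with enlarged shared
# memory, unlimited locked memory, and full NCCL trace logging.
docker run --rm --gpus all \
  --shm-size=1g --ulimit memlock=-1 \
  -e NCCL_DEBUG=TRACE -e NCCL_DEBUG_SUBSYS=ALL \
  my-nccl-image ./all_reduce_perf -b 8 -e 128M -f 2 -g 4
```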