szhengac opened this issue 2 months ago
As the error message indicates, please rerun with `NCCL_DEBUG=INFO` to get additional details. There are countless reasons why creating a communicator may fail and without the additional info it's not productive to speculate. In particular, it's not clear why you are even asking about `/dev/shm`, given that it's not mentioned anywhere in the included output? In general though, shared memory is one of the transport layers used by NCCL for communication within a node, especially if direct point-to-point communication between the GPUs is not available.
nemo_sft_15044.err.log
nemo_sft_15044.out.log
@kiskra-nvidia Thanks for your response. I have attached the logs for this training run. The reason I asked about `/dev/shm` is that I noticed a lot of shared memory being used when there is a large number of p2p NCCL torch.distributed groups; CPU memory usage also goes up a lot. So I want to understand how NCCL utilizes `/dev/shm` and CPU memory. I believe GPU p2p direct connection is enabled, as shown in the attached logs.
It looks like NCCL is running out of host memory because `NCCL_WORK_FIFO_DEPTH` is set to 4194304, which requires a 2GB allocation -- per process. So probably that's what's eating up your memory... The default value of this option is 65536, I think...
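To put a number on it (a back-of-envelope sketch; the 512-byte slot size is my assumption of `sizeof(struct ncclWork)` in pre-2.22 NCCL, not something taken from the attached logs):

```c
/* Rough check of the 2GB figure. The 512-byte slot size is an assumption
 * (sizeof(struct ncclWork) in NCCL releases before 2.22). */
#include <stdio.h>

int main(void) {
    const unsigned long long depth = 4194304ULL;   /* NCCL_WORK_FIFO_DEPTH from the job */
    const unsigned long long slot_bytes = 512ULL;  /* assumed size of one work-FIFO slot */
    unsigned long long total = depth * slot_bytes; /* host memory needed for the FIFO */
    printf("work FIFO: %llu bytes = %.1f GiB\n", total, (double)total / (1ULL << 30));
    return 0;
}
```

That comes out to 4194304 × 512 B = 2 GiB, versus 65536 × 512 B = 32 MiB at the default depth.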
I do not see `NCCL_WORK_FIFO_DEPTH` in the official NCCL documentation. Is it a hidden environment variable? Also, does it scale linearly with the number of communicators?
I don't know its full history but it appears to be an internal variable. Not all NCCL variables are documented, because some of them are meant primarily for our own development and debugging purposes and we may not want to support them long-term. Of course, anybody searching through the source code will find them, so I wouldn't exactly call them hidden :wink:. In the case of `NCCL_WORK_FIFO_DEPTH` in particular, it's gone as of NCCL 2.22. And yes, it appears that the buffer is allocated for each communicator...
OK, thanks for the clarification. If it scales linearly with the number of communicators, that would explain the memory growth. I had thought NCCL would have some sharing mechanism to reduce the memory usage.
In principle NCCL can share resources between communicators created using `ncclCommSplit` with `splitShare` set -- but it's something that has to be explicitly requested...
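At the API level that looks roughly like the following (a minimal sketch, assuming a NCCL version new enough to provide `ncclCommSplit` and the `splitShare` field in `ncclConfig_t`; whether and how `torch.distributed` exposes this path is version-dependent):

```c
/* Minimal sketch: split a parent communicator while asking NCCL to share the
 * parent's resources with the child. Assumes ncclCommSplit and the
 * splitShare config field are available (recent NCCL releases). */
#include <nccl.h>

ncclResult_t split_with_sharing(ncclComm_t parent, int color, int key,
                                ncclComm_t *child) {
    ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
    config.splitShare = 1;  /* request resource sharing with the parent communicator */
    /* Ranks passing the same color end up in the same child communicator;
     * key orders the ranks within it. */
    return ncclCommSplit(parent, color, key, child, &config);
}
```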
Hi,
I recently came across an issue when using context parallelism to split long sequences with NeMo and Transformer Engine. Context parallelism splits the sequence length across GPUs and uses p2p communications to implement a ring algorithm that accumulates the attention scores (roughly the pattern sketched at the end of this post). This creates a number of p2p PyTorch communication groups.

I checked line 2006 in `ProcessGroupNCCL.cpp` in the container. The error happens when PyTorch tries to create a new communicator. With context parallelism and dynamic sequence lengths, the p2p collectives operate on different tensor sizes across iterations. I am not sure whether the above NCCL error is related. There was plenty of GPU memory available when the error occurred. Can you please explain how NCCL utilizes shared memory (`/dev/shm`)?
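For reference, the p2p pattern I'm describing is roughly the ring exchange below (a sketch of the communication pattern only, not the actual NeMo/Transformer Engine code; the buffers, count, and communicator are placeholders):

```c
/* Per-iteration ring exchange: each rank sends its block to the next rank and
 * receives the previous rank's block. Sketch only; not NeMo/TE code. */
#include <nccl.h>
#include <cuda_runtime.h>

ncclResult_t ring_exchange(const void *sendbuf, void *recvbuf, size_t count,
                           int rank, int nranks, ncclComm_t comm,
                           cudaStream_t stream) {
    int next = (rank + 1) % nranks;
    int prev = (rank + nranks - 1) % nranks;
    ncclGroupStart();  /* group the send and recv so they can't deadlock */
    ncclSend(sendbuf, count, ncclFloat16, next, comm, stream);
    ncclRecv(recvbuf, count, ncclFloat16, prev, comm, stream);
    return ncclGroupEnd();
}
```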