Closed · daoterog closed this issue 8 months ago
Could it be that you're running out of shared memory space in /dev/shm (or didn't provide enough shared memory to the container)?
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#shared-memory
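A quick way to check how much space /dev/shm actually has inside the container is a few lines of Python run from within the job (a minimal sketch):

```python
import shutil

# /dev/shm is a tmpfs; NCCL and DataLoader workers both allocate from it.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**30:.1f} GiB, free: {free / 2**30:.1f} GiB")
```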
Thanks for the quick reply @sjeaugey!
I am currently looking into how to change the shared memory in the container (I am using Azure, and they initialize the containers themselves). I noticed that even with NCCL_DEBUG=WARN set, the NCCL WARN Error: failed to extend /dev/shm/nccl... warning never showed.
Could this still be a shared memory issue even though the warning described in the documentation doesn't appear?
We improved the detection of SHM exhaustion and the corresponding WARN message in NCCL 2.19.x.
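To get more detail than the WARN level while debugging, the NCCL log level can be raised before the process group is created; a minimal sketch (the subsystem filter is optional):

```python
import os

# NCCL reads these variables when the communicator is initialized,
# so they must be set before torch.distributed.init_process_group().
os.environ["NCCL_DEBUG"] = "INFO"             # more verbose than WARN
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,SHM"  # limit output to init and shared-memory paths
```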
Hello, I am having the same problem when training my model in AzureML. Have you fixed it yet?
Yes, @PhamVietXuan!
It was in fact a shared memory issue, combined with the distribution I was using. I discovered a shm_size parameter in the command function that sets the shared memory size when the Docker container is built. I tried passing --ulimit memlock=-1, as suggested in NVIDIA's troubleshooting page, through the docker_args argument of the command function, but it seems to be blocked. Nonetheless, setting shm_size="64g" (my total GPU memory) and changing the distribution from MPI to PyTorch did the trick.
I am still not entirely sure what changed when I switched the distribution, but things are running smoothly now!
```python
from azure.ai.ml import command, PyTorchDistribution

command_job = command(
    experiment_name="testing-ssl-byol",
    description=description,
    code=str(code_dir),
    environment=enviornment,
    inputs=inputs,
    outputs=outputs,
    command=job_command,
    compute="Testing-GPU-Cluster",
    ######################################################
    # One process per GPU on the 4x T4 node, plus a larger /dev/shm:
    distribution=PyTorchDistribution(process_count_per_instance=4),
    shm_size="64g",
    ######################################################
    tags={"project": "ssl-research", "job-purpose": "testing"},
)
```
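Submitting a job built this way typically looks like the sketch below; the workspace details are placeholders, not values from this thread:

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Placeholder workspace details; substitute your own subscription, resource group, and workspace.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

returned_job = ml_client.jobs.create_or_update(command_job)
print(returned_job.studio_url)  # link to the run in Azure ML Studio
```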
I am new to NCCL and multi-GPU training. My code ran perfectly on my laptop's GPU (a single RTX 3060), but it runs out of memory when using four GPUs. I think it may be due to a misconfiguration of my GPUs or a misuse of the DDP strategy in Lightning. I hope someone can help me debug the log messages NCCL is leaving; since they are very long, I'll paste only the logs that come from the main rank of the process.
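For reference, this is roughly how I understand a four-GPU DDP run is configured in Lightning; the model and data below are a toy sketch, not my actual training code:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    """Minimal LightningModule used only to illustrate the DDP setup."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(256, 32), torch.randn(256, 1))
    loader = DataLoader(dataset, batch_size=16, num_workers=2)

    # One process per GPU across the four T4s.
    trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp", max_epochs=1)
    trainer.fit(TinyModel(), loader)
```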
I have experienced several different errors that I think are related to memory. These are the ones I can trace back:
OSError: [Errno 28] No space left on device
RuntimeError: cuDNN error: CUDNN_STATUS_ALLOC_FAILED
torch.cuda.OutOfMemoryError: CUDA out of memory.
RuntimeError: DataLoader worker (pid 4748) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
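One mitigation often suggested for that last DataLoader error, when the shared-memory limit cannot be raised, is switching PyTorch's tensor-sharing strategy (a sketch, not something from my current setup):

```python
import torch.multiprocessing as mp

# DataLoader workers pass tensors to the main process through shared memory by default.
# "file_system" uses temporary files instead, avoiding /dev/shm exhaustion at the cost
# of some overhead; call this once, before creating any DataLoader.
mp.set_sharing_strategy("file_system")
```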
The only time it gave a different error was when I manually set NCCL_IB_DISABLE=0. It gave me:
As some additional info:
I am running a job on a cluster with four Tesla T4 GPUs; specifically, it is a Standard_NC64as_T4_v3.
I have been using Azure Containers for PyTorch and installing additional dependencies as they recommend. Below I pasted the Dockerfiles I have been using to build my environments; I commented out the second base image to avoid posting two Dockerfiles with otherwise identical content.
I checked, and the environment using CUDA 12.1 is running NCCL 2.18.3, while the one using CUDA 11.7 is running NCCL 2.17.1.
Also, I am specifying a distribution when launching the job using the command function. I understand this tells the system to use all four GPUs; nonetheless, I experienced the same issue even when I didn't specify the distribution in the command.
Here are the log messages: