NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

RuntimeError: NCCL error in ProcessGroupNCCL.cpp:290, unhandled system error #328

Open qysnn opened 4 years ago

qysnn commented 4 years ago

I came across the error RuntimeError: NCCL error in ProcessGroupNCCL.cpp:290, unhandled system error when trying to distribute neural network training across 4 GPUs on a single node with PyTorch 1.2. Following the documentation, I ran export NCCL_DEBUG=INFO and reran the code. I noticed some odd warnings in the debug output, such as

node08:27106:27302 [0] include/shm.h:27 NCCL WARN Call to shm_open failed : Permission denied
node08:27106:27302 [0] NCCL INFO include/shm.h:41 -> 2
node08:27106:27302 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-b7948f403864987a-0-3-0 (size 4460544)

before the crash. So I granted the permission with sudo chmod 777 /dev/shm and the problem went away. Maybe such an error could be handled better than just raising a warning in the debug output, so that users who run into an unhandled system error know where the problem is.
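For anyone hitting the same warning, here is a minimal sketch of a pre-flight check that could be run before launching training. It is not part of NCCL or PyTorch; the function name check_shm_dir is hypothetical, and it only assumes that shared memory segments are backed by a directory such as /dev/shm in which the current user must be able to create files.

```python
import os
import stat


def check_shm_dir(path="/dev/shm"):
    """Return a list of problems found with a shared-memory directory.

    Hypothetical helper for illustration; an empty list means no problem
    was detected for the current user.
    """
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]

    problems = []
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Creating a file inside a directory requires both write and
    # search (execute) permission on that directory.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(
            f"current user cannot create files in {path} (mode {oct(mode)})"
        )
    return problems


if __name__ == "__main__":
    for problem in check_shm_dir():
        print("WARNING:", problem)
```

If the check reports a problem, the directory permissions are the first thing to look at. On many Linux systems /dev/shm is world-writable with the sticky bit set (mode 1777), which is slightly stricter than the 777 used above.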

sjeaugey commented 4 years ago

From NCCL's perspective, the only thing we can tell is that a call to shm_open failed. The reason for that failure can vary, and we provide all the information we can get ("Permission denied" in this case, which I believe gave you a hint about what failed).

How shm_open is implemented is actually system dependent. From the shm_open man page:

NOTES
       The POSIX shared memory object implementation on Linux makes use of a
       dedicated tmpfs(5) filesystem that is normally mounted under
       /dev/shm.

So it is kind of hard for us to improve the reporting, since a more specific message could lead users down the wrong path if the error happens for a different reason. Besides, instead of the current generic wrapper around the system call, we'd need to print a special error message when the error is EACCES and the function is "shm_open" to give a hint that it might be due to /dev/shm not having proper permissions.

I guess we could have a dictionary of common errors which we go through to print a special hint, but I'd like to avoid having that in the code along with each call.
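To make the dictionary idea concrete, here is a rough sketch of what such a lookup could look like. It is written in Python for brevity and is not NCCL code (NCCL itself is C/C++); the names HINTS and describe_failure are hypothetical, and the single entry is simply the case from this issue.

```python
import errno
import os

# Hypothetical hint table keyed by (function name, errno value).
# The only entry is the case reported in this issue.
HINTS = {
    ("shm_open", errno.EACCES):
        "/dev/shm may not be writable by the current user",
}


def describe_failure(func, err):
    """Format a generic system-call failure, adding a hint if one is known."""
    message = f"Call to {func} failed : {os.strerror(err)}"
    hint = HINTS.get((func, err))
    if hint is not None:
        # The hint is only a guess, so it is worded as a possibility.
        message += f" (hint: {hint})"
    return message


if __name__ == "__main__":
    print(describe_failure("shm_open", errno.EACCES))
    print(describe_failure("shm_open", errno.ENOENT))
```

Keeping the table in one place means the generic wrapper stays generic: call sites pass only the function name and errno, and any combination without an entry produces the same message as before.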

In any case, thanks for posting this issue with the resolution; it will certainly help anyone who runs into the same problem!