Open qysnn opened 4 years ago
From NCCL's perspective, the only thing we can tell is that a call to shm_open failed. The reason for that failure can vary, and we report all the information we can get ("Permission denied" in this case, which I believe gave you a hint about what had gone wrong).
How shm_open is implemented is actually system-dependent. From the shm_open man page:
NOTES
       The POSIX shared memory object implementation on Linux makes use of a dedicated tmpfs(5) filesystem that is normally mounted under /dev/shm.
So it is hard for us to improve the reporting, since a more specific message could lead users down the wrong path if the error happens for a different reason. Besides, instead of the current generic wrapper around the system call, we would need to print a special error message when the error is EACCES and the function is "shm_open", hinting that /dev/shm might not have the proper permissions.
I guess we could have a dictionary of common errors that we go through to print a special hint, but I'd like to avoid carrying that in the code alongside each call.
In any case, thanks for posting this issue with the resolution; it will certainly help anyone who encounters this issue!
I came across this error
RuntimeError: NCCL error in ProcessGroupNCCL.cpp:290, unhandled system error
when trying to distribute neural network training across 4 GPUs on a single node with PyTorch 1.2. Following the documentation, I tried export NCCL_DEBUG=INFO
and reran the code. I noticed some weird warnings in the debug output before the crash. So I granted the permission by running
sudo chmod 777 /dev/shm
and the problem was gone. Perhaps such an error could be reported better than as a warning buried in the debug output, so that users who run into an unhandled system error know where the problem is.