Open xutianming opened 3 years ago
When training with A100 GPUs, I encountered an OOM error from the Rem Allocator. As far as I know, the remote allocator is only used for send/receive between GPUs that have no direct NVLink connection but can communicate through an intermediate GPU. In my case, however, the GPUs are fully connected through NVLink, so why did I still hit this error? cc @sjeaugey
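(For reference, whether each GPU pair really has direct peer access can be verified with a few lines of CUDA runtime code. The sketch below is a standalone illustration, not NCCL's own topology detection.)

```c
/* Hypothetical standalone check (not NCCL source): probe every GPU pair
 * for direct P2P access. NCCL should only need an intermediate GPU (and
 * hence the remote allocator) when such direct access is unavailable. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int a = 0; a < ndev; a++) {
        for (int b = 0; b < ndev; b++) {
            if (a == b) continue;
            int ok = 0;
            /* ok == 1 when device `a` can map device `b`'s memory directly
             * (e.g. over NVLink or PCIe P2P). */
            cudaDeviceCanAccessPeer(&ok, a, b);
            printf("GPU %d -> GPU %d: direct P2P %s\n", a, b, ok ? "yes" : "no");
        }
    }
    return 0;
}
```

If every pair reports direct access, the intermediate-GPU path should not normally be exercised.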
Is this issue reproducible, or did it only happen once? It seems the remote allocation thread was sent a message that was not intended for it, hence the random errors: the socket closes too soon, and the first 64 bits received -- the allocation size -- are random data, hence a value beyond the GPU memory size.
Would you have more of the log, in particular what happens to the main threads after they mistakenly communicate with the remote memory allocation thread? (The main threads are the ones printing "NCCL INFO ... via P2P/IPC/read".)
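To illustrate the failure mode, here is a minimal sketch of a size-prefixed allocation service; the framing and names are assumptions for illustration, not NCCL's actual code. If an unrelated client connects, whatever bytes it sends first get decoded as the allocation size, which almost always exceeds GPU memory:

```c
/* Sketch of an allocation thread that reads an 8-byte size off a socket,
 * then allocates that much GPU memory. A stray connection either closes
 * early or delivers garbage bytes that decode to a huge size. */
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <cuda_runtime.h>

static void serve_alloc_request(int sock) {
    uint64_t size = 0;
    ssize_t n = recv(sock, &size, sizeof(size), MSG_WAITALL);
    if (n == 0) {
        /* Peer closed the socket before sending anything:
         * "Connection closed by remote peer". */
        fprintf(stderr, "Connection closed by remote peer\n");
        return;
    }
    if (n < (ssize_t)sizeof(size)) return; /* short read or socket error */

    void *ptr = NULL;
    /* With `size` decoded from random bytes, this usually fails with OOM. */
    if (cudaMalloc(&ptr, (size_t)size) != cudaSuccess) {
        fprintf(stderr, "Rem Allocator: cuda failure 'out of memory'\n");
        return;
    }
    cudaFree(ptr);
}
```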
@sjeaugey The issue occurred rarely; I have only seen it once. The main threads continued working without further logs. So the error can be safely ignored, right?
I am trying to reproduce it with NCCL_DEBUG_SUBSYS=INIT,ALLOC for more logs.
No need to add ALLOC to the NCCL_DEBUG_SUBSYS list; that would probably make the log very large yet not more useful. Please keep NCCL_DEBUG_SUBSYS unset (default).
I'd like to see what the main thread was doing when it connected to the remote memory allocation thread, and whether any error after that point would indicate where the problem is. The only thing I can see in your screenshot is what happened just before.
These are all the NCCL logs I got, and the job continued without other errors. Since I got "Connection closed by remote peer" here, I guess there might be an unexpectedly failed socket/queue pair?
Okay, it is surprising that the job managed to continue without errors... Now, if that was not NCCL connecting to the remote allocator (say, some unrelated service connecting to the wrong port), the remote allocator would simply ignore the request, and indeed everything else would work just fine.
We have also seen spurious remote allocation requests. We're still trying to understand their source, but I've sent a PR (#599) that should provide some basic protection against them.
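One common form of such protection is a magic-number handshake; the sketch below only illustrates the idea (the constant and helper are hypothetical, and the actual change in #599 may differ):

```c
/* Sketch: require a fixed magic word before the size field, and drop any
 * connection that does not present it, so stray clients are ignored. */
#include <stdint.h>
#include <unistd.h>
#include <sys/socket.h>

#define REMALLOC_MAGIC 0x4e43434cu /* hypothetical constant ("NCCL" in ASCII) */

/* Returns 1 if the connection passed the handshake, 0 otherwise. */
static int check_handshake(int sock) {
    uint32_t magic = 0;
    if (recv(sock, &magic, sizeof(magic), MSG_WAITALL) != sizeof(magic) ||
        magic != REMALLOC_MAGIC) {
        close(sock); /* not a legitimate allocation request: ignore it */
        return 0;
    }
    return 1;
}
```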