NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

[Rem Allocator] Allocation failed #555

Open xutianming opened 3 years ago

xutianming commented 3 years ago

image

While training on A100 GPUs, I encountered an out-of-memory (OOM) error from the Rem Allocator. As far as I know, the remote allocator is only used for send/receive between GPUs that have no direct NVLink connection but can communicate through an intermediate GPU.

But in my case, the GPUs are fully connected through NVLink. Why did I still hit this error? cc @sjeaugey

image

sjeaugey commented 3 years ago

Is the issue reproducible, or did it happen only once? It seems the remote allocation thread is being sent a message that was not meant for it, and so it experiences random errors: the socket closes too soon, and the first 64 bits sent (the size) are random data, hence a value beyond the GPU memory size.

Would you have more of the log, in particular what happens on the main threads after they mistakenly communicate with the remote memory allocation thread? (The main threads are the ones printing "NCCL INFO ... via P2P/IPC/read".)
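The failure mode described above can be illustrated with a small sketch. It assumes, purely for illustration, that a request begins with an 8-byte little-endian size field; the function name `read_requested_size`, the 40 GiB capacity, and the sample payload are hypothetical and not taken from NCCL.

```python
import struct

# Assumption for illustration: 40 GiB of device memory.
GPU_MEMORY_BYTES = 40 * 1024**3


def read_requested_size(payload: bytes) -> int:
    """Interpret the first 64 bits of a message as an unsigned size."""
    if len(payload) < 8:
        # Mirrors the "socket closing too soon" case: not enough bytes arrived.
        raise ValueError("connection closed before a full size field arrived")
    (size,) = struct.unpack("<Q", payload[:8])
    return size


# Bytes from an unrelated client (here, the start of an HTTP request).
stray = b"GET / HTTP/1.1\r\n"
size = read_requested_size(stray)
print(size > GPU_MEMORY_BYTES)  # True
```

Almost any eight arbitrary bytes decode to a number far larger than the device memory, which is consistent with an allocation failure triggered by a message that was never meant for the allocator.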

xutianming commented 3 years ago

@sjeaugey The issue occurs rarely; I have only hit it once. The main threads continued working and produced no further logs. So the error can safely be ignored, right?

I am trying to reproduce it with NCCL_DEBUG_SUBSYS=INIT,ALLOC to get more logs.

sjeaugey commented 3 years ago

No need to add ALLOC to the NCCL_DEBUG_SUBSYS list; that would probably make the log very large without making it more useful. Please keep NCCL_DEBUG_SUBSYS unset (the default).

I'd like to see what the main thread was doing when it connected to the remote mem alloc thread, and whether any error after that would indicate where it was. The only thing I see in your screenshot is what happened just before.
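A minimal sketch of that setup, assuming the training job is launched from Python. `NCCL_DEBUG=INFO` is assumed here because the log lines in this thread are "NCCL INFO" messages; the helper name `nccl_debug_env` is hypothetical.

```python
import os


def nccl_debug_env(base_env):
    """Return a copy of base_env with NCCL info logging on and no subsystem filter."""
    env = dict(base_env)
    env["NCCL_DEBUG"] = "INFO"
    # Leave NCCL_DEBUG_SUBSYS unset so NCCL uses its default subsystem list.
    env.pop("NCCL_DEBUG_SUBSYS", None)
    return env


env = nccl_debug_env(os.environ)
# Pass `env` to the training process, e.g. subprocess.run(cmd, env=env),
# where `cmd` is your own launch command.
```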

xutianming commented 3 years ago

image

These are all the NCCL logs I got, and the job continued without other errors. Since I got "Connection closed by remote peer" here, I guess there might be an unexpectedly failed socket/queue pair?

sjeaugey commented 3 years ago

Okay, it is surprising that the job managed to continue without errors... Now, if it is not NCCL connecting to the remote allocator (for example, some other service connecting to the wrong port), the remote allocator will simply ignore the request, and indeed everything else will work just fine.

chr1sj0nes commented 2 years ago

We have also seen spurious remote allocation requests. We're trying to understand the source of these, but I've sent a PR (#599) that should provide some basic protection against it.
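One generic way to guard a listener against stray connections is to require a magic value and a sanity-checked size in every request. The sketch below shows that general technique only; it is not the content of #599, and the header layout, `MAGIC`, `MAX_ALLOC_BYTES`, and both function names are hypothetical.

```python
import struct
from typing import Optional

# Hypothetical values, chosen only for this sketch.
MAGIC = 0x4E43434C52454D41
MAX_ALLOC_BYTES = 40 * 1024**3
HEADER = struct.Struct("<QQ")  # magic, requested size


def make_request(size: int) -> bytes:
    """Build a request header as a well-behaved client would."""
    return HEADER.pack(MAGIC, size)


def parse_request(data: bytes) -> Optional[int]:
    """Return the requested size, or None if the request should be ignored."""
    if len(data) < HEADER.size:
        return None  # truncated header, e.g. the peer closed early
    magic, size = HEADER.unpack(data[:HEADER.size])
    if magic != MAGIC or size == 0 or size > MAX_ALLOC_BYTES:
        return None  # not from a matching client, or an implausible size
    return size


print(parse_request(make_request(4096)))          # 4096
print(parse_request(b"GET / HTTP/1.1\r\n\r\n"))   # None
```

With a check like this, a connection from an unrelated service is dropped before any allocation is attempted, rather than being treated as a request for an arbitrary number of bytes.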