NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Support fast failure when RDMA write fails #1434

Open alpha-baby opened 2 weeks ago

alpha-baby commented 2 weeks ago

log

:522:622 [2] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer 10.1.15.233<60403> with error 4, opcode 32601, len 32600, vendor err 81 (Send) localGid ::ffff:10.1.77.5 remoteGid ::ffff:10.1.13.161
:4654:5666 [0] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer 10.3.0.61<39430> with error 12, opcode 129, len 0, vendor err 129 (Recv) localGid ::ffff:10.3.16.3 remoteGid ::ffff:10.3.16.61

In our large-model training, we found that when an RDMA error occurs, the entire training task hangs until the watchdog times out and exits, at which point the PyTorch framework also exits. This waiting wastes a lot of time.

I suggest adding a switch that exits the program directly after an RDMA error occurs, avoiding the unnecessary wait. The abnormal process exit will be caught by the training fault-tolerance framework, which will automatically restart the training task.
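For illustration, a minimal sketch of what such a switch could look like is below. The environment variable name `NCCL_IB_FAIL_FAST`, the helper `ncclIbHandleWcError`, and its placement in the completion-error path are all hypothetical assumptions, not part of NCCL.

```cpp
// Hypothetical sketch of the proposed "fail fast" switch; not NCCL code.
#include <cstdlib>
#include <cstdio>

// Imagined hook called from the IB transport when a work completion reports
// an error (the condition that produces the "Got completion ... with error"
// warning in the log above).
static void ncclIbHandleWcError(int wcStatus) {
  const char* failFast = getenv("NCCL_IB_FAIL_FAST");  // hypothetical switch
  if (failFast && atoi(failFast) == 1) {
    fprintf(stderr,
            "NCCL NET/IB: completion error %d, exiting (fail-fast enabled)\n",
            wcStatus);
    // Exit immediately so the fault-tolerance framework can restart the job
    // instead of waiting for the collective to time out.
    abort();
  }
  // Otherwise fall back to the existing behavior: report the error and let it
  // surface asynchronously.
}
```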

sjeaugey commented 2 weeks ago

That should already be the case: the timeout should propagate through ncclCommGetAsyncError, which the framework should trap to stop the application.
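As an illustration of that path, here is a minimal sketch of how a framework could poll ncclCommGetAsyncError and tear the communicator down with ncclCommAbort instead of waiting indefinitely. The `watchdogLoop` function, the single-communicator assumption, and the one-second polling interval are illustrative choices, not something prescribed by NCCL.

```cpp
// Sketch: a framework-side watchdog that surfaces NCCL asynchronous errors.
#include <nccl.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>

// Run this on a dedicated thread alongside the training loop.
void watchdogLoop(ncclComm_t comm) {
  while (true) {
    ncclResult_t asyncErr = ncclSuccess;
    ncclResult_t ret = ncclCommGetAsyncError(comm, &asyncErr);
    if (ret != ncclSuccess || asyncErr != ncclSuccess) {
      fprintf(stderr, "NCCL async error %d detected, aborting communicator\n",
              (int)(ret != ncclSuccess ? ret : asyncErr));
      // ncclCommAbort unblocks pending NCCL operations and frees resources,
      // so the process can exit promptly and be restarted by the
      // fault-tolerance layer.
      ncclCommAbort(comm);
      std::exit(EXIT_FAILURE);
    }
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}
```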