```
:522:622 [2] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer 10.1.15.233<60403> with error 4, opcode 32601, len 32600, vendor err 81 (Send) localGid ::ffff:10.1.77.5 remoteGid ::ffff:10.1.13.161
:4654:5666 [0] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer 10.3.0.61<39430> with error 12, opcode 129, len 0, vendor err 129 (Recv) localGid ::ffff:10.3.16.3 remoteGid ::ffff:10.3.16.61
```
In our large-model training runs, we found that when an RDMA error occurs, the entire training task hangs until the watchdog times out and exits, at which point the PyTorch framework exits as well. This pointless waiting wastes a lot of time.

I suggest adding a switch that makes the program exit immediately after an RDMA error occurs, avoiding the unnecessary wait. Since the abnormal exit is detected by our training fault-tolerance framework, the training task is restarted automatically.
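A minimal sketch of what such a switch could look like in the IB completion-error path. The environment variable name `NCCL_IB_EXIT_ON_ERROR` and the helper functions here are hypothetical illustrations, not actual NCCL APIs; the real change would live near the `net_ib.cc:1295` warning shown in the logs above:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical switch name -- NOT an existing NCCL variable.
 * Unset or "0" keeps today's behavior; "1" enables fail-fast. */
static int exitOnIbError(void) {
  const char *v = getenv("NCCL_IB_EXIT_ON_ERROR");
  return v != NULL && strcmp(v, "1") == 0;
}

/* Illustrative handler for the spot where the completion-queue
 * polling code finds a work completion with an error status. */
static void handleIbCompletionError(int status, int vendorErr) {
  fprintf(stderr,
          "NET/IB : Got completion with error %d, vendor err %d\n",
          status, vendorErr);
  if (exitOnIbError()) {
    /* Fail fast: the abnormal exit is seen by the fault-tolerance
     * framework, which restarts the job, instead of every rank
     * waiting for the watchdog timeout. */
    exit(1);
  }
  /* Default path: report the error and let upper layers decide,
   * which today can end in a hang until the watchdog fires. */
}
```

The switch defaults to off so existing deployments are unaffected; only clusters that run under a fault-tolerance framework would opt in.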