NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL failure caused by NET/IB completion error #1405

Open thomasbarrett opened 2 months ago

thomasbarrett commented 2 months ago

I am experiencing occasional NCCL operation failures caused by the following IB completion error. What is the root cause of this error? What steps should I take to reduce (or eliminate) the frequency of this error?

NET/IB : Got completion from peer with error 5, opcode 0, len 5384, vendor err 249 (Send)

Also, where can I find documentation for these error codes? I am assuming the vendor errors are vendor-specific. I am using an SR-IOV VF on a ConnectX-7 HCA.
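For reference, the numeric "error N" in the NET/IB message appears to correspond to the verbs work-completion status (the ibv_wc_status enum from libibverbs), while the vendor err value looks to be HCA/firmware specific. A minimal sketch for translating the status number into its libibverbs name, assuming that mapping holds:

```python
# Sketch: translate the "error N" field from a NET/IB completion message into
# the corresponding ibv_wc_status name from <infiniband/verbs.h> (libibverbs).
# Assumption: the number NCCL prints is the raw work-completion status value.
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS",            # 0
    "IBV_WC_LOC_LEN_ERR",        # 1
    "IBV_WC_LOC_QP_OP_ERR",      # 2
    "IBV_WC_LOC_EEC_OP_ERR",     # 3
    "IBV_WC_LOC_PROT_ERR",       # 4
    "IBV_WC_WR_FLUSH_ERR",       # 5  <- the "error 5" in the log above
    "IBV_WC_MW_BIND_ERR",        # 6
    "IBV_WC_BAD_RESP_ERR",       # 7
    "IBV_WC_LOC_ACCESS_ERR",     # 8
    "IBV_WC_REM_INV_REQ_ERR",    # 9
    "IBV_WC_REM_ACCESS_ERR",     # 10
    "IBV_WC_REM_OP_ERR",         # 11
    "IBV_WC_RETRY_EXC_ERR",      # 12
    "IBV_WC_RNR_RETRY_EXC_ERR",  # 13
    # ... the enum continues up to IBV_WC_GENERAL_ERR (21)
]

def wc_status_name(code: int) -> str:
    """Return the enum name for a work-completion status code."""
    return IBV_WC_STATUS[code] if 0 <= code < len(IBV_WC_STATUS) else f"unknown ({code})"

print(wc_status_name(5))  # -> IBV_WC_WR_FLUSH_ERR
```

If that mapping is right, error 5 is IBV_WC_WR_FLUSH_ERR, which usually means the work request was flushed because the queue pair had already entered the error state, i.e. the original failure happened earlier on that connection.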

thomasbarrett commented 2 months ago

Found a duplicate issue here, so feel free to just close this one. I will reach out to the vendor about a potential firmware fix. Curious whether anyone has seen this error and actually had it resolved.

Xiaoaier-Z-L commented 2 months ago

I also encountered the same issue, and there was a vendor error 129 before the vendor error 249; error 129 indicates a timeout. I investigated ACS and found it was turned off, and upon checking the VF traffic no issues were detected, nor were there any logs indicating the PF was down. Eventually, I found in the host's system logs that the Pingmesh program ran out of memory (OOM) at the same time. We determined that the problem might be due to Pingmesh getting stuck while releasing RDMA resources after the OOM, and we are currently monitoring the situation after fixing the OOM issue. (Pingmesh is an agent+controller used to test the latency of the entire RDMA fabric of the switches in real time.)

paras-genmo commented 1 month ago

I am also experiencing similar NCCL operation failures during distributed training on a cluster with H100 GPUs.

We receive the following InfiniBand (IB) completion errors:

NET/IB : Got completion from peer 10.1.x.x<port> with error 5, opcode 0, len xxxx, vendor err 249 (Send)

Here are some specific examples from our logs:

NET/IB : Got completion from peer 10.1.x.x<47617> with error 5, opcode 0, len 5373, vendor err 249 (Send)
NET/IB : Got completion from peer 10.1.x.x<56791> with error 5, opcode 0, len 5277, vendor err 249 (Send)
NET/IB : Got completion from peer 10.1.x.x<42803> with error 5, opcode 0, len 5380, vendor err 249 (Send)
NET/IB : Got completion from peer 10.1.x.x<51229> with error 5, opcode 0, len 5334, vendor err 249 (Send)

These errors are accompanied by NCCL failures:

NCCL error: remote process exited or there was a network error, NCCL version 2.21.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.

We also observe NCCL watchdog timeouts:

ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives.
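One way to gather more context before the watchdog fires is to widen the process-group timeout and enable NCCL's network debug logging. A minimal sketch, assuming a standard torch.distributed setup; the timeout value and log path below are arbitrary illustrative choices, and defaults vary by PyTorch/NCCL version:

```python
# Sketch, assuming a standard torch.distributed launch: widen the collective
# timeout and turn on NCCL network logging so the failing operation is visible
# before ProcessGroupNCCL's watchdog gives up. Values below are illustrative.
import os
from datetime import timedelta

import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")                      # NCCL log verbosity
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")           # focus on init + network transport
os.environ.setdefault("NCCL_DEBUG_FILE", "/tmp/nccl.%h.%p.log")  # per-host, per-pid log files

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=30),  # more headroom than the default before collectives time out
)
```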

Our environment information:

GPUs: 8 x NVIDIA H100 80GB HBM3
GPU Driver Version: 535.183.01
CUDA Version: 12.2
NCCL Version: 2.21.5
IB Model: Mellanox ConnectX-5 (MT4126)
Firmware Version: 28.37.1014
Ports: 8 active ports at 400 Gbps (InfiniBand), 1 Ethernet port at 200 Gbps
Network Configuration: SR-IOV enabled, mlx5 Virtual Functions in use
Mellanox OFED Version: MLNX_OFED_LINUX-24.01-0.3.3.1

AddyLaddy commented 1 month ago

Whenever I see that error, my first suggestion is to disable ACS:

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs
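If you want to automate that check across nodes, here is a rough sketch that scans lspci output for devices whose ACS control bits are still enabled; it assumes pciutils is installed, and reading the capability normally requires root:

```python
# Sketch: flag PCI devices whose ACS control bits are still enabled, following
# the check described in the NCCL troubleshooting page linked above.
# Assumes pciutils (lspci) is installed; reading capabilities normally needs root.
import re
import subprocess

lspci = subprocess.run(["lspci", "-vvv"], capture_output=True, text=True, check=True).stdout

device = None
for line in lspci.splitlines():
    if line and not line[0].isspace():
        device = line.split()[0]              # bus:dev.fn, e.g. "17:00.0"
    match = re.search(r"ACSCtl:(.*)", line)
    if match and "+" in match.group(1):       # any "+" flag means an ACS feature is enabled
        print(f"{device}: {match.group(1).strip()}")
```

Any device this scan reports can then be handled per the instructions on the linked troubleshooting page.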