NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

local access violation work queue error when upgrade to v2.20.3-1 #1524

Open gangxie112 opened 4 days ago

gangxie112 commented 4 days ago

hello,

I'm running the NCCL tests with my mlx 455 NIC. After upgrading NCCL to v2.20.3-1, the tests fail with the errors below; all earlier versions work. Is there a breaking change related to RDMA in this version? How can I use the latest version? Are there any compatibility requirements?

`ubuntu20-server-2:94639:94647 [0] transport/net_ib.cc:100 NCCL WARN NET/IB : mlx5_0:1 Got async event : local access violation work queue error

ubuntu20-server-2:94639:94647 [0] transport/net_ib.cc:100 NCCL WARN NET/IB : mlx5_0:1 Got async event : local access violation work queue error

ubuntu20-server-2:94639:94652 [0] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 192.168.1.2<57888> with status=5 opcode=1 len=0 vendor err 249 (Recv) localGid fe80::e61d:2dff:fef2:9c94 remoteGidsfe80::e61d:2dff:fef2:9fa0

ubuntu20-server-2:94639:94652 [0] NCCL INFO transport/net.cc:1298 -> 6

ubuntu20-server-2:94639:94652 [0] NCCL INFO proxy.cc:694 -> 6

ubuntu20-server-2:94639:94652 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]`

My NIC:

CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.26.1040
    Hardware version: 0
    Node GUID: 0xe41d2d0300f29fa0
    System image GUID: 0xe41d2d0300f29fa0
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xe61d2dfffef29fa0
        Link layer: Ethernet

sjeaugey commented 3 days ago

It's a bit weird, but maybe older versions were not using GPU Direct RDMA, and more recent ones are trying to use it. And GPU Direct RDMA is broken because ACS is enabled on your system?

Can you set NCCL_NET_GDR_LEVEL=0 and see if it makes the problem disappear?

gangxie112 commented 3 days ago

NCCL_NET_GDR_LEVEL=0

  1. With NCCL_NET_GDR_LEVEL=0, it still does not work.
  2. Upgrading the mlx driver to the latest version does not help either.
  3. I dumped the traffic and compared it with a capture from a working run: the failing case has no Global Route Header. Not sure whether this is related.

Not work:

Frame 278: 1098 bytes on wire (8784 bits), 1098 bytes captured (8784 bits)
Ethernet II, Src: MellanoxTech_f2:9f:a0 (e4:1d:2d:f2:9f:a0), Dst: MellanoxTech_f2:9c:94 (e4:1d:2d:f2:9c:94)
Internet Protocol Version 4, Src: 192.168.1.2, Dst: 192.168.1.3
User Datagram Protocol, Src Port: 58239, Dst Port: 4791
InfiniBand
    Base Transport Header
        Opcode: Reliable Connection (RC) - RDMA WRITE First (6)
        0... .... = Solicited Event: False
        .1.. .... = MigReq: True
        ..00 .... = Pad Count: 0
        .... 0000 = Header Version: 0
        Partition Key: 65535
        Reserved: 00
        Destination Queue Pair: 0x00033e
        0... .... = Acknowledge Request: False
        .000 0000 = Reserved (7 bits): 0
        Packet Sequence Number: 0
    RETH - RDMA Extended Transport Header
        Virtual Address: 0x00007f4c05930000
        Remote Key: 0x0018006c
        DMA Length: 524288 (0x00080000)
    Invariant CRC: 0x3c94db23
Data (1024 bytes)

Work:

Frame 279: 1110 bytes on wire (8880 bits), 1110 bytes captured (8880 bits)
Ethernet II, Src: MellanoxTech_f2:9f:a0 (e4:1d:2d:f2:9f:a0), Dst: MellanoxTech_f2:9c:94 (e4:1d:2d:f2:9c:94)
InfiniBand
    Global Route Header
        0110 .... = IP Version: 6
        .... 0000 0010 .... = Traffic Class: 2
        .... .... .... 0000 0000 0000 0000 0000 = Flow Label: 0
        Payload Length: 1056
        Next Header: 27
        Hop Limit: 255
        Source GID: fe80::e61d:2dff:fef2:9fa0
        Destination GID: fe80::e61d:2dff:fef2:9c94
    Base Transport Header
        Opcode: Reliable Connection (RC) - RDMA WRITE First (6)
        0... .... = Solicited Event: False
        .1.. .... = MigReq: True
        ..00 .... = Pad Count: 0
        .... 0000 = Header Version: 0
        Partition Key: 65535
        Reserved: 00
        Destination Queue Pair: 0x0003a6
        0... .... = Acknowledge Request: False
        .000 0000 = Reserved (7 bits): 0
        Packet Sequence Number: 0
    RETH - RDMA Extended Transport Header
        Virtual Address: 0x00007ff407930000
        Remote Key: 0x00180084
        DMA Length: 524288 (0x00080000)
    Invariant CRC: 0xbc200433
Data (1024 bytes)

gangxie112 commented 3 days ago

Attaching the log for more information: nccl.log

sjeaugey commented 3 days ago

Hum, maybe it's tied to ECE being added in 2.19, for which your system is misconfigured.

Could you try with NCCL 2.23 and set NCCL_ECE_ENABLE=0?

sjeaugey commented 3 days ago

Looks like query_ece seems to think it is supported here:

ubuntu20-server-2:3782:3793 [0] NCCL INFO NET/IB: NCCL Dev 0 IbDev 0 Port 1 qpn 430 mtu 3 query_ece={supported=1, vendor_id=0x15b3, options=0x0, comp_mask=0x0} GID 0 (80FE/949CF2FEFF2D1DE6) fifoRkey=0x178c7a fifoLkey=0x178c7a

But then it fails:

ubuntu20-server-2:3782:3793 [0] NCCL INFO Call to ibv_set_ece failed with error Operation not supported errno 95

So later we see ECE not being supported:

ubuntu20-server-2:3782:3793 [0] NCCL INFO NET/IB: IbDev 0 Port 1 qpn 419 set_ece={supported=0, vendor_id=0x0, options=0x0, comp_mask=0x0}

Not sure what's happening exactly but it looks like a probable root cause.

gangxie112 commented 3 days ago

That is not the root cause. I noticed this failure at the very beginning, so I compared against a run that worked and found the same message there as well. The attached log is from the working run: nccl-ok.log

gangxie112 commented 3 days ago

The GID difference in the log is not an issue either. I tried all the GIDs.

gangxie112 commented 2 days ago

After tracking down the commit, I found that b6475625fbcaa2c3c0e50eed2fa1255d7514d4a2 introduced the issue (the previous one, b6d7438d3145a619f924dbbca6c96db21fab716e, is OK). @sjeaugey, this commit seems to contain a lot of changes; any idea about the possible cause?

gangxie112 commented 2 days ago

@sjeaugey I think I found the root cause after reviewing the diff between the two commits mentioned above. The newer commit no longer adjusts the MTU. My two servers were misconfigured with different MTUs. After correcting this, it works.

So why was the MTU adjustment removed? I suggest at least logging a warning when the two sides report different MTUs.

-  // Adjust the MTU
-  remQpInfo.mtu = (enum ibv_mtu)std::min(remQpInfo.mtu, portAttr.active_mtu);
+  // Copy remDevInfo for things like remGidInfo, remFifoAddr, etc.
+  for (int i = 0; i < remMeta.ndevs; i++) {
+    rComm->base.remDevs[i] = remMeta.devs[i];
+    rComm->base.remDevs[i].remoteGid.global.interface_id  = rComm->base.remDevs[i].iid;
+    rComm->base.remDevs[i].remoteGid.global.subnet_prefix = rComm->base.remDevs[i].spn;
+  }
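
For illustration, the adjustment removed in the diff, together with the warning suggested above, could look roughly like the following. This is only a sketch, not NCCL code: `Mtu` is a stand-in for `enum ibv_mtu` from `<infiniband/verbs.h>` (redefined here so the snippet compiles without libibverbs), and `negotiateMtu`, `mtuBytes`, and the warning text are hypothetical names chosen for this example.

```cpp
#include <algorithm>
#include <cstdio>

// Stand-in for enum ibv_mtu; the numeric values mirror the verbs header
// (e.g. the "mtu 3" seen in the NCCL log corresponds to 1024 bytes).
enum Mtu { MTU_256 = 1, MTU_512 = 2, MTU_1024 = 3, MTU_2048 = 4, MTU_4096 = 5 };

// Convert the enum value to a size in bytes: 256 for 1, doubling each step.
int mtuBytes(Mtu m) { return 128 << static_cast<int>(m); }

// Hypothetical helper: pick the smaller of the remote and local MTU, as the
// removed line did with std::min, and warn when the two sides disagree.
// *mismatch is set so that callers can react to the misconfiguration.
Mtu negotiateMtu(Mtu remote, Mtu local, bool* mismatch) {
  *mismatch = (remote != local);
  if (*mismatch) {
    std::fprintf(stderr,
                 "WARN NET/IB : MTU mismatch, remote %d bytes vs local %d bytes; "
                 "using the smaller value\n",
                 mtuBytes(remote), mtuBytes(local));
  }
  return std::min(remote, local);
}
```

With a remote MTU of `MTU_4096` and a local MTU of `MTU_1024`, the helper returns `MTU_1024` and prints one warning to stderr; with equal values it stays silent. Whether NCCL should fall back to the smaller value or simply report the mismatch is a design decision for the maintainers.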