NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

GDR test with NCCL in a virtual machine encounters an error #660

Open · lixuehui08 opened this issue 2 years ago

lixuehui08 commented 2 years ago

We tested GDR with NCCL on bare metal successfully, as shown below (PCIe ACS has been disabled):

instance-ubm6ko9y:163876:163916 [4] NCCL INFO Channel 02/0 : 13[b2000] -> 4[b1000] [receive] via NET/IB/0/GDRDMA
instance-ubm6ko9y:163876:163918 [6] NCCL INFO Connection to proxy localRank 6 -> connection 0x7fa0f0000af8
instance-ubm6ko9y:163876:163930 [5] NCCL INFO NET/IB: Dev 0 Port 1 qpn 1839 mtu 5 LID 5
instance-o0hxpzzr:151270:151321 [0] NCCL INFO New proxy recv connection 5 from 172.16.32.5<52098>, transport 0
instance-ubm6ko9y:163876:163918 [6] NCCL INFO Channel 03/0 : 15[dc000] -> 6[da000] [receive] via NET/IB/1/GDRDMA
instance-ubm6ko9y:163876:163919 [7] NCCL INFO Connection to proxy localRank 7 -> connection 0x7fa0f8000af8
instance-ubm6ko9y:163876:163915 [3] NCCL INFO Channel 02 : 3[62000] -> 2[60000] via P2P/direct pointer
instance-ubm6ko9y:163876:163934 [3] NCCL INFO New proxy send connection 7 from 172.16.32.4<37756>, transport 0
instance-ubm6ko9y:163876:163913 [1] NCCL INFO Channel 03 : 1[1c000] -> 0[1b000] via P2P/direct pointer
instance-ubm6ko9y:163876:163931 [1] NCCL INFO New proxy send connection 7 from 172.16.32.4<53610>, transport 0
instance-ubm6ko9y:163876:163919 [7] NCCL INFO Channel 03/0 : 7[dc000] -> 14[da000] [send] via NET/IB/1/GDRDMA
instance-o0hxpzzr:151270:151306 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7fe3a8000af8
instance-ubm6ko9y:163876:163913 [1] NCCL INFO Connection to proxy localRank 1 -> connection 0x7fa108000b68
and the all_reduce_perf bandwidth table (columns: size in bytes, element count, data type, reduction op, then time in us, algorithm bandwidth and bus bandwidth in GB/s, and max error, once for the out-of-place run and once for the in-place run):
      131072         32768     float     sum    129.6    1.01    1.90  2e-07    129.9    1.01    1.89  2e-07
      262144         65536     float     sum    144.9    1.81    3.39  2e-07    144.4    1.82    3.40  2e-07
      524288        131072     float     sum    174.8    3.00    5.62  2e-07    171.6    3.05    5.73  2e-07
     1048576        262144     float     sum    200.4    5.23    9.81  2e-07    200.9    5.22    9.78  2e-07
     2097152        524288     float     sum    256.5    8.18   15.33  2e-07    254.9    8.23   15.42  2e-07
     4194304       1048576     float     sum    373.4   11.23   21.06  2e-07    336.5   12.46   23.37  2e-07
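
For context, these numbers are all_reduce_perf output from nccl-tests. A typical two-node launch looks roughly like the sketch below; the hostnames and flag values are inferred from the logs above, not the exact command used:

NCCL_DEBUG=INFO mpirun -np 2 -H instance-ubm6ko9y,instance-o0hxpzzr \
  -x NCCL_DEBUG -x LD_LIBRARY_PATH \
  ./build/all_reduce_perf -b 128K -e 4M -f 2 -g 8

Here -b/-e/-f sweep the message sizes shown in the table, and -g 8 drives all eight GPUs from a single process per node, which matches the single PID per host seen in the logs.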

But when we ran the same test in virtual machines, we encountered the error below (pcie acs capability not emulated):

vm1:58934:58944 [0] NCCL INFO Channel 00/0 : 1[84000] -> 0[84000] [send] via NET/IB/0/GDRDMA
vm1:58934:58944 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 84000 / HCA 0 (distance 3 <= 4), read 1
vm1:58934:58945 [0] NCCL INFO New proxy send connection 11 from 192.168.0.12<55744>, transport 2
vm2:15364:15374 [0] NCCL INFO Channel 09/0 : 1[84000] -> 0[84000] [receive] via NET/IB/0/GDRDMA
vm2:15364:15374 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 84000 / HCA 0 (distance 3 <= 4), read 1
vm1:58934:58944 [0] NCCL INFO Channel 07/0 : 1[84000] -> 0[84000] [send] via NET/IB/0/GDRDMA
vm1:58934:58944 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 84000 / HCA 0 (distance 3 <= 4), read 1
vm2:15364:15374 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7fc02c000e10
vm1:58934:58945 [0] NCCL INFO New proxy send connection 18 from 192.168.0.12<55744>, transport 2
vm2:15364:15374 [0] NCCL INFO Channel 06/0 : 0[84000] -> 1[84000] [send] via NET/IB/0/GDRDMA

and:

vm2:15364:15375 [0] NCCL INFO transport/net.cc:377 Cuda Alloc Size 33554432 pointer 0x7fbe9c000000
vm2:15364:15364 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fbfac000000 recvbuff 0x7fbf2c000000 count 536870912 datatype 7 op 0 root 0 comm 0x7fc030000d00 [nranks=2] stream 0x4de7c00
vm2:15364:15364 [0] NCCL INFO Launch mode Parallel
vm2:15364:15364 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7fbfac000000 recvbuff 0x7fbf2c000000 count 536870912 datatype 7 op 0 root 0 comm 0x7fc030000d00 [nranks=2] stream 0x4de7c00
vm2:15364:15364 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7fbfac000000 recvbuff 0x7fbf2c000000 count 536870912 datatype 7 op 0 root 0 comm 0x7fc030000d00 [nranks=2] stream 0x4de7c00
vm2:15364:15364 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7fbfac000000 recvbuff 0x7fbf2c000000 count 536870912 datatype 7 op 0 root 0 comm 0x7fc030000d00 [nranks=2] stream 0x4de7c00
vm2:15364:15364 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7fbfac000000 recvbuff 0x7fbf2c000000 count 536870912 datatype 7 op 0 root 0 comm 0x7fc030000d00 [nranks=2] stream 0x4de7c00
vm1:58934:58945 [0] NCCL INFO transport/net_ib.cc:657 Ib Alloc Size 21688 pointer 0x7f15f8268000
vm1:58934:58945 [0] NCCL INFO transport/net_ib.cc:671 Ib Alloc Size 552 pointer 0x7f15f826f000
vm1:58934:58945 [0] NCCL INFO transport/net_ib.cc:746 Ib Alloc Size 552 pointer 0x7f15f826f000
vm1:58934:58945 [0] NCCL INFO transport/net.cc:669 Cuda Alloc Size 9633792 pointer 0x7f15eea00000
vm1:58934:58945 [0] NCCL INFO transport/net.cc:673 Cuda Host Alloc Size 8192 pointer 0x7f15ff594000
vm1:58934:58945 [0] NCCL INFO transport/net_ib.cc:657 Ib Alloc Size 21688 pointer 0x7f15f828b000
vm1:58934:58945 [0] NCCL INFO transport/net_ib.cc:671 Ib Alloc Size 552 pointer 0x7f15f8292000
vm1:58934:58945 [0] NCCL INFO transport/net_ib.cc:746 Ib Alloc Size 552 pointer 0x7f15f8292000
vm1:58934:58945 [0] NCCL INFO transport/net.cc:669 Cuda Alloc Size 9633792 pointer 0x7f15ef400000
vm1:58934:58945 [0] NCCL INFO transport/net.cc:673 Cuda Host Alloc Size 8192 pointer 0x7f15ff596000
vm1:58934:58945 [0] NCCL INFO transport/net_ib.cc:657 Ib Alloc Size 21688 pointer 0x7f15f82ae000
vm1:58934:58945 [0] NCCL INFO transport/net_ib.cc:671 Ib Alloc Size 552 pointer 0x7f15f82b5000
vm1:58934:58945 [0] NCCL INFO transport/net_ib.cc:746 Ib Alloc Size 552 pointer 0x7f15f82b5000
vm1:58934:58945 [0] NCCL INFO transport/net.cc:669 Cuda Alloc Size 9633792 pointer 0x7f15ec000000
vm1:58934:58945 [0] NCCL INFO transport/net.cc:673 Cuda Host Alloc Size 8192 pointer 0x7f15ff598000
vm1:58934:58944 [0] NCCL INFO Connected all rings
vm1:58934:58944 [0] NCCL INFO Connected all trees
vm1:58934:58944 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
vm1:58934:58944 [0] NCCL INFO channel.cc:20 Cuda Alloc Size 8 pointer 0x7f1606606400
vm1:58934:58944 [0] NCCL INFO channel.cc:24 Cuda Alloc Size 1824 pointer 0x7f1606606600
vm1:58934:58944 [0] NCCL INFO channel.cc:34 Cuda Host Alloc Size 1048576 pointer 0x7f15eca00000
vm1:58934:58944 [0] NCCL INFO channel.cc:20 Cuda Alloc Size 8 pointer 0x7f1606606e00
vm1:58934:58944 [0] NCCL INFO channel.cc:24 Cuda Alloc Size 1824 pointer 0x7f1606607000
vm1:58934:58944 [0] NCCL INFO channel.cc:34 Cuda Host Alloc Size 1048576 pointer 0x7f15ecb00000
vm1:58934:58944 [0] NCCL INFO channel.cc:20 Cuda Alloc Size 8 pointer 0x7f1606607800
vm1:58934:58944 [0] NCCL INFO channel.cc:24 Cuda Alloc Size 1824 pointer 0x7f1606607a00
vm1:58934:58944 [0] NCCL INFO channel.cc:34 Cuda Host Alloc Size 1048576 pointer 0x7f15ecc00000
vm1:58934:58944 [0] NCCL INFO channel.cc:20 Cuda Alloc Size 8 pointer 0x7f1606608200
vm1:58934:58944 [0] NCCL INFO channel.cc:24 Cuda Alloc Size 1824 pointer 0x7f1606608400
vm1:58934:58944 [0] NCCL INFO channel.cc:34 Cuda Host Alloc Size 1048576 pointer 0x7f15ecd00000
vm1:58934:58944 [0] NCCL INFO channel.cc:20 Cuda Alloc Size 8 pointer 0x7f1606608c00
vm1:58934:58944 [0] NCCL INFO channel.cc:24 Cuda Alloc Size 1824 pointer 0x7f1606608e00
vm1:58934:58944 [0] NCCL INFO channel.cc:34 Cuda Host Alloc Size 1048576 pointer 0x7f15ece00000
vm1:58934:58944 [0] NCCL INFO channel.cc:20 Cuda Alloc Size 8 pointer 0x7f1606609600
vm1:58934:58944 [0] NCCL INFO channel.cc:24 Cuda Alloc Size 1824 pointer 0x7f1606609800
vm1:58934:58944 [0] NCCL INFO channel.cc:34 Cuda Host Alloc Size 1048576 pointer 0x7f15ecf00000
vm1:58934:58944 [0] NCCL INFO 10 coll channels, 16 p2p channels, 2 p2p channels per peer
vm1:58934:58945 [0] NCCL INFO New proxy send connection 20 from 192.168.0.12<55744>, transport 2
vm1:58934:58944 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f15f8000e40
vm1:58934:58944 [0] NCCL INFO init.cc:273 Cuda Alloc Size 16424 pointer 0x7f160660a000
vm1:58934:58944 [0] NCCL INFO comm 0x7f1600000d00 rank 1 nranks 2 cudaDev 0 busId 84000 - Init COMPLETE
vm1:58934:58934 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f170e000000 recvbuff 0x7f168e000000 count 536870912 datatype 7 op 0 root 0 comm 0x7f1600000d00 [nranks=2] stream 0x2b01040
vm1:58934:58945 [0] NCCL INFO transport/net.cc:377 Cuda Alloc Size 33554432 pointer 0x7f15ea000000
vm1:58934:58934 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f170e000000 recvbuff 0x7f168e000000 count 536870912 datatype 7 op 0 root 0 comm 0x7f1600000d00 [nranks=2] stream 0x2b01040
vm1:58934:58934 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f170e000000 recvbuff 0x7f168e000000 count 536870912 datatype 7 op 0 root 0 comm 0x7f1600000d00 [nranks=2] stream 0x2b01040
vm1:58934:58934 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f170e000000 recvbuff 0x7f168e000000 count 536870912 datatype 7 op 0 root 0 comm 0x7f1600000d00 [nranks=2] stream 0x2b01040
vm1:58934:58934 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f170e000000 recvbuff 0x7f168e000000 count 536870912 datatype 7 op 0 root 0 comm 0x7f1600000d00 [nranks=2] stream 0x2b01040

vm2:15364:15376 [0] transport/net_ib.cc:1192 NCCL WARN NET/IB : Got completion from peer 192.168.0.12<60054> with error 4, opcode 32704, len 0, vendor err 81
vm2:15364:15376 [0] NCCL INFO include/net.h:32 -> 2
vm2:15364:15376 [0] NCCL INFO transport/net.cc:870 -> 2
vm2:15364:15376 [0] NCCL INFO proxy.cc:494 -> 2
vm2:15364:15376 [0] NCCL INFO proxy.cc:614 -> 2 [Proxy Thread]
vm2:15364:15364 [0] NCCL INFO Created 1 queue info, destroyed 1

vm1:58934:58946 [0] transport/net_ib.cc:1192 NCCL WARN NET/IB : Got completion from peer 192.168.0.19<54698> with error 4, opcode 32533, len 32535, vendor err 81
vm1:58934:58946 [0] NCCL INFO include/net.h:32 -> 2
vm1:58934:58946 [0] NCCL INFO transport/net.cc:870 -> 2
vm1:58934:58946 [0] NCCL INFO proxy.cc:494 -> 2
vm1:58934:58946 [0] NCCL INFO proxy.cc:614 -> 2 [Proxy Thread]
vm1:58934:58934 [0] NCCL INFO Created 1 queue info, destroyed 1
[vm2.com:15359] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
[vm2.com:15359] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
vm1:58934:58934 [0] NCCL INFO comm 0x7f1600000d00 rank 1 nranks 2 cudaDev 0 busId 84000 - Abort COMPLETE
vm1: Test NCCL failure common.cu:499 'unhandled system error'
 .. vm1 pid 58934: Test failure common.cu:587
 .. vm1 pid 58934: Test failure common.cu:766
 .. vm1 pid 58934: Test failure all_reduce.cu:103
 .. vm1 pid 58934: Test failure common.cu:792
 .. vm1 pid 58934: Test failure common.cu:1166
 .. vm1 pid 58934: Test failure common.cu:1007
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
vm2:15364:15364 [0] NCCL INFO comm 0x7fc030000d00 rank 0 nranks 2 cudaDev 0 busId 84000 - Abort COMPLETE
vm2: Test NCCL failure common.cu:499 'unhandled system error'
 .. vm2 pid 15364: Test failure common.cu:587
 .. vm2 pid 15364: Test failure common.cu:766
 .. vm2 pid 15364: Test failure all_reduce.cu:103
 .. vm2 pid 15364: Test failure common.cu:792
 .. vm2 pid 15364: Test failure common.cu:1166
 .. vm2 pid 15364: Test failure common.cu:1007
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[59005,1],1]
  Exit code:    3

What is the reason for this issue? Thanks~

lixuehui08 commented 2 years ago

The topology of the two VMs is as below:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  CPU Affinity  NUMA Affinity
GPU0     X      NV1     NV2     NV1     SYS     NV2     SYS     SYS     SYS     SYS     0-39          0
GPU1    NV1      X      NV1     NV2     NV2     SYS     SYS     SYS     SYS     SYS     0-39          0
GPU2    NV2     NV1      X      NV2     SYS     SYS     SYS     NV1     SYS     SYS     0-39          0
GPU3    NV1     NV2     NV2      X      SYS     SYS     NV1     SYS     SYS     SYS     0-39          0
GPU4    SYS     NV2     SYS     SYS      X      NV1     NV2     NV1     NODE    PIX     40-79         1
GPU5    NV2     SYS     SYS     SYS     NV1      X      NV1     NV2     NODE    PIX     40-79         1
GPU6    SYS     SYS     SYS     NV1     NV2     NV1      X      NV2     PIX     NODE    40-79         1
GPU7    SYS     SYS     NV1     SYS     NV1     NV2     NV2      X      PIX     NODE    40-79         1
mlx5_0  SYS     SYS     SYS     SYS     NODE    NODE    PIX     PIX     X       NODE
mlx5_1  SYS     SYS     SYS     SYS     PIX     PIX     NODE    NODE    NODE    X
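
The matrix above is standard nvidia-smi topology output and can be regenerated on each node with:

nvidia-smi topo -m

In its legend, NV# is a bonded set of # NVLinks, PIX traverses at most a single PCIe bridge, NODE stays within one NUMA node, and SYS crosses the inter-socket interconnect. For GPU Direct RDMA, the GPU-to-HCA path is what matters (e.g. GPU6/GPU7 to mlx5_0 at PIX here).
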
sjeaugey commented 2 years ago

What do you mean by "pcie acs capability not emulated"? GPU Direct RDMA can't really work within a VM without ACS. Can you confirm ACS is enabled and functional within the VM?

lixuehui08 commented 2 years ago

@sjeaugey Thanks for your answer~ We enabled ACS within the VM; each of the nine devices now reports:

ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-

But the nccl-tests error still exists, and nothing changes in the error stack.
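
For reference, per-device ACS state like the above can be collected with lspci; a minimal sketch (run as root inside the VM):

# print the ACS control register of every PCI function that exposes one
for bdf in $(lspci | awk '{print $1}'); do
  acs=$(lspci -vvv -s "$bdf" 2>/dev/null | grep -m1 ACSCtl)
  [ -n "$acs" ] && echo "$bdf:$acs"
done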

Meanwhile, we found errors in the kernel logs:

Apr 2 14:23:11 vm2 kernel: nvidia-uvm: Loaded the UVM driver, major device number 233.
Apr 2 14:23:36 vm2 kernel: NVRM: GPU at PCI:0000:63:00: GPU-7cfd2d7b-ed7a-5d3c-5d24-35a7bb9fc9c9
Apr 2 14:23:36 vm2 kernel: NVRM: GPU Board Serial Number: 1560121002972
Apr 2 14:23:36 vm2 kernel: NVRM: Xid (PCI:0000:63:00): 31, pid=2783, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f0f_0e000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
Apr 2 14:23:36 vm2 kernel: NVRM: GPU at PCI:0000:45:00: GPU-bdd6153f-0f03-4a81-4476-dd6b366afc77
Apr 2 14:23:36 vm2 kernel: NVRM: GPU Board Serial Number: 1560121003491
Apr 2 14:23:36 vm2 kernel: NVRM: Xid (PCI:0000:45:00): 31, pid=2783, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f0f_4c001000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE

Which direction should we investigate next? Thanks~

sjeaugey commented 2 years ago

I'm not a GPU Direct / ACS expert unfortunately. All I know is that you need ACS to work for GPU Direct RDMA to be functional within the VM.

You can always disable GPU Direct RDMA (NCCL_NET_GDR_LEVEL=0) to verify this is indeed the issue. But beyond that, I'm not sure how I can help.
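
With an Open MPI launch, the variable can be exported to all ranks like this (a sketch; hostnames and the test binary path are placeholders):

mpirun -np 2 -H vm1,vm2 \
  -x NCCL_DEBUG=INFO -x NCCL_NET_GDR_LEVEL=0 \
  ./build/all_reduce_perf -b 2G -e 2G -g 1

If the run then completes (with GDR disabled, NCCL stages network transfers through host memory instead of writing directly to GPU memory), that would confirm the failure is specific to the GPU Direct RDMA path in the virtualized PCIe topology.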