Open lixuehui08 opened 2 years ago
The topology of the two VMs is as follows:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  CPU Affinity    NUMA Affinity
GPU0    X       NV1     NV2     NV1     SYS     NV2     SYS     SYS     SYS     SYS     0-39            0
GPU1    NV1     X       NV1     NV2     NV2     SYS     SYS     SYS     SYS     SYS     0-39            0
GPU2    NV2     NV1     X       NV2     SYS     SYS     SYS     NV1     SYS     SYS     0-39            0
GPU3    NV1     NV2     NV2     X       SYS     SYS     NV1     SYS     SYS     SYS     0-39            0
GPU4    SYS     NV2     SYS     SYS     X       NV1     NV2     NV1     NODE    PIX     40-79           1
GPU5    NV2     SYS     SYS     SYS     NV1     X       NV1     NV2     NODE    PIX     40-79           1
GPU6    SYS     SYS     SYS     NV1     NV2     NV1     X       NV2     PIX     NODE    40-79           1
GPU7    SYS     SYS     NV1     SYS     NV1     NV2     NV2     X       PIX     NODE    40-79           1
mlx5_0  SYS     SYS     SYS     SYS     NODE    NODE    PIX     PIX     X       NODE
mlx5_1  SYS     SYS     SYS     SYS     PIX     PIX     NODE    NODE    NODE    X
What do you mean by "pcie acs capability not emulated"? GPU Direct RDMA can't really work within a VM without ACS. Can you confirm ACS is enabled and functional within the VM?
@sjeaugey Thanks for your answer~ We enabled ACS within the VM, as shown below:
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
However, the NCCL tests error still occurs, and the error stack is unchanged.
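For comparing ACS state across devices, the `ACSCtl:` lines can be summarized programmatically. The following is a minimal sketch, assuming the text comes from `lspci -vvv` output; the function name `parse_acsctl` is hypothetical, and the script only reports which control bits are set. It does not judge which settings GPU Direct RDMA needs.

```python
import re

# Matches one flag token such as "SrcValid+" or "TransBlk-".
ACS_FLAG = re.compile(r"(\w+)([+-])")

def parse_acsctl(text):
    """Return one {flag_name: enabled} dict per 'ACSCtl:' line in the text."""
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("ACSCtl:"):
            continue
        body = line[len("ACSCtl:"):]
        results.append({name: sign == "+" for name, sign in ACS_FLAG.findall(body)})
    return results

sample = (
    "ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ "
    "UpstreamFwd- EgressCtrl- DirectTrans-"
)
for index, flags in enumerate(parse_acsctl(sample)):
    enabled = sorted(name for name, on in flags.items() if on)
    print(index, enabled)
```

Running the same summary on the host and inside the VM makes it easier to spot devices whose ACS control bits differ.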
Meanwhile, we found an error in the kernel logs, as follows:
Apr 2 14:23:11 vm2 kernel: nvidia-uvm: Loaded the UVM driver, major device number 233.
Apr 2 14:23:36 vm2 kernel: NVRM: GPU at PCI:0000:63:00: GPU-7cfd2d7b-ed7a-5d3c-5d24-35a7bb9fc9c9
Apr 2 14:23:36 vm2 kernel: NVRM: GPU Board Serial Number: 1560121002972
Apr 2 14:23:36 vm2 kernel: NVRM: Xid (PCI:0000:63:00): 31, pid=2783, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f0f_0e000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
Apr 2 14:23:36 vm2 kernel: NVRM: GPU at PCI:0000:45:00: GPU-bdd6153f-0f03-4a81-4476-dd6b366afc77
Apr 2 14:23:36 vm2 kernel: NVRM: GPU Board Serial Number: 1560121003491
Apr 2 14:23:36 vm2 kernel: NVRM: Xid (PCI:0000:45:00): 31, pid=2783, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f0f_4c001000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
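To see at a glance which GPUs reported Xid events, the kernel log can be filtered with a short script. This is a rough sketch, assuming log lines in the same shape as the ones quoted above; the regular expression and the function name `parse_xid_events` are illustrative and may need adjusting for other driver versions.

```python
import re

# Assumed line shape: "NVRM: Xid (PCI:<address>): <code>, pid=<pid>, ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+), pid=(\d+)")

def parse_xid_events(log_text):
    """Extract PCI address, Xid code, and pid from NVRM Xid log lines."""
    return [
        {"pci": m.group(1), "xid": int(m.group(2)), "pid": int(m.group(3))}
        for m in XID_RE.finditer(log_text)
    ]

log = """\
Apr 2 14:23:36 vm2 kernel: NVRM: Xid (PCI:0000:63:00): 31, pid=2783, Ch 00000010, intr 00000000.
Apr 2 14:23:36 vm2 kernel: NVRM: Xid (PCI:0000:45:00): 31, pid=2783, Ch 00000010, intr 00000000.
"""
for event in parse_xid_events(log):
    print(event["pci"], event["xid"], event["pid"])
```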
Which direction should we work in next? Thanks~
I'm not a GPU Direct / ACS expert unfortunately. All I know is that you need ACS to work for GPU Direct RDMA to be functional within the VM.
You can always disable GPU Direct RDMA (NCCL_NET_GDR_LEVEL=0) to verify this is indeed the issue. But beyond that, I'm not sure how I can help.
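As a sketch of that check, the snippet below builds an environment with GPU Direct RDMA disabled and NCCL debug logging turned on, which can then be passed to whatever launcher runs the NCCL tests. The helper name `nccl_env_without_gdr` is hypothetical, and adding `NCCL_DEBUG=INFO` is an assumption made here so the transport selection shows up in the logs.

```python
import os

def nccl_env_without_gdr(base_env=None):
    """Return a copy of the environment with GPU Direct RDMA disabled for NCCL."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_NET_GDR_LEVEL"] = "0"  # setting suggested in the thread above
    env["NCCL_DEBUG"] = "INFO"       # assumption: extra logging to inspect the chosen transport
    return env

env = nccl_env_without_gdr({"PATH": "/usr/bin"})
print(env["NCCL_NET_GDR_LEVEL"], env["NCCL_DEBUG"])
```

The resulting dictionary can be passed as `env=` to `subprocess.run` together with the usual test command. If the run succeeds with GDR disabled, that points to GPU Direct RDMA inside the VM as the failing piece.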
We tested GDR on bare metal with NCCL successfully, as shown below (PCIe ACS was disabled):
But when we tested this in virtual machines, we encountered the following error (PCIe ACS capability not emulated):
and:
What is the reason for this issue? Thanks~