Verify that GPU Direct RDMA is used on supported instance types.
Update the NVIDIA and EFA plugins to the latest versions.
Correct the persistence status of the P5 instance.
Testing
NCCL test on bad AMI:
...
[1,9]<stdout>:multi-node-nccl-test-worker-1:21:69 [1] NCCL INFO Channel 01/0 : 9[1] -> 8[0] via P2P/IPC
[1,8]<stdout>:multi-node-nccl-test-worker-1:20:67 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [send] via NET/Socket/0
...
[1,10]<stderr>:libfabric:22:1725400706::efa:domain:efa_domain_hmem_info_init_cuda():169<warn> Failed to register CUDA buffer with the EFA device, FI_HMEM transfers that require peer to peer support will fail.
...
[1,0]<stdout>:multi-node-nccl-test-worker-0:20:20 [0] NCCL INFO comm 0x55a496b65e90 rank 0 nranks 16 cudaDev 0 busId 53000 - Destroy COMPLETE
[1,0]<stdout>:# Out of bounds values : 0 OK
[1,0]<stdout>:# Avg bus bandwidth : 1.00991
[1,0]<stdout>:#
...
mpi_test.go:137: GPU Direct RDMA is not utilized for inter-node communication in NCCL tests on instances that support GDRDMA: p5.48xlarge
--- FAIL: TestMPIJobPytorchTraining (751.57s)
--- SKIP: TestMPIJobPytorchTraining/single-node (0.00s)
--- FAIL: TestMPIJobPytorchTraining/multi-node (751.57s)
--- FAIL: TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds (750.79s)
NCCL test on good AMI:
[1,3]<stdout>:multi-node-nccl-test-worker-0:24:79 [3] NCCL INFO NET/OFI Libfabric provider associates MRs with domains
[1,7]<stdout>:multi-node-nccl-test-worker-0:33:74 [7] NCCL INFO Channel 09/0 : 7[7] -> 15[7] [send] via NET/AWS Libfabric/7/GDRDMA
...
[1,0]<stdout>:# Out of bounds values : 0 OK
[1,0]<stdout>:# Avg bus bandwidth : 21.2232
[1,0]<stdout>:#
...
--- PASS: TestMPIJobPytorchTraining (271.88s)
--- SKIP: TestMPIJobPytorchTraining/single-node (0.00s)
--- PASS: TestMPIJobPytorchTraining/multi-node (271.88s)
--- PASS: TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds (271.09s)
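The difference between the two runs is visible in the NCCL channel lines: the bad AMI sends inter-node traffic "via NET/Socket/0", while the good AMI sends it "via NET/AWS Libfabric/7/GDRDMA". A minimal sketch of how such a check could be written in Go is shown here. It is not the code in mpi_test.go; the names usesGDRDMA and gdrdmaSend are hypothetical, and the pattern assumes the log format shown above.

```go
package main

import (
	"bufio"
	"regexp"
	"strings"
)

// gdrdmaSend matches NCCL channel lines such as
// "... [send] via NET/AWS Libfabric/7/GDRDMA".
// Hypothetical pattern, based only on the log lines in this description.
var gdrdmaSend = regexp.MustCompile(`\[send\] via NET/AWS Libfabric/\d+/GDRDMA`)

// usesGDRDMA reports whether at least one line of the NCCL log shows an
// inter-node send over the AWS Libfabric transport with GDRDMA.
func usesGDRDMA(log string) bool {
	scanner := bufio.NewScanner(strings.NewReader(log))
	// NCCL and libfabric log lines can be long, so raise the line limit
	// above bufio.Scanner's 64 KiB default.
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for scanner.Scan() {
		if gdrdmaSend.MatchString(scanner.Text()) {
			return true
		}
	}
	return false
}
```

A test could call usesGDRDMA on the collected launcher log and fail when it returns false on an instance type that supports GDRDMA, such as p5.48xlarge.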
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.