aws / aws-k8s-tester

AWS Kubernetes tester, kubetest2 deployer implementation
Apache License 2.0

Verify GPU Direct RDMA is used on supported instance. #473

Closed: weicongw closed this 1 month ago

weicongw commented 1 month ago

Issue #, if available:

Description of changes:

Fail the multi-node NCCL test when the NCCL logs show that GPU Direct RDMA (GDRDMA) is not used for inter-node communication on an instance type that supports it, such as p5.48xlarge.

Testing

NCCL test on bad AMI:

...
        [1,9]<stdout>:multi-node-nccl-test-worker-1:21:69 [1] NCCL INFO Channel 01/0 : 9[1] -> 8[0] via P2P/IPC
        [1,8]<stdout>:multi-node-nccl-test-worker-1:20:67 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [send] via NET/Socket/0
...
        [1,10]<stderr>:libfabric:22:1725400706::efa:domain:efa_domain_hmem_info_init_cuda():169<warn> Failed to register CUDA buffer with the EFA device, FI_HMEM transfers that require peer to peer support will fail.
...
        [1,0]<stdout>:multi-node-nccl-test-worker-0:20:20 [0] NCCL INFO comm 0x55a496b65e90 rank 0 nranks 16 cudaDev 0 busId 53000 - Destroy COMPLETE
        [1,0]<stdout>:# Out of bounds values : 0 OK
        [1,0]<stdout>:# Avg bus bandwidth    : 1.00991 
        [1,0]<stdout>:#
...
    mpi_test.go:137: GPU Direct RDMA is not utilized for inter-node communication in NCCL tests on instances that support GDRDMA: p5.48xlarge
--- FAIL: TestMPIJobPytorchTraining (751.57s)
    --- SKIP: TestMPIJobPytorchTraining/single-node (0.00s)
    --- FAIL: TestMPIJobPytorchTraining/multi-node (751.57s)
        --- FAIL: TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds (750.79s)
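
The failure above comes down to the transport named at the end of each NCCL channel line: `via NET/Socket/0` on the bad AMI, as opposed to a `.../GDRDMA` suffix. A minimal sketch of such a check follows. It is not the code in `mpi_test.go`; the names `gdrdmaChannel` and `usesGDRDMA` are hypothetical, and the pattern is inferred only from the log lines quoted in this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// gdrdmaChannel is a hypothetical pattern, derived from the log lines quoted
// in this PR, for an NCCL channel that sends over the AWS Libfabric (EFA)
// network transport with GPU Direct RDMA enabled.
var gdrdmaChannel = regexp.MustCompile(
	`NCCL INFO Channel \d+/\d+ : .* via NET/AWS Libfabric/\d+/GDRDMA`)

// usesGDRDMA reports whether at least one channel line in the given NCCL
// log output uses the GDRDMA transport. In Go regexps "." does not match a
// newline, so a match can never span two log lines.
func usesGDRDMA(logs string) bool {
	return gdrdmaChannel.MatchString(logs)
}

func main() {
	// Sample lines copied from the bad-AMI and good-AMI runs in this PR.
	bad := "[1,8]<stdout>:multi-node-nccl-test-worker-1:20:67 [0] NCCL INFO " +
		"Channel 00/0 : 8[0] -> 0[0] [send] via NET/Socket/0"
	good := "[1,7]<stdout>:multi-node-nccl-test-worker-0:33:74 [7] NCCL INFO " +
		"Channel 09/0 : 7[7] -> 15[7] [send] via NET/AWS Libfabric/7/GDRDMA"

	fmt.Println("bad AMI uses GDRDMA: ", usesGDRDMA(bad))
	fmt.Println("good AMI uses GDRDMA:", usesGDRDMA(good))
}
```

A check like this only makes sense on instance types that support GDRDMA, which is why the failure message names the instance type.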

NCCL test on good AMI:

        [1,3]<stdout>:multi-node-nccl-test-worker-0:24:79 [3] NCCL INFO NET/OFI Libfabric provider associates MRs with domains
        [1,7]<stdout>:multi-node-nccl-test-worker-0:33:74 [7] NCCL INFO Channel 09/0 : 7[7] -> 15[7] [send] via NET/AWS Libfabric/7/GDRDMA
...
        [1,0]<stdout>:# Out of bounds values : 0 OK
        [1,0]<stdout>:# Avg bus bandwidth    : 21.2232 
        [1,0]<stdout>:#
...

--- PASS: TestMPIJobPytorchTraining (271.88s)
    --- SKIP: TestMPIJobPytorchTraining/single-node (0.00s)
    --- PASS: TestMPIJobPytorchTraining/multi-node (271.88s)
        --- PASS: TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds (271.09s)
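
The two runs also differ by roughly a factor of 21 in the reported `Avg bus bandwidth` (1.00991 on the bad AMI, 21.2232 on the good one). As an illustration only, the value could be pulled out of the job output as sketched below. The helper name `parseAvgBusBandwidth` is hypothetical, and this PR does not say that the test inspects bandwidth at all.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// avgBusBW matches the summary line printed at the end of the NCCL test
// output, as quoted in this PR, and captures the numeric value.
var avgBusBW = regexp.MustCompile(`# Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?)`)

// parseAvgBusBandwidth returns the first average bus bandwidth value found
// in logs. The boolean is false when no summary line is present or the
// number cannot be parsed.
func parseAvgBusBandwidth(logs string) (float64, bool) {
	m := avgBusBW.FindStringSubmatch(logs)
	if m == nil {
		return 0, false
	}
	v, err := strconv.ParseFloat(m[1], 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	// Summary lines copied from the two runs in this PR.
	runs := map[string]string{
		"bad AMI":  "[1,0]<stdout>:# Avg bus bandwidth    : 1.00991 ",
		"good AMI": "[1,0]<stdout>:# Avg bus bandwidth    : 21.2232 ",
	}
	for name, line := range runs {
		if bw, ok := parseAvgBusBandwidth(line); ok {
			fmt.Printf("%s: avg bus bandwidth %.4f\n", name, bw)
		}
	}
}
```

Matching on the transport string is the more direct signal. A bandwidth threshold would depend on instance type and message sizes, so it is shown here only to make the gap between the two runs concrete.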

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.