Description of changes:
This PR integrates multi-node NCCL testing into the tester package. The tester now accepts the following flags to configure the test:
ncclTestImage: Specifies the base image used to run the multi-node NCCL test.
efaEnabled: Determines whether the test uses EFA in the cluster.
nodeType: Specifies which node type in the node groups runs the multi-node NCCL test. If it is not set, the tester uses a node type found in the node groups.
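The three flags above could be declared along the following lines. This is a minimal sketch, not the code in this PR: the `ncclTestConfig` struct, the `parseNCCLTestFlags` helper, the default values, and the example image name are all assumptions made for illustration. Only the flag names come from the description.

```go
package main

import (
	"flag"
	"fmt"
)

// ncclTestConfig holds the values of the three flags described above.
// The struct and its field names are hypothetical.
type ncclTestConfig struct {
	NCCLTestImage string
	EFAEnabled    bool
	NodeType      string
}

// parseNCCLTestFlags registers the flags on a private FlagSet and parses args.
// The defaults (empty strings, false) are assumptions, not the PR's defaults.
func parseNCCLTestFlags(args []string) (ncclTestConfig, error) {
	var cfg ncclTestConfig
	fs := flag.NewFlagSet("nccl-test", flag.ContinueOnError)
	fs.StringVar(&cfg.NCCLTestImage, "ncclTestImage", "", "base image for the multi-node NCCL test")
	fs.BoolVar(&cfg.EFAEnabled, "efaEnabled", false, "whether to use EFA in the cluster")
	fs.StringVar(&cfg.NodeType, "nodeType", "", "node type that runs the multi-node NCCL test")
	if err := fs.Parse(args); err != nil {
		return ncclTestConfig{}, err
	}
	return cfg, nil
}

func main() {
	// The image name below is a placeholder.
	cfg, err := parseNCCLTestFlags([]string{
		"-ncclTestImage", "example.com/nccl-test:latest",
		"-efaEnabled=true",
	})
	if err != nil {
		fmt.Println("flag error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Leaving `nodeType` empty in this sketch corresponds to the "No node type specified" case seen in the test output, where the tester chooses a node type from the node groups.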
The tester retrieves the hardware specifications from the nodes and renders the NCCL test manifest based on them.
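As a rough illustration of rendering a manifest from node specifications, the sketch below fills a `text/template` from a small struct. Everything in it is an assumption: the `nodeSpecs` type, the `renderManifest` helper, the template fragment, and the resource names `nvidia.com/gpu` and `vpc.amazonaws.com/efa` are not taken from this PR, and the numbers in `main` are placeholders rather than measured values.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// nodeSpecs is a hypothetical summary of the hardware discovered on the nodes.
type nodeSpecs struct {
	Image       string
	NodeCount   int
	GPUsPerNode int
	EFAsPerNode int
}

// manifestTemplate is an illustrative fragment, not the manifest used by the tester.
// The EFA resource line is emitted only when EFAsPerNode is greater than zero.
const manifestTemplate = `image: {{ .Image }}
replicas: {{ .NodeCount }}
resources:
  limits:
    nvidia.com/gpu: {{ .GPUsPerNode }}
{{- if gt .EFAsPerNode 0 }}
    vpc.amazonaws.com/efa: {{ .EFAsPerNode }}
{{- end }}
`

// renderManifest executes the template against the given specs.
func renderManifest(specs nodeSpecs) (string, error) {
	tmpl, err := template.New("nccl-test").Parse(manifestTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, specs); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Placeholder values for illustration only.
	out, err := renderManifest(nodeSpecs{
		Image:       "example.com/nccl-test:latest",
		NodeCount:   2,
		GPUsPerNode: 8,
		EFAsPerNode: 1,
	})
	if err != nil {
		fmt.Println("render error:", err)
		return
	}
	fmt.Print(out)
}
```

Keeping the hardware values in one struct means the same template can serve node types with and without EFA devices.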
Testing
go test -v . -args -efaImage 665181186642.dkr.ecr.us-west-2.amazonaws.com/aws-k8s-tester/nccl-test:latest -skip-features single-node -efaEnabled=true
W0610 22:55:51.199584 28686 warnings.go:70] spec.template.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key: beta.kubernetes.io/instance-type is deprecated since v1.17; use "node.kubernetes.io/instance-type" instead
W0610 22:55:51.199630 28686 warnings.go:70] spec.template.metadata.annotations[scheduler.alpha.kubernetes.io/critical-pod]: non-functional in v1.16+; use the "priorityClassName" field instead
2024/06/10 22:55:56 No node type specified. Using the node type p3dn.24xlarge in the node groups.
=== RUN TestMPIJobPytorchTraining
=== RUN TestMPIJobPytorchTraining/single-node
env.go:438: Skipping feature: "single-node": name matched
=== RUN TestMPIJobPytorchTraining/multi-node
=== RUN TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds
--- PASS: TestMPIJobPytorchTraining (40.05s)
--- SKIP: TestMPIJobPytorchTraining/single-node (0.00s)
--- PASS: TestMPIJobPytorchTraining/multi-node (40.05s)
--- PASS: TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds (40.01s)
PASS
ok github.com/aws/aws-k8s-tester/e2e2/test/cases/nvidia 57.324s
Issue #, if available:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.