aws / aws-k8s-tester

AWS Kubernetes tester, kubetest2 deployer implementation
Apache License 2.0

Integrate multi-node nccl testing into the tester package #447

Closed: weicongw closed this 4 months ago

weicongw commented 4 months ago

Issue #, if available:

Description of changes: This PR integrates multi-node NCCL testing into the tester package. The tester now accepts flags to configure the test; the test run below passes -efaImage and -efaEnabled.

The tester retrieves the hardware specifications from the nodes and renders the NCCL test manifest from them. If no node type is specified, it uses the node type found in the node groups (p3dn.24xlarge in the run below).

Testing

go test -v . -args -efaImage 665181186642.dkr.ecr.us-west-2.amazonaws.com/aws-k8s-tester/nccl-test:latest -skip-features single-node -efaEnabled=true
W0610 22:55:51.199584   28686 warnings.go:70] spec.template.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key: beta.kubernetes.io/instance-type is deprecated since v1.17; use "node.kubernetes.io/instance-type" instead
W0610 22:55:51.199630   28686 warnings.go:70] spec.template.metadata.annotations[scheduler.alpha.kubernetes.io/critical-pod]: non-functional in v1.16+; use the "priorityClassName" field instead
2024/06/10 22:55:56 No node type specified. Using the node type p3dn.24xlarge in the node groups.
=== RUN   TestMPIJobPytorchTraining
=== RUN   TestMPIJobPytorchTraining/single-node
    env.go:438: Skipping feature: "single-node": name matched
=== RUN   TestMPIJobPytorchTraining/multi-node
=== RUN   TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds
--- PASS: TestMPIJobPytorchTraining (40.05s)
    --- SKIP: TestMPIJobPytorchTraining/single-node (0.00s)
    --- PASS: TestMPIJobPytorchTraining/multi-node (40.05s)
        --- PASS: TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds (40.01s)
PASS
ok      github.com/aws/aws-k8s-tester/e2e2/test/cases/nvidia    57.324s

Testing pod logs

...
[1,0]<stdout>:#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
[1,0]<stdout>:#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
[1,0]<stdout>:           8             2     float     sum      -1    52.23    0.00    0.00      0    51.88    0.00    0.00      0
[1,0]<stdout>:          16             4     float     sum      -1    51.30    0.00    0.00      0    51.15    0.00    0.00      0
[1,0]<stdout>:          32             8     float     sum      -1    51.54    0.00    0.00      0    51.09    0.00    0.00      0
[1,0]<stdout>:          64            16     float     sum      -1    51.29    0.00    0.00      0    50.73    0.00    0.00      0
[1,0]<stdout>:         128            32     float     sum      -1    51.35    0.00    0.00      0    50.61    0.00    0.00      0
[1,0]<stdout>:         256            64     float     sum      -1    51.20    0.01    0.01      0    50.47    0.01    0.01      0
[1,0]<stdout>:         512           128     float     sum      -1    50.65    0.01    0.02      0    49.16    0.01    0.02      0
[1,0]<stdout>:        1024           256     float     sum      -1    52.06    0.02    0.03      0    51.37    0.02    0.03      0
[1,0]<stdout>:        2048           512     float     sum      -1    55.56    0.04    0.06      0    54.80    0.04    0.07      0
[1,0]<stdout>:        4096          1024     float     sum      -1    60.02    0.07    0.12      0    59.12    0.07    0.12      0
[1,0]<stdout>:        8192          2048     float     sum      -1    61.74    0.13    0.23      0    60.72    0.13    0.24      0
[1,0]<stdout>:       16384          4096     float     sum      -1    64.68    0.25    0.44      0    63.69    0.26    0.45      0
[1,0]<stdout>:       32768          8192     float     sum      -1    71.38    0.46    0.80      0    70.84    0.46    0.81      0
[1,0]<stdout>:       65536         16384     float     sum      -1    74.78    0.88    1.53      0    74.54    0.88    1.54      0
[1,0]<stdout>:      131072         32768     float     sum      -1    80.97    1.62    2.83      0    79.26    1.65    2.89      0
[1,0]<stdout>:      262144         65536     float     sum      -1    80.99    3.24    5.66      0    78.00    3.36    5.88      0
[1,0]<stdout>:      524288        131072     float     sum      -1    84.01    6.24   10.92      0    83.20    6.30   11.03      0
[1,0]<stdout>:     1048576        262144     float     sum      -1    92.30   11.36   19.88      0    91.67   11.44   20.02      0
[1,0]<stdout>:     2097152        524288     float     sum      -1    114.6   18.29   32.01      0    112.5   18.64   32.62      0
[1,0]<stdout>:     4194304       1048576     float     sum      -1    147.6   28.42   49.74      0    145.3   28.86   50.51      0
[1,0]<stdout>:     8388608       2097152     float     sum      -1    196.9   42.59   74.54      0    197.1   42.55   74.47      0
[1,0]<stdout>:    16777216       4194304     float     sum      -1    288.9   58.07  101.62      0    288.0   58.25  101.95      0
[1,0]<stdout>:    33554432       8388608     float     sum      -1    508.5   65.98  115.47      0    508.6   65.97  115.45      0
[1,0]<stdout>:    67108864      16777216     float     sum      -1    953.6   70.38  123.16      0    954.6   70.30  123.02      0
[1,0]<stdout>:   134217728      33554432     float     sum      -1   1857.3   72.26  126.46      0   1860.8   72.13  126.22      0
[1,0]<stdout>:   268435456      67108864     float     sum      -1   3666.6   73.21  128.12      0   3673.1   73.08  127.89      0
[1,0]<stdout>:   536870912     134217728     float     sum      -1   7286.3   73.68  128.94      0   7299.7   73.55  128.71      0
[1,0]<stdout>:  1073741824     268435456     float     sum      -1    14494   74.08  129.65      0    14515   73.98  129.46      0
[1,0]<stdout>:  2147483648     536870912     float     sum      -1    28866   74.40  130.19      0    28892   74.33  130.08      0
[1,0]<stdout>:multi-node-nccl-test-worker-0:21:21 [0] NCCL INFO comm 0x55f94529a0b0 rank 0 nranks 8 cudaDev 0 busId 160 - Destroy COMPLETE
[1,0]<stdout>:# Out of bounds values : 0 OK
[1,0]<stdout>:# Avg bus bandwidth    : 40.7925 
[1,0]<stdout>:#
[1,1]<stdout>:multi-node-nccl-test-worker-0:22:22 [1] NCCL INFO comm 0x56151cbd4340 rank 1 nranks 8 cudaDev 1 busId 170 - Destroy COMPLETE
[1,2]<stdout>:multi-node-nccl-test-worker-0:23:23 [2] NCCL INFO comm 0x5556c80d7a40 rank 2 nranks 8 cudaDev 2 busId 180 - Destroy COMPLETE
[1,3]<stdout>:multi-node-nccl-test-worker-0:24:24 [3] NCCL INFO comm 0x55ddec7c3630 rank 3 nranks 8 cudaDev 3 busId 190 - Destroy COMPLETE
[1,7]<stdout>:multi-node-nccl-test-worker-0:31:31 [7] NCCL INFO comm 0x55df10ef2500 rank 7 nranks 8 cudaDev 7 busId 1d0 - Destroy COMPLETE
[1,5]<stdout>:multi-node-nccl-test-worker-0:26:26 [5] NCCL INFO comm 0x561b7c28f430 rank 5 nranks 8 cudaDev 5 busId 1b0 - Destroy COMPLETE
[1,6]<stdout>:multi-node-nccl-test-worker-0:29:29 [6] NCCL INFO comm 0x5648feb716d0 rank 6 nranks 8 cudaDev 6 busId 1c0 - Destroy COMPLETE
[1,4]<stdout>:multi-node-nccl-test-worker-0:25:25 [4] NCCL INFO comm 0x5569eef06e40 rank 4 nranks 8 cudaDev 4 busId 1a0 - Destroy COMPLETE
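For reading the table above, the algbw and busbw columns can be reproduced from the size and time columns. The sketch below assumes the output comes from an all-reduce benchmark (the "sum" redop and the busbw/algbw ratio of 1.75 with nranks 8 are consistent with that), in which case nccl-tests defines bus bandwidth as algbw * 2(n-1)/n. The function names are hypothetical.

```go
package main

import "fmt"

// algBW returns the algorithm bandwidth in GB/s for a message of
// sizeBytes that took timeUs microseconds (bytes/us divided by 1e3).
func algBW(sizeBytes, timeUs float64) float64 {
	return sizeBytes / timeUs / 1e3
}

// allReduceBusBW scales algorithm bandwidth by 2*(n-1)/n, the
// all-reduce correction factor documented by nccl-tests.
func allReduceBusBW(algbw float64, nranks int) float64 {
	n := float64(nranks)
	return algbw * 2 * (n - 1) / n
}

func main() {
	// Row from the log above: size 1048576 B, time 92.30 us, nranks 8.
	alg := algBW(1048576, 92.30)
	bus := allReduceBusBW(alg, 8)
	fmt.Printf("algbw=%.2f GB/s busbw=%.2f GB/s\n", alg, bus)
}
```

For that row this should reproduce the logged 11.36 and 19.88 GB/s. Rows whose time is printed with fewer digits (for example 28866 us) may differ from the log in the last decimal place because the displayed time is rounded.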

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.