aws / aws-k8s-tester

AWS Kubernetes tester, kubetest2 deployer implementation
Apache License 2.0

Add BERT e2e training test #467

Open mattcjo opened 2 months ago

mattcjo commented 2 months ago

Issue #, if available:

Description of changes:

This PR adds an end-to-end (E2E) BERT training test. The test was validated on a four-node cluster of p3.16xlarge instances (8 GPUs per node, 32 GPUs in total).

The results of the training run are shown below. The logs were collected from the master pod that coordinated the E2E BERT training job. Each of the 32 processes (ranks 0 to 31) reports its own training time and throughput.

[1,31]<stdout>:Process 31 - Training time: 10.09 seconds
[1,31]<stdout>:Process 31 - Throughput: 9.91 samples/second
[1,29]<stdout>:Process 29 - Training time: 10.05 seconds
[1,29]<stdout>:Process 29 - Throughput: 9.95 samples/second
[1,28]<stdout>:Process 28 - Training time: 10.09 seconds
[1,28]<stdout>:Process 28 - Throughput: 9.91 samples/second
[1,25]<stdout>:Process 25 - Training time: 10.04 seconds
[1,25]<stdout>:Process 25 - Throughput: 9.96 samples/second
[1,27]<stdout>:Process 27 - Training time: 10.10 seconds
[1,27]<stdout>:Process 27 - Throughput: 9.90 samples/second
[1,20]<stdout>:Process 20 - Training time: 10.09 seconds
[1,20]<stdout>:Process 20 - Throughput: 9.91 samples/second
[1,3]<stdout>:Process 3 - Training time: 10.07 seconds
[1,3]<stdout>:Process 3 - Throughput: 9.93 samples/second
[1,0]<stdout>:Process 0 - Training time: 10.03 seconds
[1,0]<stdout>:Process 0 - Throughput: 9.97 samples/second
[1,23]<stdout>:Process 23 - Training time: 10.04 seconds
[1,23]<stdout>:Process 23 - Throughput: 9.96 samples/second
[1,24]<stdout>:Process 24 - Training time: 10.10 seconds
[1,24]<stdout>:Process 24 - Throughput: 9.90 samples/second
[1,2]<stdout>:Process 2 - Training time: 10.14 seconds
[1,2]<stdout>:Process 2 - Throughput: 9.86 samples/second
[1,5]<stdout>:Process 5 - Training time: 10.08 seconds
[1,5]<stdout>:Process 5 - Throughput: 9.92 samples/second
[1,21]<stdout>:Process 21 - Training time: 10.08 seconds
[1,21]<stdout>:Process 21 - Throughput: 9.92 samples/second
[1,22]<stdout>:Process 22 - Training time: 10.07 seconds
[1,22]<stdout>:Process 22 - Throughput: 9.93 samples/second
[1,30]<stdout>:Process 30 - Training time: 10.09 seconds
[1,30]<stdout>:Process 30 - Throughput: 9.91 samples/second
[1,1]<stdout>:Process 1 - Training time: 10.07 seconds
[1,1]<stdout>:Process 1 - Throughput: 9.93 samples/second
[1,17]<stdout>:Process 17 - Training time: 10.11 seconds
[1,17]<stdout>:Process 17 - Throughput: 9.89 samples/second
[1,12]<stdout>:Process 12 - Training time: 10.01 seconds
[1,12]<stdout>:Process 12 - Throughput: 9.99 samples/second
[1,6]<stdout>:Process 6 - Training time: 10.04 seconds
[1,6]<stdout>:Process 6 - Throughput: 9.96 samples/second
[1,18]<stdout>:Process 18 - Training time: 10.12 seconds
[1,18]<stdout>:Process 18 - Throughput: 9.88 samples/second
[1,7]<stdout>:Process 7 - Training time: 10.11 seconds
[1,7]<stdout>:Process 7 - Throughput: 9.89 samples/second
[1,15]<stdout>:Process 15 - Training time: 10.14 seconds
[1,15]<stdout>:Process 15 - Throughput: 9.86 samples/second
[1,19]<stdout>:Process 19 - Training time: 10.12 seconds
[1,19]<stdout>:Process 19 - Throughput: 9.89 samples/second
[1,14]<stdout>:Process 14 - Training time: 9.96 seconds
[1,14]<stdout>:Process 14 - Throughput: 10.04 samples/second
[1,13]<stdout>:Process 13 - Training time: 10.05 seconds
[1,13]<stdout>:Process 13 - Throughput: 9.95 samples/second
[1,16]<stdout>:Process 16 - Training time: 10.10 seconds
[1,16]<stdout>:Process 16 - Throughput: 9.90 samples/second
[1,26]<stdout>:Process 26 - Training time: 10.11 seconds
[1,26]<stdout>:Process 26 - Throughput: 9.89 samples/second
[1,10]<stdout>:Process 10 - Training time: 10.12 seconds
[1,10]<stdout>:Process 10 - Throughput: 9.88 samples/second
[1,11]<stdout>:Process 11 - Training time: 10.10 seconds
[1,11]<stdout>:Process 11 - Throughput: 9.90 samples/second
[1,8]<stdout>:Process 8 - Training time: 10.09 seconds
[1,8]<stdout>:Process 8 - Throughput: 9.91 samples/second
[1,4]<stdout>:Process 4 - Training time: 10.05 seconds
[1,4]<stdout>:Process 4 - Throughput: 9.95 samples/second
[1,9]<stdout>:Process 9 - Training time: 10.08 seconds
[1,9]<stdout>:Process 9 - Throughput: 9.92 samples/second
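For reviewers who want a summary rather than 64 individual lines, the per-process output above could be aggregated with a short script. The following is a minimal sketch, not part of this PR: the function names `parse_metrics` and `summarize` are hypothetical, the regular expression assumes the exact line format shown in the logs, and summing per-process throughput assumes each process works on its own samples (as in data-parallel training), which the PR does not state explicitly.

```python
import re
from statistics import mean

# Matches lines such as:
#   [1,31]<stdout>:Process 31 - Throughput: 9.91 samples/second
# The format is assumed from the log excerpt in this PR description.
LINE_RE = re.compile(
    r"^\[\d+,(?P<rank>\d+)\]<stdout>:Process \d+ - "
    r"(?P<metric>Training time|Throughput): (?P<value>\d+(?:\.\d+)?)"
)


def parse_metrics(log_text):
    """Return {rank: {"Training time": float, "Throughput": float}}."""
    metrics = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a per-process metric line
        rank = int(match.group("rank"))
        metrics.setdefault(rank, {})[match.group("metric")] = float(match.group("value"))
    return metrics


def summarize(metrics):
    """Aggregate per-process metrics into a few headline numbers."""
    times = [m["Training time"] for m in metrics.values() if "Training time" in m]
    rates = [m["Throughput"] for m in metrics.values() if "Throughput" in m]
    return {
        "processes": len(metrics),
        "mean_training_time_s": mean(times),
        "slowest_training_time_s": max(times),
        # Assumption: processes handle disjoint samples, so rates add up.
        "aggregate_throughput": sum(rates),
    }


# Four lines copied from the logs above, used as a small sample input.
SAMPLE = """\
[1,31]<stdout>:Process 31 - Training time: 10.09 seconds
[1,31]<stdout>:Process 31 - Throughput: 9.91 samples/second
[1,14]<stdout>:Process 14 - Training time: 9.96 seconds
[1,14]<stdout>:Process 14 - Throughput: 10.04 samples/second
"""

if __name__ == "__main__":
    print(summarize(parse_metrics(SAMPLE)))
```

Feeding the full master pod log to `parse_metrics` instead of `SAMPLE` would summarize all 32 processes. Reporting the slowest process alongside the mean is a deliberate choice, since a synchronized training job typically finishes only when its slowest rank does.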

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.