GetJobLogs was using the same context as the main test context. As a result, when a timeout occurred, the context was already expired and the tests did not print the job logs. GetJobLogs now uses context.Background() so that the logs are always printed.
Add the e2e-neuron test binary to the kubetest2 Dockerfile.
Set the default disk storage size to 100 GB.
Enforce that the node group is in a single AZ.
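The GetJobLogs change can be illustrated with a minimal sketch. It is not the code from this PR: fetchJobLogs, expiredContext, and collectLogsAfterFailure are hypothetical stand-ins, and fetchJobLogs only imitates an API call that refuses to run on a finished context. The point is that a log fetch sharing the timed-out test context fails with the same "context deadline exceeded" error, while one using context.Background() still returns the logs.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchJobLogs is a hypothetical stand-in for GetJobLogs. Like a real
// API call, it returns the context's error if ctx is already done.
func fetchJobLogs(ctx context.Context, jobName string) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err
	}
	return "logs for " + jobName, nil
}

// expiredContext returns a context whose deadline has already passed,
// imitating the main test context after the test timed out.
func expiredContext() (context.Context, context.CancelFunc) {
	return context.WithDeadline(context.Background(), time.Now().Add(-time.Second))
}

// collectLogsAfterFailure mirrors the idea of the fix: the diagnostic
// path uses context.Background() instead of the test context.
func collectLogsAfterFailure(jobName string) (string, error) {
	return fetchJobLogs(context.Background(), jobName)
}

func main() {
	testCtx, cancel := expiredContext()
	defer cancel()

	// Old behavior: the log fetch shares the expired test context and fails.
	_, err := fetchJobLogs(testCtx, "unit-test-job")
	fmt.Println("shared context error:", err)
	fmt.Println("is deadline exceeded:", errors.Is(err, context.DeadlineExceeded))

	// New behavior: the log fetch is independent of the test deadline.
	logs, err := collectLogsAfterFailure("unit-test-job")
	fmt.Println("background context:", logs, err)
}
```

One trade-off of a bare context.Background() is that the log fetch has no deadline of its own. A variation, not part of this PR, would be to wrap it in context.WithTimeout so that log collection is still bounded.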
Test:
Verified that the logs are always printed, even when the timeout is reached:
go test -timeout 60m -v . -args -nvidiaTestImage public.ecr.aws/o5d5x8n6/weicongw:nvidia
2024/08/01 07:34:11 No node type specified. Using the node type p3.2xlarge in the node groups.
=== RUN TestMPIJobPytorchTraining
=== RUN TestMPIJobPytorchTraining/single-node
=== RUN TestMPIJobPytorchTraining/single-node/MPIJob_succeeds
mpi_test.go:60: context deadline exceeded
=== NAME TestMPIJobPytorchTraining/single-node
mpi_test.go:71: Test log for pytorch-training-single-node:
mpi_test.go:72: Cloning into '/pytorch-examples'...
Note: switching to '0f0c9131ca5c79d1332dce1f4c06fe942fbdc665'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:
git switch -c <new-branch-name>
Or undo this operation with:
git switch -
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at 0f0c913 Use regular dropout rather than dropout2d
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 9912422/9912422 [00:00<00:00, 123835640.95it/s]
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████| 28881/28881 [00:00<00:00, 27751590.80it/s]
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
100%|██████████| 1648877/1648877 [00:00<00:00, 105775069.92it/s]
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|██████████| 4542/4542 [00:00<00:00, 3642548.52it/s]
Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
Train Epoch: 1 [0/60000 (0%)] Loss: 2.305400
Train Epoch: 1 [640/60000 (1%)] Loss: 1.359780
Train Epoch: 1 [1280/60000 (2%)] Loss: 0.830670
Train Epoch: 1 [1920/60000 (3%)] Loss: 0.605961
Train Epoch: 1 [2560/60000 (4%)] Loss: 0.345934
Train Epoch: 1 [3200/60000 (5%)] Loss: 0.446331
Train Epoch: 1 [3840/60000 (6%)] Loss: 0.306768
Train Epoch: 1 [4480/60000 (7%)] Loss: 0.279325
Train Epoch: 1 [5120/60000 (9%)] Loss: 0.555025
Train Epoch: 1 [5760/60000 (10%)] Loss: 0.208878
Train Epoch: 1 [6400/60000 (11%)] Loss: 0.279527
Train Epoch: 1 [7040/60000 (12%)] Loss: 0.327207
Train Epoch: 1 [7680/60000 (13%)] Loss: 0.204888
Train Epoch: 1 [8320/60000 (14%)] Loss: 0.220855
Train Epoch: 1 [8960/60000 (15%)] Loss: 0.273643
Train Epoch: 1 [9600/60000 (16%)] Loss: 0.097318
Train Epoch: 1 [10240/60000 (17%)] Loss: 0.248318
Train Epoch: 1 [10880/60000 (18%)] Loss: 0.112893
Train Epoch: 1 [11520/60000 (19%)] Loss: 0.439383
Train Epoch: 1 [12160/60000 (20%)] Loss: 0.244582
Train Epoch: 1 [12800/60000 (21%)] Loss: 0.245529
Train Epoch: 1 [13440/60000 (22%)] Loss: 0.221483
Train Epoch: 1 [14080/60000 (23%)] Loss: 0.157298
Train Epoch: 1 [14720/60000 (25%)] Loss: 0.418896
Train Epoch: 1 [15360/60000 (26%)] Loss: 0.168725
Train Epoch: 1 [16000/60000 (27%)] Loss: 0.110782
=== RUN TestMPIJobPytorchTraining/multi-node
W0801 07:34:58.583447 14852 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Launcher.template.metadata.creationTimestamp"
W0801 07:34:58.583460 14852 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Worker.template.metadata.creationTimestamp"
=== RUN TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds
mpi_test.go:123: context deadline exceeded
=== NAME TestMPIJobPytorchTraining/multi-node
mpi_test.go:132: no pods found for job multi-node-nccl-test
--- FAIL: TestMPIJobPytorchTraining (47.59s)
--- FAIL: TestMPIJobPytorchTraining/single-node (47.33s)
--- FAIL: TestMPIJobPytorchTraining/single-node/MPIJob_succeeds (46.81s)
--- FAIL: TestMPIJobPytorchTraining/multi-node (0.26s)
--- FAIL: TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds (0.00s)
=== RUN TestSingleNodeUnitTest
=== RUN TestSingleNodeUnitTest/unit-test
=== RUN TestSingleNodeUnitTest/unit-test/Unit_test_Job_succeeds
unit_test.go:56: context deadline exceeded
=== NAME TestSingleNodeUnitTest/unit-test
unit_test.go:65: container "unit-test-container" in pod "unit-test-job-v8wnm" is waiting to start: ContainerCreating
--- FAIL: TestSingleNodeUnitTest (0.32s)
--- FAIL: TestSingleNodeUnitTest/unit-test (0.32s)
--- FAIL: TestSingleNodeUnitTest/unit-test/Unit_test_Job_succeeds (0.00s)
FAIL
FAIL github.com/aws/aws-k8s-tester/e2e2/test/cases/nvidia 62.397s
FAIL
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.