aws / aws-k8s-tester

AWS Kubernetes tester, kubetest2 deployer implementation
Apache License 2.0

Fix GetJobLogs and the e2e-neuron binary not exiting. #465

Closed weicongw closed 2 months ago

weicongw commented 2 months ago

Issue #, if available:

Description of changes:

Test:

Tested that the logs are always printed, even when the timeout is reached:

go test -timeout 60m -v . -args -nvidiaTestImage public.ecr.aws/o5d5x8n6/weicongw:nvidia
2024/08/01 07:34:11 No node type specified. Using the node type p3.2xlarge in the node groups.
=== RUN   TestMPIJobPytorchTraining
=== RUN   TestMPIJobPytorchTraining/single-node
=== RUN   TestMPIJobPytorchTraining/single-node/MPIJob_succeeds
    mpi_test.go:60: context deadline exceeded
=== NAME  TestMPIJobPytorchTraining/single-node
    mpi_test.go:71: Test log for pytorch-training-single-node:
    mpi_test.go:72: Cloning into '/pytorch-examples'...
        Note: switching to '0f0c9131ca5c79d1332dce1f4c06fe942fbdc665'.

        You are in 'detached HEAD' state. You can look around, make experimental
        changes and commit them, and you can discard any commits you make in this
        state without impacting any branches by switching back to a branch.

        If you want to create a new branch to retain commits you create, you may
        do so (now or later) by using -c with the switch command. Example:

          git switch -c <new-branch-name>

        Or undo this operation with:

          git switch -

        Turn off this advice by setting config variable advice.detachedHead to false

        HEAD is now at 0f0c913 Use regular dropout rather than dropout2d
        Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
        Failed to download (trying next):
        HTTP Error 403: Forbidden

        Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
        Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 9912422/9912422 [00:00<00:00, 123835640.95it/s]
        Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw

        Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
        Failed to download (trying next):
        HTTP Error 403: Forbidden

        Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
        Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████| 28881/28881 [00:00<00:00, 27751590.80it/s]
        Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw

        Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
        Failed to download (trying next):
        HTTP Error 403: Forbidden

        Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz
        Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
100%|██████████| 1648877/1648877 [00:00<00:00, 105775069.92it/s]
        Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw

        Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
        Failed to download (trying next):
        HTTP Error 403: Forbidden

        Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz
        Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|██████████| 4542/4542 [00:00<00:00, 3642548.52it/s]
        Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw

        Train Epoch: 1 [0/60000 (0%)]   Loss: 2.305400
        Train Epoch: 1 [640/60000 (1%)] Loss: 1.359780
        Train Epoch: 1 [1280/60000 (2%)]        Loss: 0.830670
        Train Epoch: 1 [1920/60000 (3%)]        Loss: 0.605961
        Train Epoch: 1 [2560/60000 (4%)]        Loss: 0.345934
        Train Epoch: 1 [3200/60000 (5%)]        Loss: 0.446331
        Train Epoch: 1 [3840/60000 (6%)]        Loss: 0.306768
        Train Epoch: 1 [4480/60000 (7%)]        Loss: 0.279325
        Train Epoch: 1 [5120/60000 (9%)]        Loss: 0.555025
        Train Epoch: 1 [5760/60000 (10%)]       Loss: 0.208878
        Train Epoch: 1 [6400/60000 (11%)]       Loss: 0.279527
        Train Epoch: 1 [7040/60000 (12%)]       Loss: 0.327207
        Train Epoch: 1 [7680/60000 (13%)]       Loss: 0.204888
        Train Epoch: 1 [8320/60000 (14%)]       Loss: 0.220855
        Train Epoch: 1 [8960/60000 (15%)]       Loss: 0.273643
        Train Epoch: 1 [9600/60000 (16%)]       Loss: 0.097318
        Train Epoch: 1 [10240/60000 (17%)]      Loss: 0.248318
        Train Epoch: 1 [10880/60000 (18%)]      Loss: 0.112893
        Train Epoch: 1 [11520/60000 (19%)]      Loss: 0.439383
        Train Epoch: 1 [12160/60000 (20%)]      Loss: 0.244582
        Train Epoch: 1 [12800/60000 (21%)]      Loss: 0.245529
        Train Epoch: 1 [13440/60000 (22%)]      Loss: 0.221483
        Train Epoch: 1 [14080/60000 (23%)]      Loss: 0.157298
        Train Epoch: 1 [14720/60000 (25%)]      Loss: 0.418896
        Train Epoch: 1 [15360/60000 (26%)]      Loss: 0.168725
        Train Epoch: 1 [16000/60000 (27%)]      Loss: 0.110782

=== RUN   TestMPIJobPytorchTraining/multi-node
W0801 07:34:58.583447   14852 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Launcher.template.metadata.creationTimestamp"
W0801 07:34:58.583460   14852 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Worker.template.metadata.creationTimestamp"
=== RUN   TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds
    mpi_test.go:123: context deadline exceeded
=== NAME  TestMPIJobPytorchTraining/multi-node
    mpi_test.go:132: no pods found for job multi-node-nccl-test
--- FAIL: TestMPIJobPytorchTraining (47.59s)
    --- FAIL: TestMPIJobPytorchTraining/single-node (47.33s)
        --- FAIL: TestMPIJobPytorchTraining/single-node/MPIJob_succeeds (46.81s)
    --- FAIL: TestMPIJobPytorchTraining/multi-node (0.26s)
        --- FAIL: TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds (0.00s)
=== RUN   TestSingleNodeUnitTest
=== RUN   TestSingleNodeUnitTest/unit-test
=== RUN   TestSingleNodeUnitTest/unit-test/Unit_test_Job_succeeds
    unit_test.go:56: context deadline exceeded
=== NAME  TestSingleNodeUnitTest/unit-test
    unit_test.go:65: container "unit-test-container" in pod "unit-test-job-v8wnm" is waiting to start: ContainerCreating
--- FAIL: TestSingleNodeUnitTest (0.32s)
    --- FAIL: TestSingleNodeUnitTest/unit-test (0.32s)
        --- FAIL: TestSingleNodeUnitTest/unit-test/Unit_test_Job_succeeds (0.00s)
FAIL
FAIL    github.com/aws/aws-k8s-tester/e2e2/test/cases/nvidia    62.397s
FAIL

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.