aws / aws-k8s-tester

AWS Kubernetes tester, kubetest2 deployer implementation
Apache License 2.0

Fix Nvidia Image build #489

Issacwww closed 1 week ago

Issacwww commented 2 weeks ago

Issue #, if available:

Description of changes:

Issue 1: The image build failed in CodeBuild:

Step 17/26 : ARG NCCL_VERSION=2.22.3-1+cuda${CUDA_MAJOR_VERSION}.${CUDA_MINOR_VERSION}
 ---> Running in 1a06f4ac470e
Removing intermediate container 1a06f4ac470e
 ---> 07cde702fb9b
Step 18/26 : RUN apt update   && apt install -y     libnccl2=${NCCL_VERSION}      libnccl-dev=${NCCL_VERSION}
...
E: Version '2.22.3-1' for 'libnccl2' was not found
E: Version '2.22.3-1' for 'libnccl-dev' was not found
The command '/bin/sh -c apt update   && apt install -y     libnccl2=${NCCL_VERSION}      libnccl-dev=${NCCL_VERSION}' returned a non-zero code: 100

Temporary fix: hardcode the NCCL version.
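In the log above, apt reports the requested version as `2.22.3-1`, without the `+cuda<major>.<minor>` suffix that the `ARG` is meant to append. A guard like the following could catch an incomplete pin before `apt install` runs. This is only a sketch: `check_nccl_pin` is a hypothetical helper that is not part of this PR, and the CUDA values are example assumptions.

```shell
#!/bin/sh
# Hypothetical guard, not part of this PR: verify that an NCCL apt pin
# carries the "+cuda<major>.<minor>" suffix before handing it to apt.
check_nccl_pin() {
  case "$1" in
    *+cuda[0-9]*.[0-9]*) return 0 ;;
    *)
      echo "incomplete NCCL pin: '$1' (expected <version>+cuda<major>.<minor>)" >&2
      return 1
      ;;
  esac
}

# Example values only; the CUDA version used by the image is an assumption.
CUDA_MAJOR_VERSION=12
CUDA_MINOR_VERSION=5
NCCL_VERSION="2.22.3-1+cuda${CUDA_MAJOR_VERSION}.${CUDA_MINOR_VERSION}"

if check_nccl_pin "$NCCL_VERSION"; then
  echo "pin looks complete: libnccl2=${NCCL_VERSION}"
fi
```

To see which versions the configured repositories actually offer, `apt-cache madison libnccl2` can be run inside the build image.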

Issue 2: The unit test test_nvidia_persistence_status fails on Bottlerocket because persistence mode is not enabled there. An upcoming release will fix that. In the meantime, this change adds a flag for skipping tests, which also gives more flexibility in general.

Tested with the Job below:

---
kind: Job
apiVersion: batch/v1
metadata:
  name: unit-test-job
  labels:
    app: unit-test-job
spec:
  template:
    metadata:
      labels:
        app: unit-test-job
    spec:
      containers:
        - name: unit-test-container
          image: "171391670848.dkr.ecr.us-west-2.amazonaws.com/test-images:nvtest-withSkip"
          command:
            - /bin/bash
            - ./gpu_unit_tests/unit_test
          env:
            - name: SKIP_TESTS_SUBCOMMAND
              value: "-s test_05_dcgm_diagnostics|test_nvidia_persistence_status"
          imagePullPolicy: Always
          resources:
            limits:
              cpu: "4"
              memory: 4Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: "1"
              memory: 1Gi
      restartPolicy: Never
  backoffLimit: 4

Output

k logs unit-test-job-77xkh -f
# Running tests in gpu_unit_tests/tests/test_basic.sh
ok - test_01_device_query
ok - test_02_vector_add
ok - test_03_bandwidth
ok - test_04_bus_grind
ok -  # skip skip pattern: test_05_dcgm_diagnostics|test_nvidia_persistence_status
# Running tests in gpu_unit_tests/tests/test_sysinfo.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:02 --:--:--     0
curl: (56) Recv failure: Connection reset by peer
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    10  100    10    0     0  13717      0 --:--:-- --:--:-- --:--:-- 10000
ok - test_numa_topo_topo
ok - test_nvidia_gpu_count
ok - test_nvidia_gpu_throttled
ok - test_nvidia_gpu_unused
ok -  # skip skip pattern: test_05_dcgm_diagnostics|test_nvidia_persistence_status
ok - test_nvidia_smi_topo
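The skip lines in the output come from the pattern passed through `SKIP_TESTS_SUBCOMMAND`. As a rough illustration of how a `|`-separated pattern can be matched against test names, here is a minimal sketch. It is not the actual `unit_test` implementation: `should_skip` and the variable names are hypothetical, and matching on the full test name is an assumption.

```shell
#!/bin/sh
# Hypothetical sketch, not the actual unit_test implementation.
# Returns 0 when the test name fully matches the "|"-separated skip pattern.
should_skip() {
  name="$1"
  pattern="$2"
  # An empty pattern skips nothing.
  [ -n "$pattern" ] || return 1
  # -E: extended regex (alternation), -x: match the whole line, -q: quiet.
  printf '%s\n' "$name" | grep -Eqx -- "$pattern"
}

# Same value as the env var in the Job manifest.
skip_subcommand='-s test_05_dcgm_diagnostics|test_nvidia_persistence_status'
# Drop the leading "-s " to get the bare pattern.
skip_pattern="${skip_subcommand#-s }"

for t in test_nvidia_gpu_count test_nvidia_persistence_status; do
  if should_skip "$t" "$skip_pattern"; then
    echo "would skip: $t"
  else
    echo "would run: $t"
  fi
done
```

Matching the whole name means a partial string such as `test_nvidia` would not skip every test that starts with it. Whether the real flag behaves that way is not stated in this PR.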

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

bryantbiggs commented 1 week ago

What is the Docker version in CodeBuild, or is it using something else like Finch to build the images?