NVIDIA / cloud-native-stack

Run cloud native workloads on NVIDIA GPUs

Failed to setup NVIDIA Cloud Native Stack v11.0 for Developers #48

Closed · yin19941005 closed this 5 months ago

yin19941005 commented 5 months ago

Hello,

I am trying to set up NVIDIA Cloud Native Stack v11.0 for Developers on AWS with Ubuntu 22.04. I followed the install guide, but the GPU Operator doesn't start correctly.

Platform: AWS
Instance type: g4dn.4xlarge
OS: Ubuntu 22.04 (AMI: ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20231207)

Once I finished the Installing GPU Operator step and started verifying the GPU Operator, I got the following error message:

ubuntu@ip-172-31-34-87:~$ helm -n nvidia-gpu-operator ls
NAME                    NAMESPACE               REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
gpu-operator-1702604969 nvidia-gpu-operator     1               2023-12-15 01:49:30.08757447 +0000 UTC  deployed        gpu-operator-v23.9.1    v23.9.1

The gpu-operator is not functioning:

ubuntu@ip-172-31-34-87:~$ kubectl -n nvidia-gpu-operator get all
NAME                                                                  READY   STATUS                  RESTARTS        AGE
pod/gpu-feature-discovery-5hcg5                                       0/1     Init:0/1                0               16m
pod/gpu-operator-1702604969-node-feature-discovery-gc-f7bc979f2cs2l   1/1     Running                 0               16m
pod/gpu-operator-1702604969-node-feature-discovery-master-86c8f2rwx   1/1     Running                 0               16m
pod/gpu-operator-1702604969-node-feature-discovery-worker-pth4p       1/1     Running                 0               16m
pod/gpu-operator-75fb9db9fd-92ljm                                     1/1     Running                 0               16m
pod/nvidia-dcgm-exporter-vb9lm                                        0/1     Init:0/1                0               16m
pod/nvidia-device-plugin-daemonset-fmv95                              0/1     Init:0/1                0               16m
pod/nvidia-operator-validator-cst7k                                   0/1     Init:CrashLoopBackOff   7 (4m54s ago)   16m

I tried launching another instance, but it ended with the same result. I then tried printing the logs:

ubuntu@ip-172-31-47-239:~$ kubectl logs -n nvidia-gpu-operator pod/nvidia-container-toolkit-daemonset-wp6kj
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
Error from server (BadRequest): container "nvidia-container-toolkit-ctr" in pod "nvidia-container-toolkit-daemonset-wp6kj" is terminated
ubuntu@ip-172-31-47-239:~$
ubuntu@ip-172-31-47-239:~$ kubectl logs -n nvidia-gpu-operator pod/nvidia-operator-validator-45xxh
Defaulted container "nvidia-operator-validator" out of: nvidia-operator-validator, driver-validation (init), toolkit-validation (init), cuda-validation (init), plugin-validation (init)
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-45xxh" is terminated

I had previously followed another install guide for Ubuntu Server, which does not install the CUDA driver, and that works. What have I missed in this guide? I also tried matching the CUDA driver version (the version that nvidia-smi prints) when running the helm install command for the GPU Operator, but that didn't help either.
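For reference, here is roughly the kind of helm invocation I mean. This is only a sketch (it assumes the nvidia Helm repo is already added and uses the chart's driver.enabled / driver.version values), not the guide's literal command:

# Sketch only: install the operator against the host-installed 545.23.08 driver
helm install --wait --generate-name \
  -n nvidia-gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false
# or, to let the operator deploy a driver matching nvidia-smi:
#   --set driver.version=545.23.08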

angudadevops commented 5 months ago

@yin19941005 can you provide the output of the commands below?

nvidia-smi
kubectl describe pod  -n nvidia-gpu-operator nvidia-operator-validator-cst7k

Did you reboot the server after you installed the CUDA/NVIDIA driver?

yin19941005 commented 5 months ago

Hello @angudadevops,

Yes, I did reboot the server after installing the NVIDIA driver. Here is the nvidia-smi output:

ubuntu@ip-172-31-12-242:~$ nvidia-smi
Sun Jan 21 07:15:41 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   24C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

The kubectl describe pod output:

ubuntu@ip-172-31-12-242:~$ kubectl describe pod  -n nvidia-gpu-operator nvidia-operator-validator-tmr65
Name:                 nvidia-operator-validator-tmr65
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-operator-validator
Node:                 ip-172-31-12-242/172.31.12.242
Start Time:           Sun, 21 Jan 2024 07:34:11 +0000
Labels:               app=nvidia-operator-validator
                      app.kubernetes.io/managed-by=gpu-operator
                      app.kubernetes.io/part-of=gpu-operator
                      controller-revision-hash=656bd5c76b
                      helm.sh/chart=gpu-operator-v23.9.1
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: d74744821e914b24284adae9122a0ed9d3ef96b94b8ae4995628a0107b6f3ca6
                      cni.projectcalico.org/podIP: 192.168.34.69/32
                      cni.projectcalico.org/podIPs: 192.168.34.69/32
Status:               Pending
IP:                   192.168.34.69
IPs:
  IP:           192.168.34.69
Controlled By:  DaemonSet/nvidia-operator-validator
Init Containers:
  driver-validation:
    Container ID:  cri-o://d53e893e8365259509658518c37c6288eb5ecb1de270ba0bb516afb00ed410d1
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:549ec806717ecd832a1dd219d3cb671024d005df0cfd54269441d21a0083ee51
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sun, 21 Jan 2024 07:37:06 +0000
      Finished:     Sun, 21 Jan 2024 07:37:07 +0000
    Ready:          False
    Restart Count:  5
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bqzkh (ro)
  toolkit-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
      WITH_WAIT:               false
      COMPONENT:               toolkit
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bqzkh (ro)
  cuda-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      WITH_WAIT:                    false
      COMPONENT:                    cuda
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           nvidia-gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bqzkh (ro)
  plugin-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      COMPONENT:                    plugin
      WITH_WAIT:                    false
      WITH_WORKLOAD:                false
      MIG_STRATEGY:                 single
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           nvidia-gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bqzkh (ro)
Containers:
  nvidia-operator-validator:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      echo all validations are successful; sleep infinity
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bqzkh (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:
  kube-api-access-bqzkh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.operator-validator=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m2s                   default-scheduler  Successfully assigned nvidia-gpu-operator/nvidia-operator-validator-tmr65 to ip-172-31-12-242
  Normal   Pulling    4m1s                   kubelet            Pulling image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1"
  Normal   Pulled     3m58s                  kubelet            Successfully pulled image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" in 2.846s (2.846s including waiting)
  Normal   Created    2m37s (x5 over 3m58s)  kubelet            Created container driver-validation
  Normal   Started    2m37s (x5 over 3m58s)  kubelet            Started container driver-validation
  Normal   Pulled     2m37s (x4 over 3m57s)  kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
  Warning  BackOff    2m22s (x9 over 3m56s)  kubelet            Back-off restarting failed container driver-validation in pod nvidia-operator-validator-tmr65_nvidia-gpu-operator(6148bb7e-3695-438e-9c5a-1817947a35be)
ubuntu@ip-172-31-12-242:~$

kubectl get pods --all-namespaces | grep -v kube-system output:

ubuntu@ip-172-31-12-242:~$ kubectl get pods --all-namespaces | grep -v kube-system
NAMESPACE             NAME                                                              READY   STATUS       RESTARTS       AGE
nvidia-gpu-operator   gpu-feature-discovery-lp6l2                                       0/1     Init:0/1     0              3m7s
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-gc-8b4757c8np8qh   1/1     Running      0              3m20s
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-master-fffb6xn62   1/1     Running      0              3m20s
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-worker-l7xnc       1/1     Running      0              3m20s
nvidia-gpu-operator   gpu-operator-6b7d7ffcb5-4qzq5                                     1/1     Running      0              3m20s
nvidia-gpu-operator   nvidia-dcgm-exporter-4mcl2                                        0/1     Init:0/1     0              3m7s
nvidia-gpu-operator   nvidia-device-plugin-daemonset-9bdgz                              0/1     Init:0/1     0              3m7s
nvidia-gpu-operator   nvidia-operator-validator-tmr65                                   0/1     Init:Error   5 (101s ago)   3m7s
ubuntu@ip-172-31-12-242:~$

And I did make sure that Docker is running with the nvidia runtime:

ubuntu@ip-172-31-12-242:~$ sudo docker info | grep -i runtime
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: nvidia
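Since the describe output above shows cri-o:// container IDs, the runtime the kubelet itself reports may also be worth a look; for example:

# The CONTAINER-RUNTIME column shows which runtime Kubernetes actually uses on the node
kubectl get nodes -o wide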

By the way, I found the developer guide is missing the command sudo mkdir -p /usr/share/keyrings in the Installing CRI-O (Option 2) section, which is required for the installation.
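For context, the place I mean is right before the repository key is fetched; roughly like this (the key URL is elided here, please take it from the guide; the keyring file name follows the usual CRI-O packaging instructions):

# Create the keyring directory the guide assumes exists, then store the CRI-O repo key in it
sudo mkdir -p /usr/share/keyrings
curl -fsSL <cri-o-repo-Release.key-URL> | sudo gpg --dearmor -o /usr/share/keyrings/libcontainers-archive-keyring.gpg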

angudadevops commented 5 months ago

That's strange! Can you also provide the logs from the command below:

kubectl logs  -n nvidia-gpu-operator nvidia-operator-validator-tmr65 -c driver-validation

Quick question: is there any way you can provide SSH access to this machine so we can debug?

Thanks for the input, I will update the docs

yin19941005 commented 5 months ago

Hello,

Thank you for helping! The output of kubectl logs -n nvidia-gpu-operator nvidia-operator-validator-tmr65 -c driver-validation:

ubuntu@ip-172-31-12-242:~$ kubectl logs  -n nvidia-gpu-operator nvidia-operator-validator-tmr65 -c driver-validation
time="2024-01-23T23:22:07Z" level=info msg="version: 8072420d"
time="2024-01-23T23:22:07Z" level=info msg="Detected pre-installed driver on the host"
running command chroot with args [/host nvidia-smi]
Tue Jan 23 23:22:07 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   19C    P8               8W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
time="2024-01-23T23:22:07Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2024-01-23T23:22:08Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to create NVIDIA device nodes: failed to create device node nvidiactl: failed to determine major: invalid device node\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n    https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n    validator:\n      driver:\n        env:\n        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n          value: \"true\""

I am not sure whether I can provide SSH access to the AWS instance; let me check with my colleague. But the issue is quite easy to reproduce: launch a new g4dn instance, follow the developer guide, and the issue will occur.

angudadevops commented 5 months ago

OK, that makes sense; this depends on the GPU config. You can run the command below to fix it.

kubectl get clusterpolicy cluster-policy -o yaml | sed "/validator:/a\    driver:\n      env:\n      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n        value: \"true\"" | kubectl apply -f -
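
If you prefer not to pipe through sed, a merge patch should be equivalent (assuming the ClusterPolicy CRD accepts it; same env var, same intended effect):

# Patch the validator's driver env directly on the ClusterPolicy (sketch)
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'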
yin19941005 commented 5 months ago

Hello,

I just tried your fix, but with no luck. The following is the output:

ubuntu@ip-172-31-12-242:~$ kubectl get pods --all-namespaces | grep -v kube-system
NAMESPACE             NAME                                                              READY   STATUS                  RESTARTS        AGE
nvidia-gpu-operator   gpu-feature-discovery-lp6l2                                       0/1     Init:0/1                1               3d11h
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-gc-8b4757c8np8qh   1/1     Running                 1               3d11h
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-master-fffb6xn62   1/1     Running                 1               3d11h
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-worker-l7xnc       1/1     Running                 1               3d11h
nvidia-gpu-operator   gpu-operator-6b7d7ffcb5-4qzq5                                     1/1     Running                 1               3d11h
nvidia-gpu-operator   nvidia-dcgm-exporter-4mcl2                                        0/1     Init:0/1                1               3d11h
nvidia-gpu-operator   nvidia-device-plugin-daemonset-9bdgz                              0/1     Init:0/1                1               3d11h
nvidia-gpu-operator   nvidia-operator-validator-tmr65                                   0/1     Init:CrashLoopBackOff   761 (39s ago)   3d11h
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$ kubectl get clusterpolicy cluster-policy -o yaml | sed "/validator:/a\    driver:\n      env:\n      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n        value: \"true\"" | kubectl apply -f -'
>
> ^C
ubuntu@ip-172-31-12-242:~$ kubectl get pods --all-namespaces | grep -v kube-system
NAMESPACE             NAME                                                              READY   STATUS                  RESTARTS        AGE
nvidia-gpu-operator   gpu-feature-discovery-lp6l2                                       0/1     Init:0/1                1               3d11h
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-gc-8b4757c8np8qh   1/1     Running                 1               3d11h
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-master-fffb6xn62   1/1     Running                 1               3d11h
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-worker-l7xnc       1/1     Running                 1               3d11h
nvidia-gpu-operator   gpu-operator-6b7d7ffcb5-4qzq5                                     1/1     Running                 1               3d11h
nvidia-gpu-operator   nvidia-dcgm-exporter-4mcl2                                        0/1     Init:0/1                1               3d11h
nvidia-gpu-operator   nvidia-device-plugin-daemonset-9bdgz                              0/1     Init:0/1                1               3d11h
nvidia-gpu-operator   nvidia-operator-validator-tmr65                                   0/1     Init:CrashLoopBackOff   761 (70s ago)   3d11h

Am I supposed to input something after running that command? Or should I remove the GPU Operator and reinstall it after running the command?

angudadevops commented 5 months ago

@yin19941005 there was a stray ' at the end of the command; I have updated it. Please try with the updated command.

yin19941005 commented 5 months ago

Hello,

Thank you for helping! It looks like that fixed part of the issue; the output is as follows:

ubuntu@ip-172-31-12-242:~$ kubectl get clusterpolicy cluster-policy -o yaml | sed "/validator:/a\    driver:\n      env:\n      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n        value: \"true\"" | kubectl apply -f -
Warning: resource clusterpolicies/cluster-policy is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
clusterpolicy.nvidia.com/cluster-policy configured

But it seems we got a new error from toolkit validation when I checked with kubectl get pods --all-namespaces | grep -v kube-system and kubectl describe pod -n nvidia-gpu-operator nvidia-operator-validator-h8rn2:

ubuntu@ip-172-31-12-242:~$ kubectl get pods --all-namespaces | grep -v kube-system
NAMESPACE             NAME                                                              READY   STATUS       RESTARTS      AGE
nvidia-gpu-operator   gpu-feature-discovery-lp6l2                                       0/1     Init:0/1     2             3d12h
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-gc-8b4757c8np8qh   1/1     Running      2             3d12h
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-master-fffb6xn62   1/1     Running      2             3d12h
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-worker-l7xnc       1/1     Running      2             3d12h
nvidia-gpu-operator   gpu-operator-6b7d7ffcb5-4qzq5                                     1/1     Running      2             3d12h
nvidia-gpu-operator   nvidia-dcgm-exporter-4mcl2                                        0/1     Init:0/1     2             3d12h
nvidia-gpu-operator   nvidia-device-plugin-daemonset-9bdgz                              0/1     Init:0/1     2             3d12h
nvidia-gpu-operator   nvidia-operator-validator-h8rn2                                   0/1     Init:Error   3 (36s ago)   52s
ubuntu@ip-172-31-12-242:~$ kubectl get pods --all-namespaces | grep -v kube-system
NAMESPACE             NAME                                                              READY   STATUS       RESTARTS      AGE
nvidia-gpu-operator   gpu-feature-discovery-lp6l2                                       0/1     Init:0/1     2             3d12h
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-gc-8b4757c8np8qh   1/1     Running      2             3d12h
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-master-fffb6xn62   1/1     Running      2             3d12h
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-worker-l7xnc       1/1     Running      2             3d12h
nvidia-gpu-operator   gpu-operator-6b7d7ffcb5-4qzq5                                     1/1     Running      2             3d12h
nvidia-gpu-operator   nvidia-dcgm-exporter-4mcl2                                        0/1     Init:0/1     2             3d12h
nvidia-gpu-operator   nvidia-device-plugin-daemonset-9bdgz                              0/1     Init:0/1     2             3d12h
nvidia-gpu-operator   nvidia-operator-validator-h8rn2                                   0/1     Init:Error   3 (36s ago)   52s
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$ kubectl logs  -n nvidia-gpu-operator nvidia-operator-validator-h8rn2 -c driver-validation
time="2024-01-24T20:11:21Z" level=info msg="version: 8072420d"
time="2024-01-24T20:11:21Z" level=info msg="Detected pre-installed driver on the host"
running command chroot with args [/host nvidia-smi]
Wed Jan 24 20:11:21 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   22C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
ubuntu@ip-172-31-12-242:~$ kubectl describe pod  -n nvidia-gpu-operator nvidia-operator-validator-h8rn2
Name:                 nvidia-operator-validator-h8rn2
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-operator-validator
Node:                 ip-172-31-12-242/172.31.12.242
Start Time:           Wed, 24 Jan 2024 20:11:20 +0000
Labels:               app=nvidia-operator-validator
                      app.kubernetes.io/managed-by=gpu-operator
                      app.kubernetes.io/part-of=gpu-operator
                      controller-revision-hash=686f79ffdd
                      helm.sh/chart=gpu-operator-v23.9.1
                      pod-template-generation=2
Annotations:          cni.projectcalico.org/containerID: 6cf0018aaf4aa1fad0906fc79d7e6325bfc2df15c9308ea562c143af25955fbf
                      cni.projectcalico.org/podIP: 192.168.34.95/32
                      cni.projectcalico.org/podIPs: 192.168.34.95/32
Status:               Pending
IP:                   192.168.34.95
IPs:
  IP:           192.168.34.95
Controlled By:  DaemonSet/nvidia-operator-validator
Init Containers:
  driver-validation:
    Container ID:  cri-o://4433ba7eafb656b1006090e77de2a1f9afcf8202672ff18ac8e74707728a28f4
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:549ec806717ecd832a1dd219d3cb671024d005df0cfd54269441d21a0083ee51
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 24 Jan 2024 20:11:21 +0000
      Finished:     Wed, 24 Jan 2024 20:11:21 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:                          true
      COMPONENT:                          driver
      DISABLE_DEV_CHAR_SYMLINK_CREATION:  true
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mhbxq (ro)
  toolkit-validation:
    Container ID:  cri-o://af803f0d8c3413c47e436927aeaf215390baa366086da181281d0f4c83bee4f6
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:549ec806717ecd832a1dd219d3cb671024d005df0cfd54269441d21a0083ee51
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 24 Jan 2024 20:12:51 +0000
      Finished:     Wed, 24 Jan 2024 20:12:51 +0000
    Ready:          False
    Restart Count:  4
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
      WITH_WAIT:               false
      COMPONENT:               toolkit
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mhbxq (ro)
  cuda-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      WITH_WAIT:                    false
      COMPONENT:                    cuda
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           nvidia-gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mhbxq (ro)
  plugin-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      COMPONENT:                    plugin
      WITH_WAIT:                    false
      WITH_WORKLOAD:                false
      MIG_STRATEGY:                 single
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           nvidia-gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mhbxq (ro)
Containers:
  nvidia-operator-validator:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      echo all validations are successful; sleep infinity
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mhbxq (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:
  kube-api-access-mhbxq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.operator-validator=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  2m16s                default-scheduler  Successfully assigned nvidia-gpu-operator/nvidia-operator-validator-h8rn2 to ip-172-31-12-242
  Normal   Pulled     2m16s                kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
  Normal   Created    2m15s                kubelet            Created container driver-validation
  Normal   Started    2m15s                kubelet            Started container driver-validation
  Normal   Started    95s (x4 over 2m15s)  kubelet            Started container toolkit-validation
  Warning  BackOff    57s (x8 over 2m13s)  kubelet            Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-h8rn2_nvidia-gpu-operator(a36ca846-a5e0-487d-ba59-7b90173a303e)
  Normal   Pulled     45s (x5 over 2m15s)  kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
  Normal   Created    45s (x5 over 2m15s)  kubelet            Created container toolkit-validation
angudadevops commented 5 months ago

Can you share the logs from the command below:

kubectl logs -n nvidia-gpu-operator nvidia-operator-validator-h8rn2 -c toolkit-validation

Is there any way we can SSH to this machine to debug the issue?

yin19941005 commented 5 months ago

Hello,

The output of kubectl logs -n nvidia-gpu-operator nvidia-operator-validator-h8rn2 -c toolkit-validation:

ubuntu@ip-172-31-12-242:~$ kubectl logs -n nvidia-gpu-operator nvidia-operator-validator-h8rn2 -c toolkit-validation
time="2024-01-24T22:19:57Z" level=info msg="version: 8072420d"
toolkit is not ready
time="2024-01-24T22:19:57Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
ubuntu@ip-172-31-12-242:~$ nvidia-smi
Wed Jan 24 22:20:58 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   21C    P8               8W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

nvidia-smi is definitely working on the instance, which is interesting.
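
In case it helps, here is a quick host-side check of the toolkit pieces the validator relies on (paths assumed from a default nvidia-container-toolkit install):

# Confirm the toolkit's runtime hook and CLI are present on the host
ls -l /usr/bin/nvidia-container-runtime-hook
nvidia-container-cli --version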

Is there any way we can SSH to this machine to debug the issue?

Yes, I just confirmed with my colleague: I can share SSH access with you. If you share your AWS public key and the IP address you will use to connect to the instance, I can add your key to the instance and add the IP to an AWS security group to allow it to connect. You may generate a new key pair and simply discard it after this use.
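For example, a throwaway key could be generated with something like this (file name and comment are arbitrary):

# Generate a disposable key pair just for this debugging session; share only the .pub file
ssh-keygen -t ed25519 -N "" -C "cns-debug" -f ./cns-debug-key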

As an alternative, I could generate a new key pair for you and share the private key, but I don't think sharing a private key is good practice.

angudadevops commented 5 months ago

Can you try deleting the pod with kubectl delete po -n nvidia-gpu-operator nvidia-operator-validator-h8rn2 and check whether the issue is still the same?

Meanwhile, I will try to replicate this on my AWS instance; depending on that, I will share my details for remote debugging.

yin19941005 commented 5 months ago

Hello,

The issue still persists after deleting the pod:

ubuntu@ip-172-31-12-242:~$ kubectl delete po -n nvidia-gpu-operator nvidia-operator-validator-h8rn2
pod "nvidia-operator-validator-h8rn2" deleted
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$ kubectl logs -n nvidia-gpu-operator nvidia-operator-validator-h8rn2 -c toolkit-validation
Error from server (NotFound): pods "nvidia-operator-validator-h8rn2" not found
ubuntu@ip-172-31-12-242:~$ kubectl get pods --all-namespaces | grep -v kube-system
NAMESPACE             NAME                                                              READY   STATUS                  RESTARTS      AGE
nvidia-gpu-operator   gpu-feature-discovery-lp6l2                                       0/1     Init:0/1                2             3d15h
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-gc-8b4757c8np8qh   1/1     Running                 2             3d15h
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-master-fffb6xn62   1/1     Running                 2             3d15h
nvidia-gpu-operator   gpu-operator-1705822437-node-feature-discovery-worker-l7xnc       1/1     Running                 2             3d15h
nvidia-gpu-operator   gpu-operator-6b7d7ffcb5-4qzq5                                     1/1     Running                 2             3d15h
nvidia-gpu-operator   nvidia-dcgm-exporter-4mcl2                                        0/1     Init:0/1                2             3d15h
nvidia-gpu-operator   nvidia-device-plugin-daemonset-9bdgz                              0/1     Init:0/1                2             3d15h
nvidia-gpu-operator   nvidia-operator-validator-npb77                                   0/1     Init:CrashLoopBackOff   1 (10s ago)   12s
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$ kubectl describe pod  -n nvidia-gpu-operator nvidia-operator-validator-npb77
Name:                 nvidia-operator-validator-npb77
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-operator-validator
Node:                 ip-172-31-12-242/172.31.12.242
Start Time:           Wed, 24 Jan 2024 23:11:40 +0000
Labels:               app=nvidia-operator-validator
                      app.kubernetes.io/managed-by=gpu-operator
                      app.kubernetes.io/part-of=gpu-operator
                      controller-revision-hash=686f79ffdd
                      helm.sh/chart=gpu-operator-v23.9.1
                      pod-template-generation=2
Annotations:          cni.projectcalico.org/containerID: 779ed2e318ee2eeec9723f23a628ba8ba7e1b6c542f4370765686cdb86f76499
                      cni.projectcalico.org/podIP: 192.168.34.96/32
                      cni.projectcalico.org/podIPs: 192.168.34.96/32
Status:               Pending
IP:                   192.168.34.96
IPs:
  IP:           192.168.34.96
Controlled By:  DaemonSet/nvidia-operator-validator
Init Containers:
  driver-validation:
    Container ID:  cri-o://7d0ec5a4b6ddbce82afc7a1042821ee97dbf9ff5efab015a31f06e23ecf6181f
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:549ec806717ecd832a1dd219d3cb671024d005df0cfd54269441d21a0083ee51
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 24 Jan 2024 23:11:41 +0000
      Finished:     Wed, 24 Jan 2024 23:11:41 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:                          true
      COMPONENT:                          driver
      DISABLE_DEV_CHAR_SYMLINK_CREATION:  true
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xlgvj (ro)
  toolkit-validation:
    Container ID:  cri-o://cfd4d22426e6c43785705f20beb8c7ce9e80af955dcd9c8c7f2f96932f4ee8ea
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:549ec806717ecd832a1dd219d3cb671024d005df0cfd54269441d21a0083ee51
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 24 Jan 2024 23:11:56 +0000
      Finished:     Wed, 24 Jan 2024 23:11:56 +0000
    Ready:          False
    Restart Count:  2
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
      WITH_WAIT:               false
      COMPONENT:               toolkit
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xlgvj (ro)
  cuda-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      WITH_WAIT:                    false
      COMPONENT:                    cuda
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           nvidia-gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xlgvj (ro)
  plugin-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      COMPONENT:                    plugin
      WITH_WAIT:                    false
      WITH_WORKLOAD:                false
      MIG_STRATEGY:                 single
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           nvidia-gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xlgvj (ro)
Containers:
  nvidia-operator-validator:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      echo all validations are successful; sleep infinity
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xlgvj (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:
  kube-api-access-xlgvj:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.operator-validator=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  37s                default-scheduler  Successfully assigned nvidia-gpu-operator/nvidia-operator-validator-npb77 to ip-172-31-12-242
  Normal   Pulled     36s                kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
  Normal   Created    36s                kubelet            Created container driver-validation
  Normal   Started    36s                kubelet            Started container driver-validation
  Normal   Pulled     21s (x3 over 36s)  kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
  Normal   Created    21s (x3 over 36s)  kubelet            Created container toolkit-validation
  Normal   Started    21s (x3 over 36s)  kubelet            Started container toolkit-validation
  Warning  BackOff    6s (x4 over 34s)   kubelet            Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-npb77_nvidia-gpu-operator(909b0d55-942c-4815-94d7-d10bcc92cbd2)

Meanwhile, I will try to replicate this on my AWS instance; depending on that, I will share my details for remote debugging.

Appreciated! Thank you!

angudadevops commented 5 months ago

@yin19941005 I just spun up an AWS EC2 instance with Ubuntu 22.04, followed the steps in the docs, and it is working, except that this command needs to be run after installing the GPU Operator:

kubectl get clusterpolicy cluster-policy -o yaml | sed "/validator:/a\    driver:\n      env:\n      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n        value: \"true\"" | kubectl apply -f -

I can see all pods are running:

ubuntu@ip-172-31-14-20:~$ kubectl get pods -n nvidia-gpu-operator
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-zkkbx                                       1/1     Running     0          100s
gpu-operator-1706199284-node-feature-discovery-gc-5c874bbbr647s   1/1     Running     0          113s
gpu-operator-1706199284-node-feature-discovery-master-755bmc5v4   1/1     Running     0          113s
gpu-operator-1706199284-node-feature-discovery-worker-p6gzv       1/1     Running     0          113s
gpu-operator-5f4dfcdf9-dxlq7                                      1/1     Running     0          113s
nvidia-cuda-validator-7zphl                                       0/1     Completed   0          74s
nvidia-dcgm-exporter-99htt                                        1/1     Running     0          100s
nvidia-device-plugin-daemonset-fjvgm                              1/1     Running     0          100s
nvidia-operator-validator-czktn                                   1/1     Running     0          77s
yin19941005 commented 5 months ago

@angudadevops, thank you for helping! I am not permitted to change the firewall rule (security group) of the instance to allow all public IPs. Is it possible to share your instance's Elastic IP, so that I can allow traffic from that IP and you can connect to the instance?

Let me share my bash history with you, for your reference: bash_history_01.txt

angudadevops commented 5 months ago

@yin19941005 I had tried with containerd, not CRI-O. After going through the installation, I found that there is an additional step that needs to be added. Thanks for the catch; I will update the docs for CRI-O with the steps below.

https://github.com/NVIDIA/cloud-native-stack/blob/master/install-guides/Ubuntu-22-04_Server_Developer-x86-arm64_v11.0.md#installing-cri-ooption-2

Run the below commands after the Install the CRI-O and dependencies step.

sudo nano /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
{
    "version": "1.0.0",
    "hook": {
        "path": "/usr/bin/nvidia-container-runtime-hook",
        "args": [
            "nvidia-container-runtime-hook",
            "prestart"
        ],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
        ]
    },
    "when": {
        "always": true,
        "commands": [
            ".*"
        ]
    },
    "stages": [
        "prestart"
    ]
}
sudo systemctl daemon-reload && sudo systemctl restart crio.service 
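
Optionally, a quick sanity check after writing the file (jq is just one way to validate the JSON; any validator works):

# Validate the hook JSON and confirm crio restarted cleanly
jq . /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
sudo systemctl status crio.service --no-pager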
yin19941005 commented 5 months ago

@angudadevops, thank you for the quick response! The fix is working! Much appreciated! I will close this issue.

FYI, the following are the changes I made:

  1. Add the oci-nvidia-hook.json after the Install the CRI-O and dependencies step, as below:
sudo nano /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
{
    "version": "1.0.0",
    "hook": {
        "path": "/usr/bin/nvidia-container-runtime-hook",
        "args": [
            "nvidia-container-runtime-hook",
            "prestart"
        ],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
        ]
    },
    "when": {
        "always": true,
        "commands": [
            ".*"
        ]
    },
    "stages": [
        "prestart"
    ]
}
sudo systemctl daemon-reload && sudo systemctl restart crio.service
  2. Disable the symlink creation of the validator after installing the GPU Operator:
    kubectl get clusterpolicy cluster-policy -o yaml | sed "/validator:/a\    driver:\n      env:\n      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n        value: \"true\"" | kubectl apply -f -
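
A quick way to confirm the stack settles after both changes (the label is taken from the validator pod spec above):

# Wait for the operator validator to become Ready, then list the pods
kubectl -n nvidia-gpu-operator wait --for=condition=Ready pod -l app=nvidia-operator-validator --timeout=600s
kubectl get pods -n nvidia-gpu-operator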