Closed: yin19941005 closed this issue 5 months ago.
@yin19941005 can you provide the output of the below commands?
nvidia-smi
kubectl describe pod -n nvidia-gpu-operator nvidia-operator-validator-cst7k
Did you reboot the server after you installed the CUDA/NVIDIA driver?
Hello @angudadevops,
Yes, I rebooted the server after installing the NVIDIA driver. Here is the nvidia-smi output:
ubuntu@ip-172-31-12-242:~$ nvidia-smi
Sun Jan 21 07:15:41 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 24C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
The kubectl describe pod output:
ubuntu@ip-172-31-12-242:~$ kubectl describe pod -n nvidia-gpu-operator nvidia-operator-validator-tmr65
Name: nvidia-operator-validator-tmr65
Namespace: nvidia-gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-operator-validator
Node: ip-172-31-12-242/172.31.12.242
Start Time: Sun, 21 Jan 2024 07:34:11 +0000
Labels: app=nvidia-operator-validator
app.kubernetes.io/managed-by=gpu-operator
app.kubernetes.io/part-of=gpu-operator
controller-revision-hash=656bd5c76b
helm.sh/chart=gpu-operator-v23.9.1
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: d74744821e914b24284adae9122a0ed9d3ef96b94b8ae4995628a0107b6f3ca6
cni.projectcalico.org/podIP: 192.168.34.69/32
cni.projectcalico.org/podIPs: 192.168.34.69/32
Status: Pending
IP: 192.168.34.69
IPs:
IP: 192.168.34.69
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
driver-validation:
Container ID: cri-o://d53e893e8365259509658518c37c6288eb5ecb1de270ba0bb516afb00ed410d1
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:549ec806717ecd832a1dd219d3cb671024d005df0cfd54269441d21a0083ee51
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Sun, 21 Jan 2024 07:37:06 +0000
Finished: Sun, 21 Jan 2024 07:37:07 +0000
Ready: False
Restart Count: 5
Environment:
WITH_WAIT: true
COMPONENT: driver
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-path (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bqzkh (ro)
toolkit-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
NVIDIA_VISIBLE_DEVICES: all
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bqzkh (ro)
cuda-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
WITH_WAIT: false
COMPONENT: cuda
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: nvidia-gpu-operator (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bqzkh (ro)
plugin-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
COMPONENT: plugin
WITH_WAIT: false
WITH_WORKLOAD: false
MIG_STRATEGY: single
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: nvidia-gpu-operator (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bqzkh (ro)
Containers:
nvidia-operator-validator:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
echo all validations are successful; sleep infinity
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bqzkh (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-path:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
kube-api-access-bqzkh:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.operator-validator=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m2s default-scheduler Successfully assigned nvidia-gpu-operator/nvidia-operator-validator-tmr65 to ip-172-31-12-242
Normal Pulling 4m1s kubelet Pulling image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1"
Normal Pulled 3m58s kubelet Successfully pulled image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" in 2.846s (2.846s including waiting)
Normal Created 2m37s (x5 over 3m58s) kubelet Created container driver-validation
Normal Started 2m37s (x5 over 3m58s) kubelet Started container driver-validation
Normal Pulled 2m37s (x4 over 3m57s) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
Warning BackOff 2m22s (x9 over 3m56s) kubelet Back-off restarting failed container driver-validation in pod nvidia-operator-validator-tmr65_nvidia-gpu-operator(6148bb7e-3695-438e-9c5a-1817947a35be)
ubuntu@ip-172-31-12-242:~$
The kubectl get pods --all-namespaces | grep -v kube-system output:
ubuntu@ip-172-31-12-242:~$ kubectl get pods --all-namespaces | grep -v kube-system
NAMESPACE NAME READY STATUS RESTARTS AGE
nvidia-gpu-operator gpu-feature-discovery-lp6l2 0/1 Init:0/1 0 3m7s
nvidia-gpu-operator gpu-operator-1705822437-node-feature-discovery-gc-8b4757c8np8qh 1/1 Running 0 3m20s
nvidia-gpu-operator gpu-operator-1705822437-node-feature-discovery-master-fffb6xn62 1/1 Running 0 3m20s
nvidia-gpu-operator gpu-operator-1705822437-node-feature-discovery-worker-l7xnc 1/1 Running 0 3m20s
nvidia-gpu-operator gpu-operator-6b7d7ffcb5-4qzq5 1/1 Running 0 3m20s
nvidia-gpu-operator nvidia-dcgm-exporter-4mcl2 0/1 Init:0/1 0 3m7s
nvidia-gpu-operator nvidia-device-plugin-daemonset-9bdgz 0/1 Init:0/1 0 3m7s
nvidia-gpu-operator nvidia-operator-validator-tmr65 0/1 Init:Error 5 (101s ago) 3m7s
ubuntu@ip-172-31-12-242:~$
And I made sure Docker is running with the NVIDIA runtime:
ubuntu@ip-172-31-12-242:~$ sudo docker info | grep -i runtime
Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: nvidia
By the way, I found that the developer guide is missing the command sudo mkdir -p /usr/share/keyrings in the Installing CRI-O (Option 2) section, which is required for the installation; see the sketch below.
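For reference, the missing step belongs right before the guide's key-download step, roughly like this (the exact key URL is per the guide and elided here):
# Create the keyring directory before writing repo keys into it
sudo mkdir -p /usr/share/keyrings
# The guide's key-download step then writes into this directory, e.g.:
# curl -fsSL <Release.key URL from the guide> | sudo gpg --dearmor -o /usr/share/keyrings/<repo>-keyring.gpg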
That's strange! Can you also provide the logs from the command below?
kubectl logs -n nvidia-gpu-operator nvidia-operator-validator-tmr65 -c driver-validation
Quick question: is there any way you can provide SSH access to this machine so we can debug?
Thanks for the input; I will update the docs.
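One more note: the container IDs in the describe output above start with cri-o://, so the cluster runtime here is CRI-O, and Docker's default-runtime setting will not affect these pods. A quick way to confirm which runtime the node is actually using (standard kubectl/crictl calls, nothing specific to this setup):
# Which container runtime is the kubelet using on this node?
kubectl get node ip-172-31-12-242 -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}{"\n"}'
# Inspect CRI-O's runtime configuration directly on the host
sudo crictl info | grep -i runtime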
Hello,
Thank you for helping! Here is the log of kubectl logs -n nvidia-gpu-operator nvidia-operator-validator-tmr65 -c driver-validation:
ubuntu@ip-172-31-12-242:~$ kubectl logs -n nvidia-gpu-operator nvidia-operator-validator-tmr65 -c driver-validation
time="2024-01-23T23:22:07Z" level=info msg="version: 8072420d"
time="2024-01-23T23:22:07Z" level=info msg="Detected pre-installed driver on the host"
running command chroot with args [/host nvidia-smi]
Tue Jan 23 23:22:07 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 19C P8 8W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
time="2024-01-23T23:22:07Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2024-01-23T23:22:08Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to create NVIDIA device nodes: failed to create device node nvidiactl: failed to determine major: invalid device node\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n validator:\n driver:\n env:\n - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n value: \"true\""
I am not sure whether I can provide SSH access to the AWS instance; let me check with my colleague. But the issue is quite easy to reproduce: you can launch a new g4dn instance and follow the developer guide, and the issue will occur.
OK, that makes sense; this is based on the GPU configuration. You can run the command below to fix this.
kubectl get clusterpolicy cluster-policy -o yaml | sed "/validator:/a\ driver:\n env:\n - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n value: \"true\"" | kubectl apply -f -
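If the edit applies cleanly, the env var should show up under the validator spec; one way to confirm, with a jsonpath matching the field layout the sed command targets:
kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.validator.driver.env}{"\n"}'
# expected to print something like: [{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]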
Hello,
I just tried your fix but with no luck; the following is the output:
ubuntu@ip-172-31-12-242:~$ kubectl get pods --all-namespaces | grep -v kube-system
NAMESPACE NAME READY STATUS RESTARTS AGE
nvidia-gpu-operator gpu-feature-discovery-lp6l2 0/1 Init:0/1 1 3d11h
nvidia-gpu-operator gpu-operator-1705822437-node-feature-discovery-gc-8b4757c8np8qh 1/1 Running 1 3d11h
nvidia-gpu-operator gpu-operator-1705822437-node-feature-discovery-master-fffb6xn62 1/1 Running 1 3d11h
nvidia-gpu-operator gpu-operator-1705822437-node-feature-discovery-worker-l7xnc 1/1 Running 1 3d11h
nvidia-gpu-operator gpu-operator-6b7d7ffcb5-4qzq5 1/1 Running 1 3d11h
nvidia-gpu-operator nvidia-dcgm-exporter-4mcl2 0/1 Init:0/1 1 3d11h
nvidia-gpu-operator nvidia-device-plugin-daemonset-9bdgz 0/1 Init:0/1 1 3d11h
nvidia-gpu-operator nvidia-operator-validator-tmr65 0/1 Init:CrashLoopBackOff 761 (39s ago) 3d11h
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$ kubectl get clusterpolicy cluster-policy -o yaml | sed "/validator:/a\ driver:\n env:\n - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n value: \"true\"" | kubectl apply -f -'
>
> ^C
ubuntu@ip-172-31-12-242:~$ kubectl get pods --all-namespaces | grep -v kube-system
NAMESPACE NAME READY STATUS RESTARTS AGE
nvidia-gpu-operator gpu-feature-discovery-lp6l2 0/1 Init:0/1 1 3d11h
nvidia-gpu-operator gpu-operator-1705822437-node-feature-discovery-gc-8b4757c8np8qh 1/1 Running 1 3d11h
nvidia-gpu-operator gpu-operator-1705822437-node-feature-discovery-master-fffb6xn62 1/1 Running 1 3d11h
nvidia-gpu-operator gpu-operator-1705822437-node-feature-discovery-worker-l7xnc 1/1 Running 1 3d11h
nvidia-gpu-operator gpu-operator-6b7d7ffcb5-4qzq5 1/1 Running 1 3d11h
nvidia-gpu-operator nvidia-dcgm-exporter-4mcl2 0/1 Init:0/1 1 3d11h
nvidia-gpu-operator nvidia-device-plugin-daemonset-9bdgz 0/1 Init:0/1 1 3d11h
nvidia-gpu-operator nvidia-operator-validator-tmr65 0/1 Init:CrashLoopBackOff 761 (70s ago) 3d11h
Am I supposed to input something after running that command? Or should I remove the GPU Operator and reinstall it after running the command?
@yin19941005 there was a stray ' at the end of the command; I have updated it. Please try the updated one.
Hello,
Thank you for helping! It looks like that fixed part of the issue; the output is as follows:
ubuntu@ip-172-31-12-242:~$ kubectl get clusterpolicy cluster-policy -o yaml | sed "/validator:/a\ driver:\n env:\n - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n value: \"true\"" | kubectl apply -f -
Warning: resource clusterpolicies/cluster-policy is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
clusterpolicy.nvidia.com/cluster-policy configured
But it seems we got a new error from toolkit validation when I check with kubectl get pods --all-namespaces | grep -v kube-system and kubectl describe pod -n nvidia-gpu-operator nvidia-operator-validator-h8rn2:
ubuntu@ip-172-31-12-242:~$ kubectl get pods --all-namespaces | grep -v kube-system
NAMESPACE NAME READY STATUS RESTARTS AGE
nvidia-gpu-operator gpu-feature-discovery-lp6l2 0/1 Init:0/1 2 3d12h
nvidia-gpu-operator gpu-operator-1705822437-node-feature-discovery-gc-8b4757c8np8qh 1/1 Running 2 3d12h
nvidia-gpu-operator gpu-operator-1705822437-node-feature-discovery-master-fffb6xn62 1/1 Running 2 3d12h
nvidia-gpu-operator gpu-operator-1705822437-node-feature-discovery-worker-l7xnc 1/1 Running 2 3d12h
nvidia-gpu-operator gpu-operator-6b7d7ffcb5-4qzq5 1/1 Running 2 3d12h
nvidia-gpu-operator nvidia-dcgm-exporter-4mcl2 0/1 Init:0/1 2 3d12h
nvidia-gpu-operator nvidia-device-plugin-daemonset-9bdgz 0/1 Init:0/1 2 3d12h
nvidia-gpu-operator nvidia-operator-validator-h8rn2 0/1 Init:Error 3 (36s ago) 52s
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$ kubectl logs -n nvidia-gpu-operator nvidia-operator-validator-h8rn2 -c driver-validation
time="2024-01-24T20:11:21Z" level=info msg="version: 8072420d"
time="2024-01-24T20:11:21Z" level=info msg="Detected pre-installed driver on the host"
running command chroot with args [/host nvidia-smi]
Wed Jan 24 20:11:21 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 22C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
ubuntu@ip-172-31-12-242:~$ kubectl describe pod -n nvidia-gpu-operator nvidia-operator-validator-h8rn2
Name: nvidia-operator-validator-h8rn2
Namespace: nvidia-gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-operator-validator
Node: ip-172-31-12-242/172.31.12.242
Start Time: Wed, 24 Jan 2024 20:11:20 +0000
Labels: app=nvidia-operator-validator
app.kubernetes.io/managed-by=gpu-operator
app.kubernetes.io/part-of=gpu-operator
controller-revision-hash=686f79ffdd
helm.sh/chart=gpu-operator-v23.9.1
pod-template-generation=2
Annotations: cni.projectcalico.org/containerID: 6cf0018aaf4aa1fad0906fc79d7e6325bfc2df15c9308ea562c143af25955fbf
cni.projectcalico.org/podIP: 192.168.34.95/32
cni.projectcalico.org/podIPs: 192.168.34.95/32
Status: Pending
IP: 192.168.34.95
IPs:
IP: 192.168.34.95
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
driver-validation:
Container ID: cri-o://4433ba7eafb656b1006090e77de2a1f9afcf8202672ff18ac8e74707728a28f4
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:549ec806717ecd832a1dd219d3cb671024d005df0cfd54269441d21a0083ee51
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 24 Jan 2024 20:11:21 +0000
Finished: Wed, 24 Jan 2024 20:11:21 +0000
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
DISABLE_DEV_CHAR_SYMLINK_CREATION: true
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-path (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mhbxq (ro)
toolkit-validation:
Container ID: cri-o://af803f0d8c3413c47e436927aeaf215390baa366086da181281d0f4c83bee4f6
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:549ec806717ecd832a1dd219d3cb671024d005df0cfd54269441d21a0083ee51
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 24 Jan 2024 20:12:51 +0000
Finished: Wed, 24 Jan 2024 20:12:51 +0000
Ready: False
Restart Count: 4
Environment:
NVIDIA_VISIBLE_DEVICES: all
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mhbxq (ro)
cuda-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
WITH_WAIT: false
COMPONENT: cuda
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: nvidia-gpu-operator (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mhbxq (ro)
plugin-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
COMPONENT: plugin
WITH_WAIT: false
WITH_WORKLOAD: false
MIG_STRATEGY: single
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: nvidia-gpu-operator (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mhbxq (ro)
Containers:
nvidia-operator-validator:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
echo all validations are successful; sleep infinity
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mhbxq (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-path:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
kube-api-access-mhbxq:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.operator-validator=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m16s default-scheduler Successfully assigned nvidia-gpu-operator/nvidia-operator-validator-h8rn2 to ip-172-31-12-242
Normal Pulled 2m16s kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
Normal Created 2m15s kubelet Created container driver-validation
Normal Started 2m15s kubelet Started container driver-validation
Normal Started 95s (x4 over 2m15s) kubelet Started container toolkit-validation
Warning BackOff 57s (x8 over 2m13s) kubelet Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-h8rn2_nvidia-gpu-operator(a36ca846-a5e0-487d-ba59-7b90173a303e)
Normal Pulled 45s (x5 over 2m15s) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
Normal Created 45s (x5 over 2m15s) kubelet Created container toolkit-validation
Can you share the logs from the command below?
kubectl logs -n nvidia-gpu-operator nvidia-operator-validator-h8rn2 -c toolkit-validation
Is there any way we can SSH into this machine to debug the issue?
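In the meantime, it is worth checking whether CRI-O is actually injecting the NVIDIA prestart hook into containers; assuming the standard nvidia-container-toolkit layout, something like:
# Is an NVIDIA OCI hook registered where CRI-O looks for hooks?
ls -l /usr/share/containers/oci/hooks.d/
# Is the hook binary itself installed on the host?
which nvidia-container-runtime-hook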
Hello,
The output of kubectl logs -n nvidia-gpu-operator nvidia-operator-validator-h8rn2 -c toolkit-validation:
ubuntu@ip-172-31-12-242:~$ kubectl logs -n nvidia-gpu-operator nvidia-operator-validator-h8rn2 -c toolkit-validation
time="2024-01-24T22:19:57Z" level=info msg="version: 8072420d"
toolkit is not ready
time="2024-01-24T22:19:57Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
ubuntu@ip-172-31-12-242:~$ nvidia-smi
Wed Jan 24 22:20:58 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 21C P8 8W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
nvidia-smi is definitely working on the instance, interesting.
Is there any way we can SSH into this machine to debug the issue?
Yes, I just confirmed with my colleague: I can share SSH access with you if you share your AWS public key and the IP address you will use to connect to the instance. I will then add your key to the instance and add your IP to an AWS security group to allow the connection. You may generate a new key pair and simply discard it after this use.
Alternatively, I could generate a new key pair for you and share the private key with you, but I don't think sharing a private key is good practice.
Can you try deleting the pod with kubectl delete po -n nvidia-gpu-operator nvidia-operator-validator-h8rn2 and check whether the issue is still the same?
Meanwhile, I will try to replicate this on my own AWS instance; depending on the result, I will share my details for remote debugging.
Hello,
The issue still persists after deleting the pod:
ubuntu@ip-172-31-12-242:~$ kubectl delete po -n nvidia-gpu-operator nvidia-operator-validator-h8rn2
pod "nvidia-operator-validator-h8rn2" deleted
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$ kubectl logs -n nvidia-gpu-operator nvidia-operator-validator-h8rn2 -c toolkit-validation
Error from server (NotFound): pods "nvidia-operator-validator-h8rn2" not found
ubuntu@ip-172-31-12-242:~$ kubectl get pods --all-namespaces | grep -v kube-system
NAMESPACE NAME READY STATUS RESTARTS AGE
nvidia-gpu-operator gpu-feature-discovery-lp6l2 0/1 Init:0/1 2 3d15h
nvidia-gpu-operator gpu-operator-1705822437-node-feature-discovery-gc-8b4757c8np8qh 1/1 Running 2 3d15h
nvidia-gpu-operator gpu-operator-1705822437-node-feature-discovery-master-fffb6xn62 1/1 Running 2 3d15h
nvidia-gpu-operator gpu-operator-1705822437-node-feature-discovery-worker-l7xnc 1/1 Running 2 3d15h
nvidia-gpu-operator gpu-operator-6b7d7ffcb5-4qzq5 1/1 Running 2 3d15h
nvidia-gpu-operator nvidia-dcgm-exporter-4mcl2 0/1 Init:0/1 2 3d15h
nvidia-gpu-operator nvidia-device-plugin-daemonset-9bdgz 0/1 Init:0/1 2 3d15h
nvidia-gpu-operator nvidia-operator-validator-npb77 0/1 Init:CrashLoopBackOff 1 (10s ago) 12s
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$
ubuntu@ip-172-31-12-242:~$ kubectl describe pod -n nvidia-gpu-operator nvidia-operator-validator-npb77
Name: nvidia-operator-validator-npb77
Namespace: nvidia-gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-operator-validator
Node: ip-172-31-12-242/172.31.12.242
Start Time: Wed, 24 Jan 2024 23:11:40 +0000
Labels: app=nvidia-operator-validator
app.kubernetes.io/managed-by=gpu-operator
app.kubernetes.io/part-of=gpu-operator
controller-revision-hash=686f79ffdd
helm.sh/chart=gpu-operator-v23.9.1
pod-template-generation=2
Annotations: cni.projectcalico.org/containerID: 779ed2e318ee2eeec9723f23a628ba8ba7e1b6c542f4370765686cdb86f76499
cni.projectcalico.org/podIP: 192.168.34.96/32
cni.projectcalico.org/podIPs: 192.168.34.96/32
Status: Pending
IP: 192.168.34.96
IPs:
IP: 192.168.34.96
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
driver-validation:
Container ID: cri-o://7d0ec5a4b6ddbce82afc7a1042821ee97dbf9ff5efab015a31f06e23ecf6181f
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:549ec806717ecd832a1dd219d3cb671024d005df0cfd54269441d21a0083ee51
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 24 Jan 2024 23:11:41 +0000
Finished: Wed, 24 Jan 2024 23:11:41 +0000
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
DISABLE_DEV_CHAR_SYMLINK_CREATION: true
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-path (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xlgvj (ro)
toolkit-validation:
Container ID: cri-o://cfd4d22426e6c43785705f20beb8c7ce9e80af955dcd9c8c7f2f96932f4ee8ea
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:549ec806717ecd832a1dd219d3cb671024d005df0cfd54269441d21a0083ee51
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 24 Jan 2024 23:11:56 +0000
Finished: Wed, 24 Jan 2024 23:11:56 +0000
Ready: False
Restart Count: 2
Environment:
NVIDIA_VISIBLE_DEVICES: all
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xlgvj (ro)
cuda-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
WITH_WAIT: false
COMPONENT: cuda
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: nvidia-gpu-operator (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xlgvj (ro)
plugin-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
COMPONENT: plugin
WITH_WAIT: false
WITH_WORKLOAD: false
MIG_STRATEGY: single
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: nvidia-gpu-operator (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xlgvj (ro)
Containers:
nvidia-operator-validator:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
echo all validations are successful; sleep infinity
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xlgvj (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-path:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
kube-api-access-xlgvj:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.operator-validator=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 37s default-scheduler Successfully assigned nvidia-gpu-operator/nvidia-operator-validator-npb77 to ip-172-31-12-242
Normal Pulled 36s kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
Normal Created 36s kubelet Created container driver-validation
Normal Started 36s kubelet Started container driver-validation
Normal Pulled 21s (x3 over 36s) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
Normal Created 21s (x3 over 36s) kubelet Created container toolkit-validation
Normal Started 21s (x3 over 36s) kubelet Started container toolkit-validation
Warning BackOff 6s (x4 over 34s) kubelet Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-npb77_nvidia-gpu-operator(909b0d55-942c-4815-94d7-d10bcc92cbd2)
Meanwhile, I will try to replicate this on my own AWS instance; depending on the result, I will share my details for remote debugging.
Appreciated! Thank you!
@yin19941005 I just spun up an AWS EC2 instance with Ubuntu 22.04 and followed the steps in the docs, and it is working, except that this command needs to be run after installing the GPU Operator:
kubectl get clusterpolicy cluster-policy -o yaml | sed "/validator:/a\ driver:\n env:\n - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n value: \"true\"" | kubectl apply -f -
I can see all pods are running:
ubuntu@ip-172-31-14-20:~$ kubectl get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-zkkbx 1/1 Running 0 100s
gpu-operator-1706199284-node-feature-discovery-gc-5c874bbbr647s 1/1 Running 0 113s
gpu-operator-1706199284-node-feature-discovery-master-755bmc5v4 1/1 Running 0 113s
gpu-operator-1706199284-node-feature-discovery-worker-p6gzv 1/1 Running 0 113s
gpu-operator-5f4dfcdf9-dxlq7 1/1 Running 0 113s
nvidia-cuda-validator-7zphl 0/1 Completed 0 74s
nvidia-dcgm-exporter-99htt 1/1 Running 0 100s
nvidia-device-plugin-daemonset-fjvgm 1/1 Running 0 100s
nvidia-operator-validator-czktn 1/1 Running 0 77s
@angudadevops, thank you for helping! I am not permitted to change the firewall rules (security group) of the instance to allow all public IPs. Is it possible to share your instance's Elastic IP? Then I can allow traffic from that IP, and you can connect to the instance through it.
Let me share my bash history with you, for your reference: bash_history_01.txt
@yin19941005 I had tried with containerd, not CRI-O; after the installation I found that there is an additional step that needs to be added. Thanks for the catch; I will update the docs for CRI-O with the steps below.
Run the commands below after the Install the CRI-O and dependencies step.
sudo nano /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
{
"version": "1.0.0",
"hook": {
"path": "/usr/bin/nvidia-container-runtime-hook",
"args": [
"nvidia-container-runtime-hook",
"prestart"
],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
]
},
"when": {
"always": true,
"commands": [
".*"
]
},
"stages": [
"prestart"
]
}
sudo systemctl daemon-reload && sudo systemctl restart crio.service
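After the restart, a quick sanity check that CRI-O came back up cleanly before re-testing the validator pod:
# Confirm CRI-O restarted without errors
systemctl is-active crio
sudo journalctl -u crio -n 20 --no-pager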
@angudadevops, thank you for the quick response! The fix is working! Much appreciated! I will close this issue.
For your reference, the following are the changes I made after the Install the CRI-O and dependencies step:
sudo nano /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
{
"version": "1.0.0",
"hook": {
"path": "/usr/bin/nvidia-container-runtime-hook",
"args": [
"nvidia-container-runtime-hook",
"prestart"
],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
]
},
"when": {
"always": true,
"commands": [
".*"
]
},
"stages": [
"prestart"
]
}
sudo systemctl daemon-reload && sudo systemctl restart crio.service
kubectl get clusterpolicy cluster-policy -o yaml | sed "/validator:/a\ driver:\n env:\n - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n value: \"true\"" | kubectl apply -f -
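As a final end-to-end check, a throwaway CUDA pod (the image tag here is only an example base image) can run nvidia-smi through the full CRI-O and toolkit path:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.3.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# once the pod completes:
kubectl logs gpu-smoke-test
kubectl delete pod gpu-smoke-test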
Hello,
I am trying to set up NVIDIA Cloud Native Stack v11.0 for Developers on AWS with Ubuntu 22.04. I followed the install guide, but the GPU Operator does not start correctly.
Platform: AWS
Instance type: g4dn.4xlarge
OS: Ubuntu 22.04 (AMI: ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20231207)
Once I finished the Installing GPU Operator step and started verifying the GPU Operator, I got the following error message: The gpu-operator is not functioning.
I tried launching another instance, but it ended with the same result. Then I tried to print the logs:
I had followed another install guide for Ubuntu Server, which does not install the CUDA driver, and it works. What did I miss in this guide? I tried matching the CUDA driver version (the version that nvidia-smi prints out) when running the helm install command for the GPU Operator, but that did not help either.