Closed: calebmm closed this issue 3 years ago
Same here. If you use `kubectl describe pod` on the nvidia-smi pod, you will get a detailed explanation. In my case, it says there are not enough (zero) GPUs available, even though I have one. It happens on EC2 g4 instances, too.
@calebmm @kazuf3 For nvidia-smi, please use the command below, as it has changed; it will be updated in the new 3.1 release:
kubectl run nvidia-smi --rm -t -i --restart=Never --image=nvidia/cuda:11.1.1-base --limits=nvidia.com/gpu=1 -- nvidia-smi
If the video analytics demo pod is pending, you can describe the pod and let us know the output:
kubectl describe pod video-analytics-demo-pod-name
@angudadevops Thanks for the response!
Here is the output (attached as a log file as well) of `kubectl describe pod video-analytics-demo-0-1612379306-6559f4766d-kckf4`:
Name:           video-analytics-demo-0-1612379306-6559f4766d-kckf4
Namespace:      default
Priority:       0
Node:
Labels:         app.kubernetes.io/instance=video-analytics-demo-0-1612379306
                app.kubernetes.io/name=video-analytics-demo
                pod-template-hash=6559f4766d
Annotations:    rollme: 39cMH
Status:         Pending
IP:
IPs:
Controlled By:  ReplicaSet/video-analytics-demo-0-1612379306-6559f4766d
Containers:
  video-analytics-demo-1:
    Image:       nvcr.io/nvidia/deepstream:5.0-20.07-samples
    Ports:       8554/TCP, 5080/TCP
    Host Ports:  0/TCP, 0/TCP
    Command:
      sh -c
      apt update; apt install wget unzip -y; wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/tlt_trafficcamnet/versions/pruned_v1.0/zip -O tlt_trafficcamnet_pruned_v1.0.zip; unzip *.zip; cp -r resnet18_trafficcamnet_pruned.etlt /opt/nvidia/deepstream/deepstream-5.0/samples/configs/tlt_pretrained_models/; sed -ie "s/../../models/tlt_pretrained_models/trafficcamnet/resnet18_trafficcamnet_pruned.etlt//opt/nvidia/deepstream/deepstream-5.0/samples/configs/tlt_pretrained_models/resnet18_trafficcamnet_pruned.etlt/g" /opt/nvidia/deepstream/deepstream-5.0/samples/configs/tlt_pretrained_models/config_infer_primary_trafficcamnet.txt; python /opt/nvidia/deepstream/create_config.py deepstream-app /opt/nvidia/deepstream/deepstream-5.0/samples/configs/deepstream-app/source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt
    Environment:
    Mounts:
      /etc/config from ipmount (rw)
      /opt/nvidia/deepstream/create_config.py from create-config (rw,path="create_config.py")
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-gqrnk (ro)
Conditions:
  Type          Status
  PodScheduled  False
Volumes:
  ipmount:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      video-analytics-demo-0-1612379306-configmap
    Optional:  false
  create-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      video-analytics-demo-0-1612379306-create-config
    Optional:  false
  default-token-gqrnk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-gqrnk
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                      From               Message
  Warning  FailedScheduling  2m29s (x14539 over 15d)  default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

Attachment: kube_pod_output.log
I have followed the installation instructions for v2.0 (Ubuntu 18.04) and v3.0 (Ubuntu 20.04), but neither seems to work. I always end up with the CrashLoopBackOff error. I tried using the Ansible playbooks and the install guides. What is the way forward?
kubectl get pods --all-namespaces
NAMESPACE     NAME                                                              READY   STATUS             RESTARTS   AGE
default       gpu-operator-1614327265-node-feature-discovery-master-587bjzfdk   1/1     Running            0          3h55m
default       gpu-operator-1614327265-node-feature-discovery-worker-8k7dr       0/1     CrashLoopBackOff   42         3h55m
default       gpu-operator-6f5d966ff8-dqxmc                                     0/1     CrashLoopBackOff   50         3h55m
kube-system   calico-kube-controllers-59877c7fb4-jlpzp                          1/1     Running            0          20h
kube-system   calico-node-f59xx                                                 0/1     Running            346        20h
kube-system   coredns-66bff467f8-rdntv                                          1/1     Running            0          20h
kube-system   coredns-66bff467f8-sq5q2                                          1/1     Running            0          20h
kube-system   etcd-testpc                                                       1/1     Running            0          20h
kube-system   kube-apiserver-testpc                                             1/1     Running            0          20h
kube-system   kube-controller-manager-testpc                                    1/1     Running            0          20h
kube-system   kube-proxy-z5r4n                                                  1/1     Running            0          20h
kube-system   kube-scheduler-testpc                                             1/1     Running            0          20h
kubectl describe -n default pod gpu-operator-1614327265-node-feature-discovery-worker-8k7dr
`Name: gpu-operator-1614327265-node-feature-discovery-worker-8k7dr
Namespace: default
Priority: 0
Node: testpc/172.19.0.65
Start Time: Fri, 26 Feb 2021 09:14:27 +0100
Labels: app.kubernetes.io/component=worker
app.kubernetes.io/instance=gpu-operator-1614327265
app.kubernetes.io/name=node-feature-discovery
controller-revision-hash=76c746b548
pod-template-generation=1
Annotations: cni.projectcalico.org/podIP: 192.168.101.78/32
cni.projectcalico.org/podIPs: 192.168.101.78/32
Status: Running
IP: 192.168.101.78
IPs:
IP: 192.168.101.78
Controlled By: DaemonSet/gpu-operator-1614327265-node-feature-discovery-worker
Containers:
node-feature-discovery-master:
Container ID: docker://3d37c360a0d022ee264074997da4f2cf3a701bacd95f96a363a47220953e4978
Image: quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
Image ID: docker-pullable://quay.io/kubernetes_incubator/node-feature-discovery@sha256:a1e72dbc35a16cbdcf0007fc4fb207bce723ff67c61853d2d8d8051558ce6de7
Port:
host-os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
host-sys:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType:
source-d:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/node-feature-discovery/source.d/
HostPathType:
features-d:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/node-feature-discovery/features.d/
HostPathType:
nfd-worker-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: gpu-operator-1614327265-node-feature-discovery
Optional: false
gpu-operator-1614327265-node-feature-discovery-token-nk2wz:
Type: Secret (a volume populated by a Secret)
SecretName: gpu-operator-1614327265-node-feature-discovery-token-nk2wz
Optional: false
QoS Class: BestEffort
Node-Selectors:
Events:
  Warning  BackOff  9s (x885 over 3h53m)  kubelet, testpc  Back-off restarting failed container `
@calebmm This looks like an issue with your master node. Run the command below to remove the taint from the master node, then retry the installation:
kubectl taint nodes --all node-role.kubernetes.io/master-
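Before and after running that command, it is worth confirming whether the master taint is actually present. On a live cluster you would inspect the node directly; the sketch below checks a captured `kubectl describe node` snippet instead (the file path and sample contents are illustrative, not from this cluster):

```shell
# On a live cluster you would run:
#   kubectl describe node <node-name> | grep Taints
# Here we grep a captured snippet (sample data for illustration):
cat > /tmp/node-describe.txt <<'EOF'
Name:   testpc
Taints: node-role.kubernetes.io/master:NoSchedule
EOF

# If this matches, pods without a matching toleration stay Pending on a
# single-node cluster, which is exactly the FailedScheduling event above.
if grep -q 'node-role.kubernetes.io/master' /tmp/node-describe.txt; then
  echo "master taint present"
fi
```

After `kubectl taint nodes --all node-role.kubernetes.io/master-` (note the trailing `-`, which removes the taint), the `Taints` line should read `<none>` and the pending pod should schedule.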
CrashLoopBackOff
@egamor18, please follow the steps below and let us know if you still face the same issue, or use the latest version of the GPU Operator from the 3.1 Guide.
1. Uninstall the GPU Operator.
2. Reboot the node.
3. Verify the Docker default runtime in `/etc/docker/daemon.json`.
4. If you see `"default-runtime" : "nvidia"` in `/etc/docker/daemon.json`, remove that line and restart the Docker service.
5. Install the GPU Operator again.
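Step 4 above can be sketched as follows. This edits a copy in /tmp rather than the real `/etc/docker/daemon.json`, and the starting file contents are an assumed example of what nvidia-docker2 typically writes:

```shell
# Sample daemon.json as typically written by the nvidia container packages
# (illustrative contents; the real file is /etc/docker/daemon.json):
cat > /tmp/daemon-edit.json <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF

# Delete the default-runtime line; the leading entry's trailing comma goes
# with it, so the remaining JSON stays valid.
sed -i '/"default-runtime"/d' /tmp/daemon-edit.json

# On the real node you would then restart Docker:
#   sudo systemctl restart docker
```

Keeping the `runtimes.nvidia` entry while dropping `default-runtime` means plain containers use runc, and only pods that explicitly request the nvidia runtime get it.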
Thank you @angudadevops. I tried version 3.1 as soon as I saw it and got a similar error (I will try it again later). Upon seeing this response, I re-installed Ubuntu and tried version 3.0 again, but without success. I removed the line `"default-runtime" : "nvidia"` from `/etc/docker/daemon.json` and saved it. I restarted Docker with `sudo systemctl restart docker`. When I checked a few minutes later, `"default-runtime" : "nvidia"` had re-appeared in the daemon.json file. `sudo pkill -SIGHUP dockerd` and restarting the Docker container didn't help either. What should I do?
Thank you @angudadevops. I tried version 3.1 as soon as I saw it uploaded and got a similar error (I will try it again later). Upon seeing this response, I re-installed Ubuntu and tried version 3.0 again. It is not resolved after following these instructions. I first had the problem of the line `"default-runtime" : "nvidia"` re-appearing in `/etc/docker/daemon.json`. I solved that by uninstalling docker2, restarting Docker, uninstalling the GPU Operator, and restarting the machine, then checking that the line had not come back before reinstalling the GPU Operator. But no success.
This is for version 3.1
kubectl get pods --all-namespaces
The result is:
NAMESPACE                NAME                                                              READY   STATUS             RESTARTS   AGE
default                  gpu-operator-1615369242-node-feature-discovery-master-77b6vm7pj   1/1     Running            3          33m
default                  gpu-operator-1615369242-node-feature-discovery-worker-wcm45       1/1     Running            3          33m
default                  gpu-operator-576f984b45-cnzdm                                     1/1     Running            4          33m
gpu-operator-resources   nvidia-container-toolkit-daemonset-55jf8                          0/1     Init:0/1           4          32m
gpu-operator-resources   nvidia-driver-daemonset-zxh7q                                     0/1     CrashLoopBackOff   13         32m
kube-system              calico-kube-controllers-54658cf6f7-gbwmc                          1/1     Running            3          43m
kube-system              calico-node-swj8c                                                 1/1     Running            4          43m
kube-system              coredns-66bff467f8-m2kkx                                          1/1     Running            4          49m
kube-system              coredns-66bff467f8-q57n7                                          1/1     Running            4          49m
kube-system              etcd-testpc                                                       1/1     Running            4          50m
kube-system              kube-apiserver-testpc                                             1/1     Running            5          50m
kube-system              kube-controller-manager-testpc                                    1/1     Running            4          50m
kube-system              kube-proxy-2mvbm                                                  1/1     Running            5          49m
kube-system              kube-scheduler-testpc                                             1/1     Running            4          50m
The description of the errors:
kubectl describe -n gpu-operator-resources pod nvidia-driver-daemonset-zxh7q
The result is:
`Name: nvidia-driver-daemonset-zxh7q
Namespace: gpu-operator-resources
Priority: 0
Node: testpc/172.19.0.65
Start Time: Wed, 10 Mar 2021 10:41:09 +0100
Labels: app=nvidia-driver-daemonset
controller-revision-hash=5cdf5f5997
pod-template-generation=1
Annotations: cni.projectcalico.org/podIP: 192.168.101.96/32
cni.projectcalico.org/podIPs: 192.168.101.96/32
scheduler.alpha.kubernetes.io/critical-pod:
Status: Running
IP: 192.168.101.96
IPs:
IP: 192.168.101.96
Controlled By: DaemonSet/nvidia-driver-daemonset
Containers:
nvidia-driver-ctr:
Container ID: docker://5e808a5f1d574b000cbca6ed5cfa42a8eefc715c7a4749151e67518df8fff79d
Image: nvcr.io/nvidia/driver:460.32.03-ubuntu20.04
Image ID: docker-pullable://nvcr.io/nvidia/driver@sha256:8a1d9a3c790ad93c67359177c3c8e690a0e2445e1a372e024346db429a58a086
Port:
var-log:
Type: HostPath (bare host directory volume)
Path: /var/log
HostPathType:
dev-log:
Type: HostPath (bare host directory volume)
Path: /dev/log
HostPathType:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nvidia-driver
Optional: false
nvidia-driver-token-pb5fr:
Type: Secret (a volume populated by a Secret)
SecretName: nvidia-driver-token-pb5fr
Optional: false
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.present=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
nvidia.com/gpu:NoSchedule
Events:
Type Reason Age From Message
  Normal   Scheduled       34m                  default-scheduler  Successfully assigned gpu-operator-resources/nvidia-driver-daemonset-zxh7q to testpc
  Normal   Pulling         34m                  kubelet            Pulling image "nvcr.io/nvidia/driver:460.32.03-ubuntu20.04"
  Normal   Pulled          34m                  kubelet            Successfully pulled image "nvcr.io/nvidia/driver:460.32.03-ubuntu20.04"
  Normal   Created         31m (x5 over 34m)    kubelet            Created container nvidia-driver-ctr
  Normal   Started         31m (x5 over 34m)    kubelet            Started container nvidia-driver-ctr
  Normal   Pulled          31m (x4 over 34m)    kubelet            Container image "nvcr.io/nvidia/driver:460.32.03-ubuntu20.04" already present on machine
  Warning  BackOff         19m (x55 over 33m)   kubelet            Back-off restarting failed container
  Warning  FailedMount     12m                  kubelet            MountVolume.SetUp failed for volume "config" : failed to sync configmap cache: timed out waiting for the condition
  Normal   SandboxChanged  11m (x2 over 12m)    kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          9m5s (x4 over 11m)   kubelet            Container image "nvcr.io/nvidia/driver:460.32.03-ubuntu20.04" already present on machine
  Normal   Created         9m4s (x4 over 11m)   kubelet            Created container nvidia-driver-ctr
  Normal   Started         9m4s (x4 over 11m)   kubelet            Started container nvidia-driver-ctr
  Warning  BackOff         2m (x35 over 11m)    kubelet            Back-off restarting failed container `
kubectl describe -n gpu-operator-resources pod nvidia-container-toolkit-daemonset-55jf8
The result is:
`Name: nvidia-container-toolkit-daemonset-55jf8
Namespace: gpu-operator-resources
Priority: 0
Node: testpc/172.19.0.65
Start Time: Wed, 10 Mar 2021 10:41:49 +0100
Labels: app=nvidia-container-toolkit-daemonset
controller-revision-hash=5bd8bf679d
pod-template-generation=1
Annotations: cni.projectcalico.org/podIP: 192.168.101.97/32
cni.projectcalico.org/podIPs: 192.168.101.97/32
scheduler.alpha.kubernetes.io/critical-pod:
Status: Pending
IP: 192.168.101.97
IPs:
IP: 192.168.101.97
Controlled By: DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
driver-validation:
Container ID: docker://fdb1ee5ffa1bfafbefd5f955e9f1b73352d6b30ab022994ba25633cfaea45d99
Image: nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
Image ID: docker-pullable://nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
    Port:
    Command (truncated in the captured output):
      ... | grep -v '^[[:space:]]' | cut -d':' -f1 | tr '[[:space:]]' ':'); export NVIDIA_LIBRARY_PATH=/run/nvidia/driver/usr/lib/x86_64-linux-gnu/:/run/nvidia/driver/usr/lib64; export LD_LIBRARY_PATH=${SYS_LIBRARY_PATH}:${NVIDIA_LIBRARY_PATH}; echo ${LD_LIBRARY_PATH}; export PATH=/run/nvidia/driver/usr/bin/:${PATH}; until nvidia-smi; do echo waiting for nvidia drivers to be loaded; sleep 5; done
    State:          Running
      Started:      Wed, 10 Mar 2021 11:04:42 +0100
    Ready:          False
    Restart Count:  4
    Environment:
Containers:
    Image:     nvcr.io/nvidia/k8s/container-toolkit:1.4.5-ubuntu18.04
    Image ID:
    Port:
Volumes:
  nvidia-local:
    Type:  HostPath (bare host directory volume)
    Path:  /usr/local/nvidia
  crio-hooks:
    Type:  HostPath (bare host directory volume)
    Path:  /etc/containers/oci/hooks.d
  docker-config:
    Type:  HostPath (bare host directory volume)
    Path:  /etc/docker
  docker-socket:
    Type:  HostPath (bare host directory volume)
    Path:  /var/run
  nvidia-container-toolkit-token-hzl87:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nvidia-container-toolkit-token-hzl87
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  nvidia.com/gpu.present=true
Tolerations:     CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
                 nvidia.com/gpu:NoSchedule
Events:
  Type     Reason          Age                 From               Message
  Normal   Scheduled       50m                 default-scheduler  Successfully assigned gpu-operator-resources/nvidia-container-toolkit-daemonset-55jf8 to testpc
  Normal   Pulling         50m                 kubelet            Pulling image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59"
  Normal   Pulled          50m                 kubelet            Successfully pulled image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59"
  Normal   SandboxChanged  38m (x11 over 43m)  kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Created         38m (x4 over 50m)   kubelet            Created container driver-validation
  Normal   Started         38m (x4 over 50m)   kubelet            Started container driver-validation
  Normal   Pulled          38m (x3 over 42m)   kubelet            Container image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59" already present on machine
  Warning  FailedMount     28m                 kubelet            MountVolume.SetUp failed for volume "nvidia-container-toolkit-token-hzl87" : failed to sync secret cache: timed out waiting for the condition
  Normal   SandboxChanged  27m (x2 over 28m)   kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          27m                 kubelet            Container image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59" already present on machine
  Normal   Created         27m                 kubelet            Created container driver-validation
  Normal   Started         27m                 kubelet            Started container driver-validation `
the content of the daemon.json is:
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
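A malformed daemon.json silently prevents Docker from starting, so it is worth syntax-checking the file before restarting the service. A minimal sketch, using a copy in /tmp rather than the real `/etc/docker/daemon.json`:

```shell
# Write the same daemon.json contents to a scratch location (on a real node
# you would check /etc/docker/daemon.json in place):
cat > /tmp/daemon-check.json <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF

# python3 -m json.tool exits non-zero on any JSON syntax error:
python3 -m json.tool /tmp/daemon-check.json > /dev/null && echo "daemon.json is valid JSON"
```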
Installed the container runtime with:
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
@egamor18 It looks like the driver daemonset failed to run; this is an issue with the GPU Operator, not the 3.1 stack. Can you please let us know what GPU you're using, and also check the driver daemonset logs?
For `nvidia-smi`:
Tue Mar 16 14:55:46 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39 Driver Version: 460.39 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 6000 Off | 00000000:AF:00.0 On | Off |
| 35% 28C P8 19W / 260W | 324MiB / 24217MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1478 G /usr/lib/xorg/Xorg 39MiB |
| 0 N/A N/A 14527 G /usr/lib/xorg/Xorg 121MiB |
| 0 N/A N/A 14747 G /usr/bin/gnome-shell 33MiB |
| 0 N/A N/A 31237 G ...AAAAAAAAA= --shared-files 119MiB |
+-----------------------------------------------------------------------------+
For the logs, `kubectl logs -n gpu-operator-resources -p nvidia-driver-daemonset-s9rld`:
`Creating directory NVIDIA-Linux-x86_64-460.32.03
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 460.32.03...
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-module' command line option, nvidia-installer will not install a kernel module as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that an NVIDIA kernel module matching this driver version is installed separately.
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 5.8.0-44-generic
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Could not unload NVIDIA driver kernel modules, driver is in use
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Could not unload NVIDIA driver kernel modules, driver is in use`
@egamor18 As you may have noticed, the nvidia driver kernel module is in use. Please unload the kernel modules and reboot the instance. If you encounter any more issues related to the GPU Operator, please raise an issue at https://github.com/NVIDIA/gpu-operator/issues.
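For reference, unloading the NVIDIA kernel modules usually means stopping whatever holds the GPU (here, the Xorg/gnome-shell processes visible in the nvidia-smi output) and then removing the modules in dependency order. The live commands below are a sketch and require root; the script part derives the unload order from a captured lsmod snippet (sample data, illustrative only):

```shell
# On the real node (as root), after stopping the display manager:
#   sudo systemctl stop display-manager
#   sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
#   sudo reboot
# The base 'nvidia' module must be removed last, since the others depend on it.
# Sample lsmod output (illustrative module sizes/usage counts):
cat > /tmp/lsmod-sample.txt <<'EOF'
nvidia_uvm            995328  0
nvidia_drm             57344  4
nvidia_modeset       1228800  6 nvidia_drm
nvidia              34037760  510 nvidia_uvm,nvidia_modeset
EOF

# Print dependent modules first and the base 'nvidia' module last:
awk '$1 != "nvidia" {print $1} $1 == "nvidia" {last=$1} END {print last}' /tmp/lsmod-sample.txt
```

If rmmod still reports the driver is in use, `sudo lsof /dev/nvidia*` shows which processes hold it; otherwise a reboot clears the state.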
@angudadevops Thanks for the response!
Here is the output of(attached as log file as well): kubectl describe pod video-analytics-demo-0-1612379306-6559f4766d-kckf4
Name: video-analytics-demo-0-1612379306-6559f4766d-kckf4 Namespace: default Priority: 0 Node: Labels: app.kubernetes.io/instance=video-analytics-demo-0-1612379306 app.kubernetes.io/name=video-analytics-demo pod-template-hash=6559f4766d Annotations: rollme: 39cMH Status: Pending IP: IPs: Controlled By: ReplicaSet/video-analytics-demo-0-1612379306-6559f4766d Containers: video-analytics-demo-1: Image: nvcr.io/nvidia/deepstream:5.0-20.07-samples Ports: 8554/TCP, 5080/TCP Host Ports: 0/TCP, 0/TCP Command: sh -c apt update; apt install wget unzip -y; wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/tlt_trafficcamnet/versions/pruned_v1.0/zip -O tlt_trafficcamnet_pruned_v1.0.zip; unzip *.zip; cp -r resnet18_trafficcamnet_pruned.etlt /opt/nvidia/deepstream/deepstream-5.0/samples/configs/tlt_pretrained_models/; sed -ie "s/../../models/tlt_pretrained_models/trafficcamnet/resnet18_trafficcamnet_pruned.etlt//opt/nvidia/deepstream/deepstream-5.0/samples/configs/tlt_pretrained_models/resnet18_trafficcamnet_pruned.etlt/g" /opt/nvidia/deepstream/deepstream-5.0/samples/configs/tlt_pretrained_models/config_infer_primary_trafficcamnet.txt; python /opt/nvidia/deepstream/create_config.py deepstream-app /opt/nvidia/deepstream/deepstream-5.0/samples/configs/deepstream-app/source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt Environment: Mounts: /etc/config from ipmount (rw) /opt/nvidia/deepstream/create_config.py from create-config (rw,path="create_config.py") /var/run/secrets/kubernetes.io/serviceaccount from default-token-gqrnk (ro) Conditions: Type Status PodScheduled False Volumes: ipmount: Type: ConfigMap (a volume populated by a ConfigMap) Name: video-analytics-demo-0-1612379306-configmap Optional: false create-config: Type: ConfigMap (a volume populated by a ConfigMap) Name: video-analytics-demo-0-1612379306-create-config Optional: false default-token-gqrnk: Type: Secret (a volume populated by a Secret) SecretName: default-token-gqrnk Optional: 
false QoS Class: BestEffort Node-Selectors: Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s node.kubernetes.io/unreachable:NoExecute for 300s Events: Type Reason Age From Message
Warning FailedScheduling 2m29s (x14539 over 15d) default-scheduler 0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. kube_pod_output.log
@calebmm @kazuf3 Are you encountering the same issue? Are you able to run the nvidia-smi pod?
"please unload the kernel modules": @angudadevops, can you please show me how to do this? Thank you.
@calebmm @kazuf3 As there's no response, closing this issue.
I have followed the steps outlined in the Ubuntu Server V3.0 install guide, and the validation steps are not working.
First I tried the nvidia-smi example, but the nvidia-smi pod just showed as "Pending" while all other pods were shown as "Running".
Second, I tried the Video Analytics Demo. The install from the Helm chart seemed to be successful, and Helm reported that the pod was deployed. When I check the pod status, it shows as "Pending" as well, while all other pods (other than nvidia-smi) show as "Running".
None of the install steps showed an error, but does this mean that my installation was not successful? How can I find some information on debugging a "Pending" pod?
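A Pending pod has no container logs yet, so the scheduler events are the place to look. A sketch of the usual workflow; the live kubectl commands stay in comments, and the grep runs against a sample event line of the kind seen earlier in this thread (file path and contents are illustrative):

```shell
# On a live cluster, the scheduler explains why a pod is Pending:
#   kubectl describe pod <pod-name>            # see the Events section at the end
#   kubectl get events --field-selector involvedObject.name=<pod-name>
# The common finding on a single-node cluster is the master taint, as in this
# sample FailedScheduling event:
cat > /tmp/pending-events.txt <<'EOF'
Warning  FailedScheduling  default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
EOF

grep -o 'node-role.kubernetes.io/master' /tmp/pending-events.txt
```

When the event names that taint, removing it (`kubectl taint nodes --all node-role.kubernetes.io/master-`) lets the pod schedule; other common causes reported in the same event are insufficient resources, e.g. no allocatable `nvidia.com/gpu`.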