NVIDIA / cloud-native-stack

Run cloud native workloads on NVIDIA GPUs
Apache License 2.0

Video Analytics Demo and Nvidia-smi pod just "pending" after being deployed #6

Closed · calebmm closed this issue 3 years ago

calebmm commented 3 years ago

I have followed the steps outlined in the Ubuntu Server V3.0 install guide, and the validation steps are not working.

First I tried the nvidia-smi example, but the nvidia-smi pod just showed as "Pending" while all other pods were shown as "Running".

Second, I tried the Video Analytics Demo. The install from the helm chart seemed to be successful, and helm reported that the pod was deployed. When I check the pod status, it shows as "pending" as well, while all other pods (other than nvidia-smi) show as "Running".

None of the install steps showed an error, but does this mean that my installation was not successful? How can I find more information to debug a "Pending" pod?

kazuf3 commented 3 years ago

Same here. If you run kubectl describe pod on the nvidia-smi pod you will get a detailed explanation. In my case, it says there are not enough (zero) GPUs available, even though the node has one. It happens on EC2 g4 instances, too.
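One quick check (a minimal sketch; substitute your node name) is to look at the node's allocatable resources. If nvidia.com/gpu is missing or 0 there, the device plugin has not registered the GPU and GPU pods will stay Pending:

kubectl describe node <node-name> | grep -A 8 Allocatable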

angudadevops commented 3 years ago

@calebmm @kazuf3 For nvidia-smi, please use the command below, as it has changed; the guide will be updated with the new 3.1 release.

kubectl run nvidia-smi --rm -t -i --restart=Never --image=nvidia/cuda:11.1.1-base --limits=nvidia.com/gpu=1 -- nvidia-smi

If the Video Analytics Demo pod is pending, please describe the pod and share the output:

kubectl describe pod video-analytics-demo-pod-name
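The exact pod name varies per install; assuming the chart's default app.kubernetes.io/name label, one way to find it is:

kubectl get pods -l app.kubernetes.io/name=video-analytics-demo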

calebmm commented 3 years ago

@angudadevops Thanks for the response!

Here is the output of kubectl describe pod video-analytics-demo-0-1612379306-6559f4766d-kckf4 (also attached as a log file):

Name:           video-analytics-demo-0-1612379306-6559f4766d-kckf4
Namespace:      default
Priority:       0
Node:
Labels:         app.kubernetes.io/instance=video-analytics-demo-0-1612379306
                app.kubernetes.io/name=video-analytics-demo
                pod-template-hash=6559f4766d
Annotations:    rollme: 39cMH
Status:         Pending
IP:
IPs:
Controlled By:  ReplicaSet/video-analytics-demo-0-1612379306-6559f4766d
Containers:
  video-analytics-demo-1:
    Image:       nvcr.io/nvidia/deepstream:5.0-20.07-samples
    Ports:       8554/TCP, 5080/TCP
    Host Ports:  0/TCP, 0/TCP
    Command:     sh -c apt update; apt install wget unzip -y; wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/tlt_trafficcamnet/versions/pruned_v1.0/zip -O tlt_trafficcamnet_pruned_v1.0.zip; unzip *.zip; cp -r resnet18_trafficcamnet_pruned.etlt /opt/nvidia/deepstream/deepstream-5.0/samples/configs/tlt_pretrained_models/; sed -ie "s/..\/..\/models\/tlt_pretrained_models\/trafficcamnet\/resnet18_trafficcamnet_pruned.etlt/\/opt\/nvidia\/deepstream\/deepstream-5.0\/samples\/configs\/tlt_pretrained_models\/resnet18_trafficcamnet_pruned.etlt/g" /opt/nvidia/deepstream/deepstream-5.0/samples/configs/tlt_pretrained_models/config_infer_primary_trafficcamnet.txt; python /opt/nvidia/deepstream/create_config.py deepstream-app /opt/nvidia/deepstream/deepstream-5.0/samples/configs/deepstream-app/source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt
    Environment:
    Mounts:
      /etc/config from ipmount (rw)
      /opt/nvidia/deepstream/create_config.py from create-config (rw,path="create_config.py")
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-gqrnk (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  ipmount:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      video-analytics-demo-0-1612379306-configmap
    Optional:  false
  create-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      video-analytics-demo-0-1612379306-create-config
    Optional:  false
  default-token-gqrnk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-gqrnk
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                      From               Message


  Warning  FailedScheduling  2m29s (x14539 over 15d)  default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

kube_pod_output.log

egamor18 commented 3 years ago

I have followed the installation instructions for v2.0 (Ubuntu 18.04) and v3.0 (Ubuntu 20.04), but neither seems to work. I always end up with a CrashLoopBackOff error. I tried using the Ansible playbooks and the install guides. What is the way forward?

kubectl get pods --all-namespaces

NAMESPACE     NAME                                                               READY   STATUS             RESTARTS   AGE
default       gpu-operator-1614327265-node-feature-discovery-master-587bjzfdk   1/1     Running            0          3h55m
default       gpu-operator-1614327265-node-feature-discovery-worker-8k7dr       0/1     CrashLoopBackOff   42         3h55m
default       gpu-operator-6f5d966ff8-dqxmc                                      0/1     CrashLoopBackOff   50         3h55m
kube-system   calico-kube-controllers-59877c7fb4-jlpzp                           1/1     Running            0          20h
kube-system   calico-node-f59xx                                                  0/1     Running            346        20h
kube-system   coredns-66bff467f8-rdntv                                           1/1     Running            0          20h
kube-system   coredns-66bff467f8-sq5q2                                           1/1     Running            0          20h
kube-system   etcd-testpc                                                        1/1     Running            0          20h
kube-system   kube-apiserver-testpc                                              1/1     Running            0          20h
kube-system   kube-controller-manager-testpc                                     1/1     Running            0          20h
kube-system   kube-proxy-z5r4n                                                   1/1     Running            0          20h
kube-system   kube-scheduler-testpc                                              1/1     Running            0          20h

kubectl describe -n default pod gpu-operator-1614327265-node-feature-discovery-worker-8k7dr

Name:           gpu-operator-1614327265-node-feature-discovery-worker-8k7dr
Namespace:      default
Priority:       0
Node:           testpc/172.19.0.65
Start Time:     Fri, 26 Feb 2021 09:14:27 +0100
Labels:         app.kubernetes.io/component=worker
                app.kubernetes.io/instance=gpu-operator-1614327265
                app.kubernetes.io/name=node-feature-discovery
                controller-revision-hash=76c746b548
                pod-template-generation=1
Annotations:    cni.projectcalico.org/podIP: 192.168.101.78/32
                cni.projectcalico.org/podIPs: 192.168.101.78/32
Status:         Running
IP:             192.168.101.78
IPs:
  IP:           192.168.101.78
Controlled By:  DaemonSet/gpu-operator-1614327265-node-feature-discovery-worker
Containers:
  node-feature-discovery-master:
    Container ID:   docker://3d37c360a0d022ee264074997da4f2cf3a701bacd95f96a363a47220953e4978
    Image:          quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
    Image ID:       docker-pullable://quay.io/kubernetes_incubator/node-feature-discovery@sha256:a1e72dbc35a16cbdcf0007fc4fb207bce723ff67c61853d2d8d8051558ce6de7
    Port:
    Host Port:
    Command:        nfd-worker
    Args:           --sleep-interval=60s --server=gpu-operator-1614327265-node-feature-discovery:8080
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 26 Feb 2021 13:05:52 +0100
      Finished:     Fri, 26 Feb 2021 13:06:52 +0100
    Ready:          False
    Restart Count:  42
    Environment:
      NODE_NAME:    (v1:spec.nodeName)
    Mounts:
      /etc/kubernetes/node-feature-discovery/ from nfd-worker-config (rw)
      /etc/kubernetes/node-feature-discovery/features.d/ from features-d (rw)
      /etc/kubernetes/node-feature-discovery/source.d/ from source-d (rw)
      /host-boot from host-boot (ro)
      /host-etc/os-release from host-os-release (ro)
      /host-sys from host-sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from gpu-operator-1614327265-node-feature-discovery-token-nk2wz (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  host-boot:
    Type:          HostPath (bare host directory volume)
    Path:          /boot
    HostPathType:
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:
  source-d:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/node-feature-discovery/source.d/
    HostPathType:
  features-d:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/node-feature-discovery/features.d/
    HostPathType:
  nfd-worker-config:
    Type:          ConfigMap (a volume populated by a ConfigMap)
    Name:          gpu-operator-1614327265-node-feature-discovery
    Optional:      false
  gpu-operator-1614327265-node-feature-discovery-token-nk2wz:
    Type:          Secret (a volume populated by a Secret)
    SecretName:    gpu-operator-1614327265-node-feature-discovery-token-nk2wz
    Optional:      false
QoS Class:         BestEffort
Node-Selectors:
Tolerations:       node-role.kubernetes.io/master:NoSchedule
                   node.kubernetes.io/disk-pressure:NoSchedule
                   node.kubernetes.io/memory-pressure:NoSchedule
                   node.kubernetes.io/not-ready:NoExecute
                   node.kubernetes.io/pid-pressure:NoSchedule
                   node.kubernetes.io/unreachable:NoExecute
                   node.kubernetes.io/unschedulable:NoSchedule
                   nvidia.com/gpu=present:NoSchedule
Events:
  Type     Reason   Age                   From             Message


  Warning  BackOff  9s (x885 over 3h53m)  kubelet, testpc  Back-off restarting failed container

angudadevops commented 3 years ago

(Quoting @calebmm's kubectl describe pod video-analytics-demo-0-1612379306-6559f4766d-kckf4 output above.)

@calebmm this looks like an issue with your master node. Run the command below to remove the taint from the master node, then retry the installation:

kubectl taint nodes --all node-role.kubernetes.io/master-
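
A quick way to confirm the taint is gone before retrying (substitute your master node's name):

kubectl describe node <master-node-name> | grep -i taint
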
angudadevops commented 3 years ago

CrashLoopBackOff

@egamor18, please follow the steps below and let us know if you still face the same issue, or use the latest version of the GPU Operator from the 3.1 guide. (A rough command-level sketch follows the list.)

1. Uninstall the GPU Operator.
2. Reboot the node.
3. Verify the Docker default runtime in `/etc/docker/daemon.json`.
4. If you see `"default-runtime" : "nvidia"` in `/etc/docker/daemon.json`, remove that line and restart the Docker service.
5. Install the GPU Operator again.
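
For reference, a rough command-level sketch of these steps (the Helm release name here is only an example; check yours with helm list):

helm uninstall gpu-operator-1614327265      # 1. uninstall the GPU Operator
sudo reboot                                 # 2. reboot the node
cat /etc/docker/daemon.json                 # 3. check the Docker default runtime
# 4. if "default-runtime" : "nvidia" is present, remove that line, then:
sudo systemctl restart docker
# 5. reinstall the GPU Operator per the install guide
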
egamor18 commented 3 years ago

(Quoting the GPU Operator reinstall steps from the previous comment.)

Thank you @angudadevops. I tried version 3.1 as soon as I saw it and got a similar error (I will try it again later). Upon seeing this response, I re-installed Ubuntu and tried version 3.0 again, but without success. I removed the line "default-runtime" : "nvidia" from /etc/docker/daemon.json and saved it. I restarted Docker with sudo systemctl restart docker. When I checked after a few minutes, "default-runtime" : "nvidia" had re-appeared in the daemon.json file. sudo pkill -SIGHUP dockerd and restarting the Docker container didn't help either. What should I do?

egamor18 commented 3 years ago

(Quoting the GPU Operator reinstall steps above.)

Thank you @angudadevops. I tried version 3.1 as soon as I saw it uploaded and got a similar error (I will try it again later). Upon seeing this response, I re-installed Ubuntu and tried version 3.0 again. It is not resolved after following these instructions. I first had the problem of the line "default-runtime" : "nvidia" in /etc/docker/daemon.json re-appearing. I worked around that by uninstalling docker2, restarting Docker, uninstalling the gpu-operator, and restarting the machine, then checking that the line was not back and reinstalling the GPU Operator. But no success.

egamor18 commented 3 years ago

This is for version 3.1. The output of kubectl get pods --all-namespaces is:

NAMESPACE                NAME                                                               READY   STATUS             RESTARTS   AGE
default                  gpu-operator-1615369242-node-feature-discovery-master-77b6vm7pj   1/1     Running            3          33m
default                  gpu-operator-1615369242-node-feature-discovery-worker-wcm45       1/1     Running            3          33m
default                  gpu-operator-576f984b45-cnzdm                                      1/1     Running            4          33m
gpu-operator-resources   nvidia-container-toolkit-daemonset-55jf8                           0/1     Init:0/1           4          32m
gpu-operator-resources   nvidia-driver-daemonset-zxh7q                                      0/1     CrashLoopBackOff   13         32m
kube-system              calico-kube-controllers-54658cf6f7-gbwmc                           1/1     Running            3          43m
kube-system              calico-node-swj8c                                                  1/1     Running            4          43m
kube-system              coredns-66bff467f8-m2kkx                                           1/1     Running            4          49m
kube-system              coredns-66bff467f8-q57n7                                           1/1     Running            4          49m
kube-system              etcd-testpc                                                        1/1     Running            4          50m
kube-system              kube-apiserver-testpc                                              1/1     Running            5          50m
kube-system              kube-controller-manager-testpc                                     1/1     Running            4          50m
kube-system              kube-proxy-2mvbm                                                   1/1     Running            5          49m
kube-system              kube-scheduler-testpc                                              1/1     Running            4          50m

The description of the error, from kubectl describe -n gpu-operator-resources pod nvidia-driver-daemonset-zxh7q, is:

Name:           nvidia-driver-daemonset-zxh7q
Namespace:      gpu-operator-resources
Priority:       0
Node:           testpc/172.19.0.65
Start Time:     Wed, 10 Mar 2021 10:41:09 +0100
Labels:         app=nvidia-driver-daemonset
                controller-revision-hash=5cdf5f5997
                pod-template-generation=1
Annotations:    cni.projectcalico.org/podIP: 192.168.101.96/32
                cni.projectcalico.org/podIPs: 192.168.101.96/32
                scheduler.alpha.kubernetes.io/critical-pod:
Status:         Running
IP:             192.168.101.96
IPs:
  IP:           192.168.101.96
Controlled By:  DaemonSet/nvidia-driver-daemonset
Containers:
  nvidia-driver-ctr:
    Container ID:   docker://5e808a5f1d574b000cbca6ed5cfa42a8eefc715c7a4749151e67518df8fff79d
    Image:          nvcr.io/nvidia/driver:460.32.03-ubuntu20.04
    Image ID:       docker-pullable://nvcr.io/nvidia/driver@sha256:8a1d9a3c790ad93c67359177c3c8e690a0e2445e1a372e024346db429a58a086
    Port:
    Host Port:
    Command:        nvidia-driver
    Args:           init
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 10 Mar 2021 11:11:46 +0100
      Finished:     Wed, 10 Mar 2021 11:12:03 +0100
    Ready:          False
    Restart Count:  13
    Environment:
    Mounts:
      /dev/log from dev-log (rw)
      /etc/containers/oci/hooks.d from config (rw)
      /run/nvidia from run-nvidia (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-driver-token-pb5fr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:
  config:
    Type:          ConfigMap (a volume populated by a ConfigMap)
    Name:          nvidia-driver
    Optional:      false
  nvidia-driver-token-pb5fr:
    Type:          Secret (a volume populated by a Secret)
    SecretName:    nvidia-driver-token-pb5fr
    Optional:      false
QoS Class:         BestEffort
Node-Selectors:    nvidia.com/gpu.present=true
Tolerations:       node.kubernetes.io/disk-pressure:NoSchedule
                   node.kubernetes.io/memory-pressure:NoSchedule
                   node.kubernetes.io/not-ready:NoExecute
                   node.kubernetes.io/pid-pressure:NoSchedule
                   node.kubernetes.io/unreachable:NoExecute
                   node.kubernetes.io/unschedulable:NoSchedule
                   nvidia.com/gpu:NoSchedule
Events:
  Type     Reason   Age   From   Message


  Normal   Scheduled       34m                  default-scheduler  Successfully assigned gpu-operator-resources/nvidia-driver-daemonset-zxh7q to testpc
  Normal   Pulling         34m                  kubelet            Pulling image "nvcr.io/nvidia/driver:460.32.03-ubuntu20.04"
  Normal   Pulled          34m                  kubelet            Successfully pulled image "nvcr.io/nvidia/driver:460.32.03-ubuntu20.04"
  Normal   Created         31m (x5 over 34m)    kubelet            Created container nvidia-driver-ctr
  Normal   Started         31m (x5 over 34m)    kubelet            Started container nvidia-driver-ctr
  Normal   Pulled          31m (x4 over 34m)    kubelet            Container image "nvcr.io/nvidia/driver:460.32.03-ubuntu20.04" already present on machine
  Warning  BackOff         19m (x55 over 33m)   kubelet            Back-off restarting failed container
  Warning  FailedMount     12m                  kubelet            MountVolume.SetUp failed for volume "config" : failed to sync configmap cache: timed out waiting for the condition
  Normal   SandboxChanged  11m (x2 over 12m)    kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          9m5s (x4 over 11m)   kubelet            Container image "nvcr.io/nvidia/driver:460.32.03-ubuntu20.04" already present on machine
  Normal   Created         9m4s (x4 over 11m)   kubelet            Created container nvidia-driver-ctr
  Normal   Started         9m4s (x4 over 11m)   kubelet            Started container nvidia-driver-ctr
  Warning  BackOff         2m (x35 over 11m)    kubelet            Back-off restarting failed container

The output of kubectl describe -n gpu-operator-resources pod nvidia-container-toolkit-daemonset-55jf8 is:

Name:           nvidia-container-toolkit-daemonset-55jf8
Namespace:      gpu-operator-resources
Priority:       0
Node:           testpc/172.19.0.65
Start Time:     Wed, 10 Mar 2021 10:41:49 +0100
Labels:         app=nvidia-container-toolkit-daemonset
                controller-revision-hash=5bd8bf679d
                pod-template-generation=1
Annotations:    cni.projectcalico.org/podIP: 192.168.101.97/32
                cni.projectcalico.org/podIPs: 192.168.101.97/32
                scheduler.alpha.kubernetes.io/critical-pod:
Status:         Pending
IP:             192.168.101.97
IPs:
  IP:           192.168.101.97
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:   docker://fdb1ee5ffa1bfafbefd5f955e9f1b73352d6b30ab022994ba25633cfaea45d99
    Image:          nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
    Image ID:       docker-pullable://nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
    Port:
    Host Port:
    Command:        sh -c
    Args:           export SYS_LIBRARY_PATH=$(ldconfig -v 2>/dev/null | grep -v '^[[:space:]]' | cut -d':' -f1 | tr '[[:space:]]' ':'); export NVIDIA_LIBRARY_PATH=/run/nvidia/driver/usr/lib/x86_64-linux-gnu/:/run/nvidia/driver/usr/lib64; export LD_LIBRARY_PATH=${SYS_LIBRARY_PATH}:${NVIDIA_LIBRARY_PATH}; echo ${LD_LIBRARY_PATH}; export PATH=/run/nvidia/driver/usr/bin/:${PATH}; until nvidia-smi; do echo waiting for nvidia drivers to be loaded; sleep 5; done
    State:          Running
      Started:      Wed, 10 Mar 2021 11:04:42 +0100
    Ready:          False
    Restart Count:  4
    Environment:
    Mounts:
      /run/nvidia from nvidia-install-path (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-container-toolkit-token-hzl87 (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:
    Image:          nvcr.io/nvidia/k8s/container-toolkit:1.4.5-ubuntu18.04
    Image ID:
    Port:
    Host Port:
    Args:           /usr/local/nvidia
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      RUNTIME_ARGS:  --socket /var/run/docker.sock --config /etc/docker/daemon.json
      RUNTIME:       docker
    Mounts:
      /etc/docker/ from docker-config (rw)
      /run/nvidia from nvidia-install-path (rw)
      /usr/local/nvidia from nvidia-local (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/ from docker-socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-container-toolkit-token-hzl87 (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  nvidia-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:
  nvidia-local:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/nvidia
    HostPathType:
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/containers/oci/hooks.d
    HostPathType:
  docker-config:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/docker
    HostPathType:
  docker-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run
    HostPathType:
  nvidia-container-toolkit-token-hzl87:
    Type:          Secret (a volume populated by a Secret)
    SecretName:    nvidia-container-toolkit-token-hzl87
    Optional:      false
QoS Class:         BestEffort
Node-Selectors:    nvidia.com/gpu.present=true
Tolerations:       CriticalAddonsOnly
                   node.kubernetes.io/disk-pressure:NoSchedule
                   node.kubernetes.io/memory-pressure:NoSchedule
                   node.kubernetes.io/not-ready:NoExecute
                   node.kubernetes.io/pid-pressure:NoSchedule
                   node.kubernetes.io/unreachable:NoExecute
                   node.kubernetes.io/unschedulable:NoSchedule
                   nvidia.com/gpu:NoSchedule
Events:
  Type     Reason   Age   From   Message

  Normal   Scheduled       50m                  default-scheduler  Successfully assigned gpu-operator-resources/nvidia-container-toolkit-daemonset-55jf8 to testpc
  Normal   Pulling         50m                  kubelet            Pulling image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59"
  Normal   Pulled          50m                  kubelet            Successfully pulled image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59"
  Normal   SandboxChanged  38m (x11 over 43m)   kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Created         38m (x4 over 50m)    kubelet            Created container driver-validation
  Normal   Started         38m (x4 over 50m)    kubelet            Started container driver-validation
  Normal   Pulled          38m (x3 over 42m)    kubelet            Container image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59" already present on machine
  Warning  FailedMount     28m                  kubelet            MountVolume.SetUp failed for volume "nvidia-container-toolkit-token-hzl87" : failed to sync secret cache: timed out waiting for the condition
  Normal   SandboxChanged  27m (x2 over 28m)    kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          27m                  kubelet            Container image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59" already present on machine
  Normal   Created         27m                  kubelet            Created container driver-validation
  Normal   Started         27m                  kubelet            Started container driver-validation

The content of /etc/docker/daemon.json is:

{ "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } } } Installed the container runtime with:

sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit
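
As a host-level sanity check (a minimal sketch, assuming Docker 19.03+ with the --gpus flag; the CUDA image tag is only an example), you can confirm the runtime works outside Kubernetes:

sudo docker run --rm --gpus all nvidia/cuda:11.1.1-base nvidia-smi

If this fails on the host, pods will not be able to access the GPU either.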

angudadevops commented 3 years ago

(Quoting the kubectl get pods --all-namespaces output from the previous comment.)

@egamor18 it looks like the driver daemonset failed to run; this is an issue with the GPU Operator, not the 3.1 stack. Can you let us know which GPU you're using, and also check the driver daemonset logs?

egamor18 commented 3 years ago

For nvidia-smi:

Tue Mar 16 14:55:46 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     Off  | 00000000:AF:00.0  On |                  Off |
| 35%   28C    P8    19W / 260W |    324MiB / 24217MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1478      G   /usr/lib/xorg/Xorg                 39MiB |
|    0   N/A  N/A     14527      G   /usr/lib/xorg/Xorg                121MiB |
|    0   N/A  N/A     14747      G   /usr/bin/gnome-shell               33MiB |
|    0   N/A  N/A     31237      G   ...AAAAAAAAA= --shared-files      119MiB |
+-----------------------------------------------------------------------------+

For the logs, kubectl logs -n gpu-operator-resources -p nvidia-driver-daemonset-s9rld shows:

Creating directory NVIDIA-Linux-x86_64-460.32.03
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 460.32.03.................................................................

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.

WARNING: You specified the '--no-kernel-module' command line option, nvidia-installer will not install a kernel module as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that an NVIDIA kernel module matching this driver version is installed separately.

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 5.8.0-44-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Could not unload NVIDIA driver kernel modules, driver is in use
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Could not unload NVIDIA driver kernel modules, driver is in use

angudadevops commented 3 years ago

(Quoting the nvidia-smi output and nvidia-driver-daemonset logs from the previous comment.)

@egamor18, as the log shows, the NVIDIA driver kernel module is in use. Please unload the kernel modules and reboot the instance. If you encounter any more issues related to the GPU Operator, please raise an issue at https://github.com/NVIDIA/gpu-operator/issues.
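
For reference, a minimal sketch of unloading the modules on the host (this assumes nothing such as Xorg or gnome-shell is still using the GPU; stop those first, and note the exact module set can differ):

sudo systemctl stop display-manager                       # stop any desktop session holding the GPU
sudo rmmod nvidia_drm nvidia_modeset nvidia_uvm nvidia    # unload in dependency order
lsmod | grep nvidia                                       # should print nothing once unloaded
sudo reboot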

angudadevops commented 3 years ago

(Quoting @calebmm's kubectl describe pod output from earlier in the thread.)

@calebmm @kazuf3 are you encountering the same issue? Are you able to run the nvidia-smi pod?

egamor18 commented 3 years ago

Regarding "please unload the kernel modules": @angudadevops, can you please show me how to do this? Thank you.

angudadevops commented 3 years ago

@calebmm @kazuf3 as there's no response, closing this issue.