GPU-operator install fails - NFD master pod crashes, probes are failing #619

PrachiMittal2016 commented 9 months ago


1. Quick Debug Information

2. Issue or feature description

The gpu-operator-1701120700-node-feature-discovery-master pod is crashing with the error below:

Liveness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
Readiness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out

3. Steps to reproduce the issue


Approach 1: Create AKS Cluster with Node Pool Tags to Prevent Driver installation

Create Nodepool without GPU Driver

az aks nodepool add --resource-group az-sre-germanywestcentral --cluster-name cx-aiml-sre-germanywestcentral --name gpuskipdri --node-count 1 --node-vm-size Standard_NC4as_T4_v3 --node-taints sku=gpu:NoSchedule --labels sku=gpu --node-osdisk-type Ephemeral --enable-cluster-autoscaler --tags SkipGPUDriverInstall=true --os-type Linux --min-count 1 --max-count 1

az aks nodepool show --resource-group az-sre-germanywestcentral --cluster-name cx-aiml-sre-germanywestcentral --name gpuskipdri --query tags
{ "SkipGPUDriverInstall": "true" }

Install NVIDIA GPU Operator


helm version
version.BuildInfo{Version:"v3.10.3", GitCommit:"835b7334cfe2e5e27870ab3ed4135f136eecc704", GitTreeState:"clean", GoVersion:"go1.18.9"}

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

helm list --namespace gpu-operator
NAME                     NAMESPACE     REVISION  UPDATED                               STATUS  CHART                 APP VERSION
gpu-operator-1701120700  gpu-operator  1         2023-11-27 16:31:43.779255 -0500 EST  failed  gpu-operator-v23.9.0  v23.9.0
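The `failed` status is probably a consequence of `--wait` timing out while the NFD master pod never becomes Ready, though that is an assumption and not something the output above confirms. A minimal sketch for pulling more detail out of the release follows. `inspect_release` is a hypothetical helper name; `helm status` and `helm history` are standard Helm 3 subcommands.

```shell
# Hypothetical helper (not part of Helm): print the status and the
# revision history of a release so the failure description is visible.
inspect_release() {
    release="$1"
    namespace="$2"
    # Current state of the release, including its last deployment status
    helm status "$release" --namespace "$namespace"
    # One line per revision, with a description of what happened to it
    helm history "$release" --namespace "$namespace"
}
```

With the names from this report it would be called as `inspect_release gpu-operator-1701120700 gpu-operator`.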

4. Information to attach (optional if deemed irrelevant)

kubernetes pods status:

kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS             RESTARTS         AGE
gpu-operator-1701120700-node-feature-discovery-gc-779cf9cfjsmfc   1/1     Running            0                90m
gpu-operator-1701120700-node-feature-discovery-master-d46872jxf   0/1     CrashLoopBackOff   31 (4m16s ago)   90m
gpu-operator-1701120700-node-feature-discovery-worker-4w9tm       1/1     Running            0                90m
gpu-operator-1701120700-node-feature-discovery-worker-76b58       1/1     Running            0                90m
gpu-operator-1701120700-node-feature-discovery-worker-8zxk6       1/1     Running            0                90m
gpu-operator-1701120700-node-feature-discovery-worker-c6xnb       1/1     Running            0                90m
gpu-operator-1701120700-node-feature-discovery-worker-l5js7       1/1     Running            0                90m
gpu-operator-1701120700-node-feature-discovery-worker-wfq95       1/1     Running            0                90m
gpu-operator-1701120700-node-feature-discovery-worker-zkvct       1/1     Running            0                90m
gpu-operator-75dc4c6dd6-prjdj                                     1/1     Running            0                90m

kubernetes daemonset status:

kubectl get ds -n gpu-operator
NAME                                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
gpu-operator-1701120700-node-feature-discovery-worker   7         7         7       7            7           <none>          91m

kubectl describe pod gpu-operator-1701120700-node-feature-discovery-master-d46872jxf -n gpu-operator
Name:             gpu-operator-1701120700-node-feature-discovery-master-d46872jxf
Namespace:        gpu-operator
Priority:         0
Service Account:  node-feature-discovery
Node:             aks-micservices-74078279-vmss000008/
Start Time:       Mon, 27 Nov 2023 16:31:49 -0500
Labels:           app.kubernetes.io/instance=gpu-operator-1701120700
                  app.kubernetes.io/name=node-feature-discovery
                  pod-template-hash=d468b9bc
                  role=master
Annotations:      <none>
Status:           Running
IP:
IPs:
  IP:
Controlled By:  ReplicaSet/gpu-operator-1701120700-node-feature-discovery-master-d468b9bc
Containers:
  master:
    Container ID:  containerd://596c5399b024736718c29776c9f6f10b9927eb7bbfdefe004d9e7e479c3acd34
    Image:         registry.k8s.io/nfd/node-feature-discovery:v0.14.2
    Image ID:      registry.k8s.io/nfd/node-feature-discovery@sha256:2a56d172c48b76531eb719780224ef278daa68b9088f592f16df2519bed08de4
    Ports:         8080/TCP, 8081/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      nfd-master
    Args:
      -port=8080
      -crd-controller=true
      -metrics=8081
    State:          Running
      Started:      Mon, 27 Nov 2023 18:03:08 -0500
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 27 Nov 2023 17:57:23 -0500
      Finished:     Mon, 27 Nov 2023 17:58:01 -0500
    Ready:          False
    Restart Count:  32
    Liveness:       exec [/usr/bin/grpc_health_probe -addr=:8080] delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      exec [/usr/bin/grpc_health_probe -addr=:8080] delay=5s timeout=1s period=10s #success=1 #failure=10
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /etc/kubernetes/node-feature-discovery from nfd-master-conf (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v8tws (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  nfd-master-conf:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      gpu-operator-1701120700-node-feature-discovery-master-conf
    Optional:  false
  kube-api-access-v8tws:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Warning  Unhealthy  31m (x69 over 91m)     kubelet  Liveness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
  Warning  Unhealthy  6m37s (x118 over 91m)  kubelet  Readiness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
  Warning  BackOff    103s (x335 over 87m)   kubelet  Back-off restarting failed container
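Both probes run `/usr/bin/grpc_health_probe -addr=:8080` with `timeout=1s`, so one way to narrow this down might be to run the same command by hand inside the container and look at its exit status. The sketch below does that. `run_probe` is a hypothetical helper name, and it assumes the container happens to be running when `kubectl exec` is issued, which is not guaranteed for a pod in CrashLoopBackOff.

```shell
# Hypothetical helper: run the probe command from the pod spec manually
# and report its exit status (0 means the gRPC health check succeeded).
run_probe() {
    pod="$1"
    namespace="$2"
    status=0
    # Same binary and arguments as the liveness/readiness probes above
    kubectl exec --namespace "$namespace" "$pod" -- \
        /usr/bin/grpc_health_probe -addr=:8080 || status=$?
    echo "grpc_health_probe exit status: $status"
    return "$status"
}
```

For this report the call would be `run_probe gpu-operator-1701120700-node-feature-discovery-master-d46872jxf gpu-operator`. A non-zero status here would suggest the gRPC endpoint itself is not answering, as opposed to the 1s probe timeout simply being too tight.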

kubectl logs -n gpu-operator gpu-operator-1701120700-node-feature-discovery-master-d46872jxf --all-containers
I1127 23:03:42.256769 1 main.go:83] "-port is deprecated, will be removed in a future release along with the deprecated gRPC API"
I1127 23:03:42.256879 1 nfd-master.go:213] "Node Feature Discovery Master" version="v0.14.2" nodeName="aks-micservices-74078279-vmss000008" namespace="gpu-operator"
I1127 23:03:42.257112 1 nfd-master.go:1214] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I1127 23:03:42.257462 1 nfd-master.go:1274] "configuration successfully updated" configuration=<
    DenyLabelNs: {}
    EnableTaints: false
    ExtraLabelNs:
      nvidia.com: {}
    Klog: {}
    LabelWhiteList: {}
    LeaderElection:
      LeaseDuration:
        Duration: 15000000000
      RenewDeadline:
        Duration: 10000000000
      RetryPeriod:
        Duration: 2000000000
    NfdApiParallelism: 10
    NoPublish: false
    ResourceLabels: {}
    ResyncPeriod:
      Duration: 3600000000000
 >
I1127 23:03:42.257487 1 nfd-master.go:1338] "starting the nfd api controller"
I1127 23:03:42.257716 1 node-updater-pool.go:79] "starting the NFD master node updater pool" parallelism=10
I1127 23:03:42.286242 1 metrics.go:115] "metrics server starting" port=8081
I1127 23:03:42.286360 1 component.go:36] [core][Server #1] Server created
I1127 23:03:42.286399 1 nfd-master.go:347] "gRPC server serving" port=8080
I1127 23:03:42.286465 1 component.go:36] [core][Server #1 ListenSocket #2] ListenSocket created
I1127 23:03:43.286714 1 nfd-master.go:694] "will process all nodes in the cluster"
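The log above comes from the currently running container and only shows a normal startup, while the describe output reports that the previous container instance terminated with Exit Code 2. The output of that terminated instance may be more informative. A minimal sketch for fetching it follows. `previous_logs` is a hypothetical helper name, `--previous` is a standard `kubectl logs` flag, and the container name `master` is taken from the describe output.

```shell
# Hypothetical helper: fetch the logs of the last terminated instance
# of the "master" container instead of the one that is running now.
previous_logs() {
    pod="$1"
    namespace="$2"
    kubectl logs --namespace "$namespace" "$pod" --container master --previous
}
```

Here that would be `previous_logs gpu-operator-1701120700-node-feature-discovery-master-d46872jxf gpu-operator`.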

