Open PrachiMittal2016 opened 9 months ago
The gpu-operator-1701120700-node-feature-discovery-master pod is crashing with the errors below:
Liveness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
Readiness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
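To check whether this is only a probe timeout or the gRPC server not answering at all, the same probe the kubelet runs can be executed by hand inside the pod with generous deadlines (a sketch; the pod name is the one from my cluster, and grpc_health_probe's -connect-timeout/-rpc-timeout flags are used instead of the kubelet's 1-second default):

```shell
# Run the kubelet's probe manually with longer timeouts to see
# whether the NFD master gRPC server responds at all.
POD=gpu-operator-1701120700-node-feature-discovery-master-d46872jxf
kubectl exec -n gpu-operator "$POD" -- \
  /usr/bin/grpc_health_probe -addr=:8080 -connect-timeout 5s -rpc-timeout 5s
```

If this succeeds with a longer deadline, the server is healthy but slow to answer, which points at probe timing rather than the NFD master itself.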
I followed the GPU Operator on AKS installation guide:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/microsoft-aks.html
Approach 1: Create AKS Cluster with Node Pool Tags to Prevent Driver installation
Create Nodepool without GPU Driver
az aks nodepool add \
  --resource-group az-sre-germanywestcentral \
  --cluster-name cx-aiml-sre-germanywestcentral \
  --name gpuskipdri \
  --node-count 1 \
  --node-vm-size Standard_NC4as_T4_v3 \
  --node-taints sku=gpu:NoSchedule \
  --labels sku=gpu \
  --node-osdisk-type Ephemeral \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 1 \
  --tags SkipGPUDriverInstall=true \
  --os-type Linux
az aks nodepool show --resource-group az-sre-germanywestcentral --cluster-name cx-aiml-sre-germanywestcentral --name gpuskipdri --query tags
{
  "SkipGPUDriverInstall": "true"
}
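Before installing the operator, the new pool can also be checked from the Kubernetes side to confirm the label and taint passed to `az aks nodepool add` actually landed on the nodes (a sketch; `sku=gpu` is the label set at pool creation above):

```shell
# List the new pool's nodes and show their taints, using the label
# applied via `az aks nodepool add --labels sku=gpu`.
LABEL="sku=gpu"
kubectl get nodes -l "$LABEL" \
  -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```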
Install NVIDIA GPU Operator
helm version
version.BuildInfo{Version:"v3.10.3", GitCommit:"835b7334cfe2e5e27870ab3ed4135f136eecc704", GitTreeState:"clean", GoVersion:"go1.18.9"}
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator
helm list --namespace gpu-operator
NAME                     NAMESPACE     REVISION  UPDATED                               STATUS  CHART                 APP VERSION
gpu-operator-1701120700  gpu-operator  1         2023-11-27 16:31:43.779255 -0500 EST  failed  gpu-operator-v23.9.0  v23.9.0
Kubernetes pod status:
kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS             RESTARTS         AGE
gpu-operator-1701120700-node-feature-discovery-gc-779cf9cfjsmfc   1/1     Running            0                90m
gpu-operator-1701120700-node-feature-discovery-master-d46872jxf   0/1     CrashLoopBackOff   31 (4m16s ago)   90m
gpu-operator-1701120700-node-feature-discovery-worker-4w9tm       1/1     Running            0                90m
gpu-operator-1701120700-node-feature-discovery-worker-76b58       1/1     Running            0                90m
gpu-operator-1701120700-node-feature-discovery-worker-8zxk6       1/1     Running            0                90m
gpu-operator-1701120700-node-feature-discovery-worker-c6xnb       1/1     Running            0                90m
gpu-operator-1701120700-node-feature-discovery-worker-l5js7       1/1     Running            0                90m
gpu-operator-1701120700-node-feature-discovery-worker-wfq95       1/1     Running            0                90m
gpu-operator-1701120700-node-feature-discovery-worker-zkvct       1/1     Running            0                90m
gpu-operator-75dc4c6dd6-prjdj                                     1/1     Running            0                90m
Kubernetes DaemonSet status:
kubectl get ds -n gpu-operator
NAME                                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
gpu-operator-1701120700-node-feature-discovery-worker   7         7         7       7            7
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl describe pod gpu-operator-1701120700-node-feature-discovery-master-d46872jxf -n gpu-operator
Name: gpu-operator-1701120700-node-feature-discovery-master-d46872jxf
Namespace: gpu-operator
Priority: 0
Service Account: node-feature-discovery
Node: aks-micservices-74078279-vmss000008/10.240.0.11
Start Time: Mon, 27 Nov 2023 16:31:49 -0500
Labels: app.kubernetes.io/instance=gpu-operator-1701120700
app.kubernetes.io/name=node-feature-discovery
pod-template-hash=d468b9bc
role=master
Annotations:
Warning  Unhealthy  31m (x69 over 91m)     kubelet  Liveness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
Warning  Unhealthy  6m37s (x118 over 91m)  kubelet  Readiness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
Warning  BackOff    103s (x335 over 87m)   kubelet  Back-off restarting failed container
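Since the failures are all probe timeouts (the kubelet's default exec-probe timeoutSeconds is 1), one possible workaround is to raise the probe timeout on the deployment. A sketch, assuming the container index is 0 and using my cluster's deployment name; note a direct kubectl patch may be reverted if the operator or Helm chart reconciles the deployment, so a chart values override would be more durable:

```shell
# Give the kubelet probes more time before it restarts the container.
DEPLOY=gpu-operator-1701120700-node-feature-discovery-master
kubectl patch deployment "$DEPLOY" -n gpu-operator --type=json -p='[
  {"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/timeoutSeconds","value":10},
  {"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/timeoutSeconds","value":10}
]'
```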
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
kubectl logs -n gpu-operator gpu-operator-1701120700-node-feature-discovery-master-d46872jxf --all-containers
I1127 23:03:42.256769       1 main.go:83] "-port is deprecated, will be removed in a future release along with the deprecated gRPC API"
I1127 23:03:42.256879       1 nfd-master.go:213] "Node Feature Discovery Master" version="v0.14.2" nodeName="aks-micservices-74078279-vmss000008" namespace="gpu-operator"
I1127 23:03:42.257112       1 nfd-master.go:1214] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I1127 23:03:42.257462       1 nfd-master.go:1274] "configuration successfully updated" configuration=<
	DenyLabelNs: {}
	EnableTaints: false
	ExtraLabelNs:
	  nvidia.com: {}
	Klog: {}
	LabelWhiteList: {}
	LeaderElection:
	  LeaseDuration:
	    Duration: 15000000000
	  RenewDeadline:
	    Duration: 10000000000
	  RetryPeriod:
	    Duration: 2000000000
	NfdApiParallelism: 10
	NoPublish: false
	ResourceLabels: {}
	ResyncPeriod:
	  Duration: 3600000000000
I1127 23:03:42.257487       1 nfd-master.go:1338] "starting the nfd api controller"
I1127 23:03:42.257716       1 node-updater-pool.go:79] "starting the NFD master node updater pool" parallelism=10
I1127 23:03:42.286242       1 metrics.go:115] "metrics server starting" port=8081
I1127 23:03:42.286360       1 component.go:36] [core][Server #1] Server created
I1127 23:03:42.286399       1 nfd-master.go:347] "gRPC server serving" port=8080
I1127 23:03:42.286465       1 component.go:36] [core][Server #1 ListenSocket #2] ListenSocket created
I1127 23:03:43.286714       1 nfd-master.go:694] "will process all nodes in the cluster"
Running nvidia-smi from the driver container:
kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
I am not sure which pod that is. For node-level containerd logs:
journalctl -u containerd > containerd.log
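The DRIVER_POD_NAME placeholder can usually be resolved by label: the GPU Operator labels its driver pods app=nvidia-driver-daemonset. A sketch, assuming a driver daemonset was deployed at all (with SkipGPUDriverInstall=true on the node pool, the operator rather than AKS is expected to provide the driver, and nothing past NFD has deployed yet in my cluster):

```shell
# Look up the first driver pod by its daemonset label, then run
# nvidia-smi inside its nvidia-driver-ctr container.
NS=gpu-operator
DRIVER_POD=$(kubectl get pods -n "$NS" -l app=nvidia-driver-daemonset \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$DRIVER_POD" -n "$NS" -c nvidia-driver-ctr -- nvidia-smi
```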
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: see the must-gather script for the debug data it collects.
This bundle can be submitted to us via email: operator_feedback@nvidia.com
@ArangoGutierrez any thoughts?
gpu-operator-issue1.txt