I am trying to install the GPU Operator using Helm. During installation, the driver pod (nvidia-driver-daemonset-fwcvl) fails with the error below.
Below are the pod logs; I have omitted the initial part and kept only the error lines.
'[' '' '!=' builtin ']'
Updating the package cache...
echo 'Updating the package cache...'
yum -q makecache
Error: Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
FATAL: failed to reach RHEL package repositories. Ensure that the cluster can access the proper networks.
echo 'FATAL: failed to reach RHEL package repositories. ' 'Ensure that the cluster can access the proper networks.'
Warning BackOff 3m53s (x350 over 87m) kubelet Back-off restarting failed container nvidia-driver-ctr in pod nvidia-driver-daemonset-fwcvl_gpu-operator(1ab5bc39-dd70-411f-9592-a6b5b69ff723)
Any help on this issue would be very much appreciated.
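For what it is worth, the same repository check can be reproduced directly on the affected node (lab-worker-4, per the pod description below) to separate a host-level subscription or network problem from something specific to the driver container. A minimal sketch, assuming SSH and sudo access to the node:

# Run on lab-worker-4 (host access assumed)
sudo subscription-manager status                 # entitlement must be valid for the rhel-8-for-x86_64-*-rpms repos
sudo subscription-manager repos --list-enabled   # confirm rhel-8-for-x86_64-appstream-rpms is enabled
sudo yum -q makecache                            # the same command the driver container fails on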
1. Quick Debug Information
2. Issue or feature description
As described above, the driver pod (nvidia-driver-daemonset-fwcvl) fails during installation: the driver container runs yum -q makecache, cannot download metadata for the rhel-8-for-x86_64-appstream-rpms repository ("All mirrors were tried"), and exits with "FATAL: failed to reach RHEL package repositories. Ensure that the cluster can access the proper networks." The full error logs are included at the top of this report.
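The initial portion of the driver container log was omitted above; the full output can be pulled again with the command below (the --previous flag assumes the container has already restarted, which the events confirm):

kubectl logs -n gpu-operator nvidia-driver-daemonset-fwcvl -c nvidia-driver-ctr --previous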
[ ] kubernetes pods status:
kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-zqm9h 0/1 Init:0/1 0 86m
gpu-operator-1700756391-node-feature-discovery-gc-5c546559bfmj2 1/1 Running 0 93m
gpu-operator-1700756391-node-feature-discovery-master-79796bzcb 1/1 Running 0 93m
gpu-operator-1700756391-node-feature-discovery-worker-6ddld 1/1 Running 0 93m
gpu-operator-1700756391-node-feature-discovery-worker-8c2k4 1/1 Running 0 93m
gpu-operator-1700756391-node-feature-discovery-worker-nzd7b 1/1 Running 0 93m
gpu-operator-1700756391-node-feature-discovery-worker-x8nx9 1/1 Running 0 93m
gpu-operator-68d85f45d-v97fz 1/1 Running 0 93m
nvidia-container-toolkit-daemonset-kqmtx 0/1 Init:0/1 0 86m
nvidia-dcgm-exporter-5ncg7 0/1 Init:0/1 0 86m
nvidia-device-plugin-daemonset-qmvhc 0/1 Init:0/1 0 86m
nvidia-driver-daemonset-fwcvl 0/1 CrashLoopBackOff 19 (3m20s ago) 87m
nvidia-operator-validator-vcztn 0/1 Init:0/4 0 86m

[ ] kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 94m
gpu-operator-1700756391-node-feature-discovery-worker 4 4 4 4 4 <none> 94m
nvidia-container-toolkit-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.container-toolkit=true 94m
nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 94m
nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 94m
nvidia-driver-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.driver=true 94m
nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 94m
nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 94m

[ ] If a pod/ds is in an error state or pending state
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
k describe po nvidia-driver-daemonset-fwcvl
Name: nvidia-driver-daemonset-fwcvl
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-driver
Node: lab-worker-4/172.21.1.70
Start Time: Thu, 23 Nov 2023 11:26:21 -0500
Labels: app=nvidia-driver-daemonset
app.kubernetes.io/component=nvidia-driver
app.kubernetes.io/managed-by=gpu-operator
controller-revision-hash=5954d75477
helm.sh/chart=gpu-operator-v23.9.0
nvidia.com/precompiled=false
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: 14eb92fe162f5d1ddcf0d32343f0815ae1325dfca8eb88354d979f7cbc335c5d
cni.projectcalico.org/podIP: 192.168.148.114/32
cni.projectcalico.org/podIPs: 192.168.148.114/32
kubectl.kubernetes.io/default-container: nvidia-driver-ctr
Status: Running
IP: 192.168.148.114
IPs:
IP: 192.168.148.114
Controlled By: DaemonSet/nvidia-driver-daemonset
Init Containers:
k8s-driver-manager:
Container ID: cri-o://b15e393c5603042c1938c49f132a706332ba76bb21dab6ea2d50a0fe2a0cf3b3
Image: nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.4
Image ID: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5ca81f4f7e55f7b304dbbb7aaa235fca2656789145e4b34f47a7ab7079704dc7
Port:
Host Port:
Command:
driver-manager
Args:
uninstall_driver
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 23 Nov 2023 11:26:22 -0500
Finished: Thu, 23 Nov 2023 11:26:54 -0500
Ready: True
Restart Count: 0
Environment:
NODE_NAME: (v1:spec.nodeName)
NVIDIA_VISIBLE_DEVICES: void
ENABLE_GPU_POD_EVICTION: true
ENABLE_AUTO_DRAIN: false
DRAIN_USE_FORCE: false
DRAIN_POD_SELECTOR_LABEL:
DRAIN_TIMEOUT_SECONDS: 0s
DRAIN_DELETE_EMPTYDIR_DATA: false
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
Mounts:
/host from host-root (ro)
/run/nvidia from run-nvidia (rw)
/sys from host-sys (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qphz2 (ro)
Containers:
nvidia-driver-ctr:
Container ID: cri-o://8139fed89018b0c4382884f44dfa1f7146711824baf3029b9b8b416e4e91c9f5
Image: nvcr.io/nvidia/driver:525.125.06-rhel8.6
Image ID: nvcr.io/nvidia/driver@sha256:b58167d31d34784cd7c425961234d67c5e2d22eb4a5312681d0337dae812f746
Port:
Host Port:
Command:
nvidia-driver
Args:
init
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Thu, 23 Nov 2023 12:49:50 -0500
Finished: Thu, 23 Nov 2023 12:50:24 -0500
Ready: False
Restart Count: 19
Startup: exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
Environment:
Mounts:
/dev/log from dev-log (rw)
/host-etc/os-release from host-os-release (ro)
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
/run/nvidia from run-nvidia (rw)
/run/nvidia-topologyd from run-nvidia-topologyd (rw)
/var/log from var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qphz2 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: DirectoryOrCreate
var-log:
Type: HostPath (bare host directory volume)
Path: /var/log
HostPathType:
dev-log:
Type: HostPath (bare host directory volume)
Path: /dev/log
HostPathType:
host-os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
run-nvidia-topologyd:
Type: HostPath (bare host directory volume)
Path: /run/nvidia-topologyd
HostPathType: DirectoryOrCreate
mlnx-ofed-usr-src:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers/usr/src
HostPathType: DirectoryOrCreate
run-mellanox-drivers:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-sys:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
kube-api-access-qphz2:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.driver=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
Warning BackOff 3m53s (x350 over 87m) kubelet Back-off restarting failed container nvidia-driver-ctr in pod nvidia-driver-daemonset-fwcvl_gpu-operator(1ab5bc39-dd70-411f-9592-a6b5b69ff723)
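For completeness, if the repositories are only reachable through a proxy or a local mirror, the GPU Operator chart documents driver.env (for proxy variables) and driver.repoConfig.configMapName (for a custom yum .repo file) as the way to pass that into the driver container. A minimal sketch, assuming the release name gpu-operator-1700756391 (taken from the resource names above), the nvidia Helm repo alias, and placeholder proxy/repo values:

# Hypothetical proxy URL -- adjust to the actual environment
helm upgrade gpu-operator-1700756391 nvidia/gpu-operator -n gpu-operator --reuse-values \
  --set driver.env[0].name=HTTPS_PROXY \
  --set driver.env[0].value=http://proxy.example.com:3128

# Or ship a custom yum repo configuration to the driver container (custom-repo.repo is a placeholder file)
kubectl create configmap repo-config -n gpu-operator --from-file=custom-repo.repo
helm upgrade gpu-operator-1700756391 nvidia/gpu-operator -n gpu-operator --reuse-values \
  --set driver.repoConfig.configMapName=repo-config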