NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

v22.9.0 - nvidia-driver-daemonset/nvidia-driver-ctr fails to start #457

Closed · jeremy-london closed this issue 1 year ago

jeremy-london commented 1 year ago

1. Quick Debug Checklist

root@rke2-server:~# lsmod | grep -i ipmi_msghandler
ipmi_msghandler       106496  1 ipmi_devintf

- [X] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)

### 1. Issue or feature description
`nvidia-driver-ctr` runs and then exits, which causes `nvidia-driver-daemonset` to fail -- blocking the rest of the process from continuing as the subsequent checks fail

### 2. Steps to reproduce the issue
Ubuntu 20.04.5 Server
Default Hardening: https://github.com/konstruktoid/hardening  `sudo bash ubuntu.sh`
RKE2 Install: https://docs.rke2.io/install/quickstart
[Nvidia GPU Operator Helm Install](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-helm)

Running behind a MITM proxy, with nodes set up with proxy/CA trust:
- the driver and the Helm chart have been configured to add proxy env vars and a `certConfig` for the driver
- public registries are mirrored through a proxy cache, and each node's containerd is configured with matching registry settings (a hedged sketch follows)
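For reference, a sketch of the kind of per-node mirror configuration in use, assuming RKE2's `registries.yaml` mechanism -- the hostnames and CA path below are placeholders, not the actual (redacted) values:

```bash
# Hypothetical per-node containerd mirror config; RKE2 reads
# /etc/rancher/rke2/registries.yaml. Hostnames and CA path are placeholders.
sudo tee /etc/rancher/rke2/registries.yaml <<'EOF'
mirrors:
  nvcr.io:
    endpoint:
      - "https://proxy-cache.example.com/ext.nvcr.io"
configs:
  "proxy-cache.example.com":
    tls:
      ca_file: /usr/local/share/ca-certificates/mitm-proxy-ca.crt
EOF
# Restart rke2-server or rke2-agent (whichever runs on the node) to pick it up.
sudo systemctl restart rke2-agent.service
```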

### 3. Information to [attach](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/) (optional if deemed irrelevant)

#### Helm Config
> `helm upgrade -i gpu-operator nvidia/gpu-operator --namespace gpu-operator --create-namespace -f values.yaml`
`values.yaml`

# Default values for gpu-operator.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

platform:
  openshift: false

nfd:
  enabled: true

psp:
  enabled: true

sandboxWorkloads:
  enabled: true
  defaultWorkload: "container"

daemonsets:
  priorityClassName: system-node-critical
  tolerations:

validator:
  repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
  image: gpu-operator-validator
  version: "v22.9.0"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  args: []
  resources: {}
  plugin:
    env:

operator:
  repository: [readactedurl].com/ext.nvcr.io/nvidia
  image: gpu-operator
  version: "v22.9.0"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  priorityClassName: system-node-critical
  defaultRuntime: containerd
  runtimeClass: nvidia
  use_ocp_driver_toolkit: false
  # cleanup CRD on chart un-install
  cleanupCRD: true
  # upgrade CRD on chart upgrade, requires --disable-openapi-validation flag
  # to be passed during helm upgrade.
  upgradeCRD: true
  initContainer:
    image: cuda
    repository: [readactedurl].com/ext.nvcr.io/nvidia
    version: 11.7.1-base-ubuntu20.04
    imagePullPolicy: IfNotPresent
  tolerations:

mig:
  strategy: single

driver:
  enabled: true
  repository: [readactedurl].com/ext.nvcr.io/nvidia
  image: driver
  version: "515-signed"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  rdma:
    enabled: false
    useHostMofed: false
  manager:
    image: k8s-driver-manager
    repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
    version: v0.4.2
    imagePullPolicy: IfNotPresent
    env:

toolkit:
  enabled: true
  repository: [readactedurl].com/ext.nvcr.io/nvidia/k8s
  image: container-toolkit
  version: v1.11.0-ubuntu20.04
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env:

devicePlugin:
  enabled: true
  repository: [readactedurl].com/ext.nvcr.io/nvidia
  image: k8s-device-plugin
  version: v0.12.3-ubuntu20.04
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  args: []
  env:

# standalone dcgm hostengine
dcgm:
  # disabled by default to use embedded nv-hostengine by exporter
  enabled: false
  repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
  image: dcgm
  version: 3.0.4-1-ubuntu20.04
  imagePullPolicy: IfNotPresent
  hostPort: 5555
  args: []
  env: []
  resources: {}

dcgmExporter:
  enabled: true
  repository: [readactedurl].com/ext.nvcr.io/nvidia/k8s
  image: dcgm-exporter
  version: 3.0.4-3.0.0-ubuntu20.04
  imagePullPolicy: IfNotPresent
  env:

gfd:
  enabled: true
  repository: [readactedurl].com/ext.nvcr.io/nvidia
  image: gpu-feature-discovery
  version: v0.7.0-ubuntu20.04
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env:

migManager:
  enabled: false
  repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
  image: k8s-mig-manager
  version: v0.5.0-ubuntu20.04
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env:

nodeStatusExporter:
  enabled: false
  repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
  image: gpu-operator-validator
  version: "v22.9.0"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  resources: {}

# Experimental and only deploys nvidia-fs driver on Ubuntu
gds:
  enabled: false
  repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
  image: nvidia-fs
  version: "515.43.04"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  args: []

vgpuManager:
  enabled: false
  repository: ""
  image: vgpu-manager
  version: ""
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  resources: {}
  driverManager:
    image: k8s-driver-manager
    repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
    version: v0.4.2
    imagePullPolicy: IfNotPresent
    env:

vgpuDeviceManager:
  enabled: false
  repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
  image: vgpu-device-manager
  version: "v0.2.0"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  config:
    name: ""
    default: "default"

vfioManager:
  enabled: true
  repository: [readactedurl].com/ext.nvcr.io/nvidia
  image: cuda
  version: 11.7.1-base-ubuntu20.04
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  resources: {}
  driverManager:
    image: k8s-driver-manager
    repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
    version: v0.4.2
    imagePullPolicy: IfNotPresent
    env:

sandboxDevicePlugin:
  enabled: true
  repository: [readactedurl].com/ext.nvcr.io/nvidia
  image: kubevirt-gpu-device-plugin
  version: v1.2.1
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  args: []
  env: []
  resources: {}

node-feature-discovery:
  worker:
    tolerations:

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver branch 515 for Linux kernel version 5.4.0-125-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Updating the package cache...
E: Release file for http://us.archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease is not valid yet (invalid for another 3h 58min 53s). Updates for this repository will not be applied.
E: Release file for http://us.archive.ubuntu.com/ubuntu/dists/focal-security/InRelease is not valid yet (invalid for another 3h 57min 43s). Updates for this repository will not be applied.
E: Release file for http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease is not valid yet (invalid for another 3h 58min 52s). Updates for this repository will not be applied.
E: Release file for http://archive.ubuntu.com/ubuntu/dists/focal-security/InRelease is not valid yet (invalid for another 3h 57min 42s). Updates for this repository will not be applied.
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...


 - [X] ~~Output of running a container on the GPU machine: `docker run -it alpine echo foo`~~
 - [X] ~~Docker configuration file: `cat /etc/docker/daemon.json`~~
 - [X] ~~Docker runtime configuration: `docker info | grep runtime`~~
 - [X] ~~NVIDIA shared directory: `ls -la /run/nvidia`~~
 - [X] ~~NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`~~
 - [X] ~~NVIDIA driver directory: `ls -la /run/nvidia/driver`~~
 - [X] kubelet logs `journalctl -u kubelet > kubelet.logs`

Typical pre-driver/pre-toolkit config errors complaining about the runtime class -- nothing out of the ordinary in this log stack.
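For completeness, the runtime-class wiring can be checked directly on the node -- a hedged sketch assuming RKE2's default paths:

```bash
# Confirm the nvidia runtime was added to RKE2's containerd config and that
# the RuntimeClass object exists (paths assume RKE2 defaults).
grep -A3 'nvidia' /var/lib/rancher/rke2/agent/etc/containerd/config.toml
kubectl get runtimeclass nvidia -o yaml
```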

jeremy-london commented 1 year ago

Noticed the 525 version of the driver container was pushed yesterday and tried it out -- same issue. I suspect the package cache might be getting hit with an SSL warning, but I'm not sure, as no logs indicate it.

root@rke2-server:~/gpu-operator# kubectl logs -f nvidia-driver-daemonset-4zzgx --all-containers=true -n gpu-operator
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Getting current value of the 'nodeType' node label(used by NVIDIA Fleet Command)
Current value of 'nodeType='
Shutting GPU Operator components that must be restarted on driver restarts by disabling their component-specific nodeSelector labels
node/rke2-agent.[readactedurl].com labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-ns5t8 condition met
unbinding device 0000:03:00.0
unbinding device 0000:05:00.0
unbinding device 0000:0d:00.0
unbinding device 0000:16:00.0
Uncordoning node rke2-agent.[readactedurl].com...
node/rke2-agent.[readactedurl].com already uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/rke2-agent.[readactedurl].com labeled
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-525.60.13
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.60.13...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.

WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 525.60.13 for Linux kernel version 5.4.0-125-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
E: Release file for http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease is not valid yet (invalid for another 4h 32min 40s). Updates for this repository will not be applied.
E: Release file for http://archive.ubuntu.com/ubuntu/dists/focal-security/InRelease is not valid yet (invalid for another 4h 3min 56s). Updates for this repository will not be applied.
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
shivamerla commented 1 year ago

@jeremy-london Can you double check if the timezone is correctly configured on your node?

jeremy-london commented 1 year ago

@shivamerla I think this is where I am leaning as well -- it appears there is some sort of apt-get update going on in there, and I found that apt-get update on the host shows the same conditions.

I updated the tzdata and that seemed to fix the host -- I'll report back if it fixes the containerd runtime here.

Possibly setting TZ or /etc/timezone in the container would solve it, but ultimately, if it respects the node it's running on, then I'll get each node configured.
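For reference, the host-side check/fix amounted to something like the following (a sketch assuming a systemd host; pick whatever timezone the site actually uses):

```bash
# Verify the clock/timezone so apt Release files no longer appear to be
# "from the future", then refresh tzdata.
timedatectl status
sudo timedatectl set-timezone Etc/UTC
sudo timedatectl set-ntp true
sudo apt-get update && sudo apt-get install --only-upgrade -y tzdata
```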

jeremy-london commented 1 year ago

Changed some configs around -- the product required a FIPS-compliant kernel, so I had to rebuild today.

DISA STIG + FIPS-updates enabled.

Got things back to the same state and re-ran the nvidia-gpu-operator:

root@rke2-server:~# kubectl logs -f nvidia-driver-daemonset-8l6jj --all-containers=true -n gpu-operator
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
DRIVER_ARCH is x86_64
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Creating directory NVIDIA-Linux-x86_64-525.60.13
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Verifying archive integrity... OK
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Getting current value of the 'nodeType' node label(used by NVIDIA Fleet Command)
Current value of 'nodeType='
Shutting GPU Operator components that must be restarted on driver restarts by disabling their component-specific nodeSelector labels
node/rke2-agent.[redactedurl].com labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-v595f condition met
unbinding device 0000:03:00.0
unbinding device 0000:0c:00.0
unbinding device 0000:15:00.0
unbinding device 0000:1e:00.0
Uncordoning node rke2-agent.[redactedurl].com...
node/rke2-agent.[redactedurl].com already uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/rke2-agent.[redactedurl].com labeled
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.60.13...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.

WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 525.60.13 for Linux kernel version 5.4.0-1068-fips

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
root@rke2-server:~# uname -r
5.4.0-1068-fips

Seeing the error `Could not resolve Linux kernel version` -- is this kernel not supported?

shivamerla commented 1 year ago

@jeremy-london the error is from here. Can you run the command below on the node to make sure kernel headers are available for this kernel?

KERNEL_VERSION=5.4.0-1068-fips && \
apt-cache show "linux-headers-${KERNEL_VERSION}" 2> /dev/null | \
      sed -nE 's/^Version:\s+(([0-9]+\.){2}[0-9]+)[-.]([0-9]+).*/\1-\3/p' | head -1
jeremy-london commented 1 year ago
root@rke2-server:~# KERNEL_VERSION=5.4.0-1068-fips && \
> apt-cache show "linux-headers-${KERNEL_VERSION}" 2> /dev/null | \
>       sed -nE 's/^Version:\s+(([0-9]+\.){2}[0-9]+)[-.]([0-9]+).*/\1-\3/p' | head -1
5.4.0-1068
shivamerla commented 1 year ago

@jeremy-london looks like we need to make the Ubuntu Advantage repositories configured on the host accessible to the driver container. Please follow the instructions here to create a ConfigMap with these repositories and inject them into the driver container.
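The documented approach is roughly the following (the file name and ConfigMap name here are arbitrary; `driver.repoConfig.configMapName` is the chart value that mounts the ConfigMap into the driver container):

```bash
# Sketch: publish the extra apt repositories as a ConfigMap and point the
# driver container at it via the Helm chart.
cat <<'EOF' > custom-repo.list
deb https://esm.ubuntu.com/fips-updates/ubuntu focal-updates main
EOF
kubectl create configmap repo-config -n gpu-operator --from-file=custom-repo.list
helm upgrade -i gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  -f values.yaml \
  --set driver.repoConfig.configMapName=repo-config
```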

jeremy-london commented 1 year ago

Moreover -- I just tested again with 515-signed, hoping that might support FIPS kernels, but no dice.

root@rke2-server:~# kubectl logs -f nvidia-driver-daemonset-2g8hk --all-containers=true -n gpu-operator

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver branch 515 for Linux kernel version 5.4.0-1068-fips

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Updating the package cache...
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Getting current value of the 'nodeType' node label(used by NVIDIA Fleet Command)
Current value of 'nodeType='
Shutting GPU Operator components that must be restarted on driver restarts by disabling their component-specific nodeSelector labels
node/rke2-agent.[redactedurl].com labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-j799c condition met
unbinding device 0000:03:00.0
unbinding device 0000:0c:00.0
unbinding device 0000:15:00.0
unbinding device 0000:1e:00.0
Uncordoning node rke2-agent.[redactedurl].com...
node/rke2-agent.[redactedurl].com already uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/rke2-agent.[redactedurl].com labeled
Installing NVIDIA driver kernel modules...
Hit:1 http://us.archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:2 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:3 http://us.archive.ubuntu.com/ubuntu focal-security InRelease
Hit:4 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu focal-security InRelease
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package linux-objects-nvidia-515-server-5.4.0-1068-fips
E: Couldn't find any package by glob 'linux-objects-nvidia-515-server-5.4.0-1068-fips'
E: Couldn't find any package by regex 'linux-objects-nvidia-515-server-5.4.0-1068-fips'
E: Unable to locate package linux-signatures-nvidia-5.4.0-1068-fips
E: Couldn't find any package by glob 'linux-signatures-nvidia-5.4.0-1068-fips'
E: Couldn't find any package by regex 'linux-signatures-nvidia-5.4.0-1068-fips'
E: Unable to locate package linux-modules-nvidia-515-server-5.4.0-1068-fips
E: Couldn't find any package by glob 'linux-modules-nvidia-515-server-5.4.0-1068-fips'
E: Couldn't find any package by regex 'linux-modules-nvidia-515-server-5.4.0-1068-fips'
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

If this kernel is not supported, what options exist? Installing the driver directly on the host, and maybe the toolkit as well if similar issues creep up?

shivamerla commented 1 year ago

@jeremy-london can you follow the instructions from comment https://github.com/NVIDIA/gpu-operator/issues/457#issuecomment-1343449476. Yes, another option is to pre-install drivers on the host in this case. The Container Toolkit doesn't need to be pre-installed, as it doesn't have kernel-specific runtime dependencies like the driver container.
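If the pre-installed-driver route is taken, the chart only needs the driver container disabled -- a minimal sketch:

```bash
# Sketch: with the NVIDIA driver installed directly on the FIPS host, tell the
# operator to skip the driver container and manage only toolkit/plugins.
helm upgrade -i gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  -f values.yaml \
  --set driver.enabled=false
```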

jeremy-london commented 1 year ago

@shivamerla Seems we are getting closer --

I added the following to a file, then created the ConfigMap and ran a helm upgrade with the new settings:

deb https://esm.ubuntu.com/cis/ubuntu focal main
# deb-src https://esm.ubuntu.com/cis/ubuntu focal main

deb https://esm.ubuntu.com/infra/ubuntu focal-infra-security main
# deb-src https://esm.ubuntu.com/infra/ubuntu focal-infra-security main

deb https://esm.ubuntu.com/infra/ubuntu focal-infra-updates main
# deb-src https://esm.ubuntu.com/infra/ubuntu focal-infra-updates main

deb https://esm.ubuntu.com/fips-updates/ubuntu focal-updates main
# deb-src https://esm.ubuntu.com/fips-updates/ubuntu focal-updates main

(That's all the extra ones I have on the host.)

Now I'm dealing with a few other packages not coming through, but I'm seeing an NVIDIA-specific one and wondering whether this kernel version is in the support pool for those packages:

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver branch 515 for Linux kernel version 5.4.0-1068-fips

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Updating the package cache...
Installing NVIDIA driver kernel modules...
Hit:1 http://us.archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:2 http://us.archive.ubuntu.com/ubuntu focal-security InRelease
Hit:3 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:4 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu focal-security InRelease
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package linux-objects-nvidia-515-server-5.4.0-1068-fips
E: Couldn't find any package by glob 'linux-objects-nvidia-515-server-5.4.0-1068-fips'
E: Couldn't find any package by regex 'linux-objects-nvidia-515-server-5.4.0-1068-fips'
E: Unable to locate package linux-signatures-nvidia-5.4.0-1068-fips
E: Couldn't find any package by glob 'linux-signatures-nvidia-5.4.0-1068-fips'
E: Couldn't find any package by regex 'linux-signatures-nvidia-5.4.0-1068-fips'
E: Unable to locate package linux-modules-nvidia-515-server-5.4.0-1068-fips
E: Couldn't find any package by glob 'linux-modules-nvidia-515-server-5.4.0-1068-fips'
E: Couldn't find any package by regex 'linux-modules-nvidia-515-server-5.4.0-1068-fips'
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
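A quick way to confirm on the host whether any precompiled NVIDIA module packages exist for this kernel flavour (a hedged sketch; adjust the driver branch as needed):

```bash
# An empty result here means Canonical does not publish precompiled/signed
# NVIDIA modules for this kernel, so the "515-signed" path cannot work.
KERNEL_VERSION=$(uname -r)
apt-cache search "linux-modules-nvidia-.*-${KERNEL_VERSION}"
apt-cache policy "linux-modules-nvidia-515-server-${KERNEL_VERSION}" 2>/dev/null
```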
shivamerla commented 1 year ago

@jeremy-london yes, that is correct -- precompiled packages are not available for this kernel. Please use `driver.version` as `525.60.13` instead of `515-signed`.
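For reference, that change could be applied with something like the following (values file otherwise unchanged):

```bash
# Switch from the precompiled "515-signed" tag to a plain driver version,
# which builds the kernel modules inside the driver container instead.
helm upgrade -i gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  -f values.yaml \
  --set driver.version=525.60.13
```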