NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

OpenShift with A100 GPU: driver pod not getting ready and nvidia-smi outputs "No devices were found" #653

Open · anoopsinghnegi opened 10 months ago

anoopsinghnegi commented 10 months ago

1. Quick Debug Information

2. Issue or feature description

The GPU Operator's driver pod fails to become ready. The ClusterPolicy was installed with defaults, with "use_ocp_driver_toolkit" selected. The driver pod does not become ready because its startup probe fails with "No devices were found". The GPU node has an A100.

[root@worker6 driver]# lspci | grep -i nvidia
13:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
[root@worker6 driver]#
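As a hedged aside (not part of the original report), the lspci output shows the GPU is visible on the PCI bus; whether the NVIDIA kernel module actually loaded can be checked on the same node:

# Check whether the NVIDIA kernel modules are loaded on the GPU node
lsmod | grep nvidia

# This directory only exists once the driver module is loaded
ls /proc/driver/nvidia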

Since the driver pod is not fully ready (the nvidia-driver-ctr container does not become ready), the other pods are stuck in the Init state:

[core@master0 tmp]$ kubectl get po -n nvidia-gpu-operator
NAME                                                  READY   STATUS     RESTARTS      AGE
gpu-feature-discovery-mdkg4                           0/1     Init:0/1   0             52m
gpu-operator-595587c664-96gsn                         1/1     Running    0             114m
nvidia-container-toolkit-daemonset-vpmlh              0/1     Init:0/1   0             52m
nvidia-dcgm-2gxdj                                     0/1     Init:0/1   0             52m
nvidia-dcgm-exporter-q96qf                            0/1     Init:0/2   0             52m
nvidia-device-plugin-daemonset-stb74                  0/1     Init:0/1   0             52m
nvidia-driver-daemonset-412.86.202311271639-0-qkdsz   1/2     Running    2 (11m ago)   53m
nvidia-node-status-exporter-fmnkb                     1/1     Running    0             53m
nvidia-operator-validator-h8kwp                       0/1     Init:0/4   0             52m
[core@master0 tmp]$
[core@master0 tmp]$ kubectl exec -ti nvidia-driver-daemonset-412.86.202311271639-0-qkdsz -n nvidia-gpu-operator -- nvidia-smi
No devices were found
command terminated with exit code 6
[core@master0 tmp]$
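As a hedged aside (not part of the original report), the failing container can be inspected further with standard kubectl commands, assuming the pod name from the output above:

# Logs of the driver container, which should show where driver installation or module load stops
kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-412.86.202311271639-0-qkdsz -c nvidia-driver-ctr

# Pod events, including the startup probe failures
kubectl describe pod -n nvidia-gpu-operator nvidia-driver-daemonset-412.86.202311271639-0-qkdsz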

We also tried setting use_ocp_driver_toolkit to "false" and applied the entitlement, but even then the driver pod failed to download the packages kernel-headers-4.18.0-372.82.1.el8_6.x86_64 and kernel-devel-4.18.0-372.82.1.el8_6.x86_64 and went into a CrashLoopBackOff state. The following error appears:

Installing Linux kernel headers...
+ echo 'Installing Linux kernel headers...'
+ dnf -q -y --releasever=8.6 install kernel-headers-4.18.0-372.82.1.el8_6.x86_64 kernel-devel-4.18.0-372.82.1.el8_6.x86_64
Error: Unable to find a match: kernel-headers-4.18.0-372.82.1.el8_6.x86_64 kernel-devel-4.18.0-372.82.1.el8_6.x86_64
++ rm -rf /tmp/tmp.ffQrW5HEcg
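As a hedged note (not from the original report): this dnf error usually means the enabled repos do not provide packages for the node's exact kernel. A quick comparison, assuming access to the node and to the driver pod:

# Running kernel on the GPU node; the driver container needs kernel-headers/kernel-devel of exactly this version
uname -r

# From inside the driver container: which kernel packages the entitled repos actually offer
dnf -q --releasever=8.6 list available 'kernel-headers*' 'kernel-devel*'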

3. Steps to reproduce the issue

Use the following manifest YAML to create the ClusterPolicy:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: ''
      nlsEnabled: false
    repoConfig:
      configMapName: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
status:
  namespace: nvidia-gpu-operator
  state: notReady
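For reference, a minimal sketch of applying the manifest and watching the operands come up, assuming it is saved as clusterpolicy.yaml (illustrative file name; the status block above is populated by the operator, not part of what gets applied):

# Create the ClusterPolicy and watch the operand pods in the operator namespace
oc apply -f clusterpolicy.yaml
oc get pods -n nvidia-gpu-operator -w

# Overall state reported by the operator (becomes ready once all operands are healthy)
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}'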

Let us know if more information is required.

cdesiniotis commented 10 months ago

"No devices were found" typically indicates the driver failed to initialize. Can you collect system logs by running dmesg | grep -i nvrm on the host?
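For an OpenShift node, one way to run this (a sketch, assuming worker6 is the GPU node as in the lspci output above) is through a node debug pod:

# Run dmesg on the host from a debug pod and filter for NVIDIA driver (NVRM) messages
oc debug node/worker6 -- chroot /host sh -c 'dmesg | grep -i nvrm'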