NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

No devices were found in openshift #641

Open garyyang85 opened 9 months ago

garyyang85 commented 9 months ago

1. Quick Debug Information

2. Issue or feature description

The nvidia-driver-daemonset-xx pod reports "Startup probe failed: No devices were found" in its events, but I can see that the V100 GPU is present on the OS. Below is the lspci output:

03:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
0b:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
13:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
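
For reference, the probe failure shows up in the driver pod's events. A rough sketch of how to see it (this assumes the default nvidia-gpu-operator namespace; nvidia-driver-daemonset-xx is a placeholder pod name):

# List the driver daemonset pods created by the operator
oc get pods -n nvidia-gpu-operator | grep nvidia-driver-daemonset

# The Events section at the end of the output shows the failing startup probe
oc describe pod -n nvidia-gpu-operator nvidia-driver-daemonset-xx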

3. Steps to reproduce the issue

Deploy the GPU Operator with the following ClusterPolicy definition:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  creationTimestamp: '2023-12-20T13:06:29Z'
  generation: 2
  name: gpu-cluster-policy
  resourceVersion: '275859864'
  uid: 71e06b17-5b47-4ab0-aae9-8034a2e30e42
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      configMapName: ''
      nlsEnabled: false
    enabled: true
    certConfig:
      name: ''
    repository: nvcr.io/nvidia
    kernelModuleConfig:
      name: ''
    usePrecompiled: false
    upgradePolicy:
      autoUpgrade: false
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    version: 535.104.05
    virtualTopology:
      config: ''
    image: driver
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
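
For completeness, this is roughly how I apply and check the policy (a sketch; clusterpolicy.yaml is the file holding the definition above, and status.state is the field recent operator versions report):

# Apply the ClusterPolicy and let the operator reconcile it
oc apply -f clusterpolicy.yaml

# Should eventually report "ready"; it stays "notReady" while the driver pod is failing
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}'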
cdesiniotis commented 8 months ago

@garyyang85 "No devices were found" typically indicates that GPU initialization failed. Can you collect the kernel logs by running dmesg | grep -i nvrm on the host?
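
On OpenShift, one way to get at the host kernel log is through a node debug pod, roughly:

# Open a debug shell on the GPU node and switch into the host filesystem
oc debug node/<gpu-node-name>
chroot /host

# Look for NVRM (NVIDIA kernel driver) messages
dmesg | grep -i nvrm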

fzhan commented 4 months ago

Running dmesg | grep -i nvrm gives me "[189160.303788] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.54.15".