NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Make vgpu failures non-fatal #672

Open elezar opened 3 weeks ago

elezar commented 3 weeks ago

This change treats errors encountered while constructing vGPU labels as warnings rather than fatal errors.

If such an error occurs, the nvidia.com/vgpu.present label is set to false and label generation continues, instead of an error being raised.

For example, on my Mac, where /sys/bus/pci/devices does not exist:

./gpu-feature-discovery --oneshot --output="" --node-name=foo
I0422 20:59:12.321562   63053 main.go:139] Starting OS watcher.
I0422 20:59:12.321919   63053 main.go:144] Loading configuration.
I0422 20:59:12.323056   63053 main.go:156]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "gdsEnabled": null,
    "mofedEnabled": null,
    "useNodeFeatureAPI": false,
    "gfd": {
      "oneshot": true,
      "noTimestamp": false,
      "sleepInterval": "1m0s",
      "outputFile": "",
      "machineTypeFile": "/sys/class/dmi/id/product_name"
    }
  },
  "resources": {
    "gpus": null
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0422 20:59:12.323797   63053 factory.go:49] Detected non-NVML platform: could not load NVML library: dlopen(libnvidia-ml.so.1, 0x0001): tried: 'libnvidia-ml.so.1' (no such file), '/System/Volumes/Preboot/Cryptexes/OSlibnvidia-ml.so.1' (no such file), '/usr/lib/libnvidia-ml.so.1' (no such file, not in dyld cache), 'libnvidia-ml.so.1' (no such file)
I0422 20:59:12.323835   63053 factory.go:49] Detected non-Tegra platform: /sys/devices/soc0/family file not found
W0422 20:59:12.323847   63053 factory.go:72] No valid resources detected; using empty manager.
I0422 20:59:12.323853   63053 main.go:170] Start running
E0422 20:59:12.323900   63053 vgpu.go:41] "unable to get vGPU devices" err="error getting NVIDIA specific PCI devices: unable to read PCI bus devices: open /sys/bus/pci/devices: no such file or directory"
I0422 20:59:12.323917   63053 main.go:239] Creating Labels
nvidia.com/gfd.timestamp=1713812352
nvidia.com/vgpu.present=false
I0422 20:59:12.323928   63053 main.go:136] Exiting