NVIDIA / gpu-feature-discovery

GPU plugin to the node feature discovery for Kubernetes
Apache License 2.0

Failed to construct MIG resource labeler #24

Closed chi2liu closed 1 year ago

chi2liu commented 2 years ago

I deployed gpu-feature-discovery with Helm, but the pod logs show the following error:

gpu-feature-discovery: 2022/06/15 03:25:50 Start running
gpu-feature-discovery: 2022/06/15 03:25:50 Warning: Error removing output file: failed to remove output file: remove /etc/kubernetes/node-feature-discovery/features.d/gfd: no such file or directory
gpu-feature-discovery: 2022/06/15 03:25:50 Exiting
gpu-feature-discovery: 2022/06/15 03:25:50 Error: error creating NVML labeler: error creating resource labeler: failed to construct MIG resource labeler: failed to create labeler for mig-strategy=single: failed to check for empty MIG-enabled devices: NVML error: Not Found
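
The error message above is a chain of wrapped errors joined by ": ", so the root cause is the rightmost part (NVML returning "Not Found" while checking for empty MIG-enabled devices). As a reading aid only, here is a minimal sketch that splits the chain into one level per line. The helper name `split_error_chain` is hypothetical and not part of gpu-feature-discovery; the error text is copied from the log.

```shell
# Error text copied verbatim from the log above.
err='error creating NVML labeler: error creating resource labeler: failed to construct MIG resource labeler: failed to create labeler for mig-strategy=single: failed to check for empty MIG-enabled devices: NVML error: Not Found'

# Hypothetical helper: split a wrapped error on ": " and indent each
# level by two more spaces, so the innermost cause ends up last.
split_error_chain() {
  printf '%s\n' "$1" | awk -F': ' '{
    indent = ""
    for (i = 1; i <= NF; i++) {
      print indent $i
      indent = indent "  "
    }
  }'
}

split_error_chain "$err"
```

Splitting on ": " also separates "NVML error" from "Not Found", which is acceptable for a quick visual breakdown.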

klueska commented 2 years ago

Can you give us a bit more detail on your setup? What type of GPU do you have? What is your driver version? How are you deploying / running gpu-feature-discovery, etc.

chi2liu commented 2 years ago

We have a k8s cluster (v1.21.1) with a dual-GPU A100 node. I enabled MIG mode and configured the node as 4 MIG devices.

NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6

sudo nvidia-smi -mig 1
nvidia-smi --query-gpu=pci.bus_id,mig.mode.current --format=csv
nvidia-smi
sudo reboot
sudo nvidia-smi mig --list-gpu-instance-profiles
sudo nvidia-smi mig -cgi 9,3g.20gb -C
sudo nvidia-smi mig -lgi

I installed k8s-device-plugin version 0.12.1 with migStrategy=single. Then I installed gpu-feature-discovery with Helm:

helm upgrade -i nvgfd --version=0.6.0 --set migStrategy=single nvgfd/gpu-feature-discovery --namespace kube-system

The pod ends up in CrashLoopBackOff:

nvgfd-gpu-feature-discovery-jxsk8 0/1 CrashLoopBackOff 41 3h6m

Pod logs:

2022/06/15 06:25:03 Starting OS watcher.
2022/06/15 06:25:03 Loading configuration.
gpu-feature-discovery: 2022/06/15 06:25:03 Running gpu-feature-discovery in version 416d12ce
gpu-feature-discovery: 2022/06/15 06:25:03
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "gfd": {
      "oneshot": false,
      "noTimestamp": false,
      "sleepInterval": "1m0s",
      "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd"
    }
  },
  "resources": {
    "gpus": null
  },
  "sharing": {
    "timeSlicing": {}
  }
}
gpu-feature-discovery: 2022/06/15 06:25:03 Start running
gpu-feature-discovery: 2022/06/15 06:25:03 Warning: Error removing output file: failed to remove output file: remove /etc/kubernetes/node-feature-discovery/features.d/gfd: no such file or directory
gpu-feature-discovery: 2022/06/15 06:25:03 Exiting
gpu-feature-discovery: 2022/06/15 06:25:03 Error: error creating NVML labeler: error creating resource labeler: failed to construct MIG resource labeler: failed to create labeler for mig-strategy=single: failed to check for empty MIG-enabled devices: NVML error: Not Found

klueska commented 2 years ago

We’ve found some bugs in the 0.6.0 release that we are currently working to fix, but this error is slightly different. Can you try installing the 0.5.0 version of GFD and reporting if you see similar errors?
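
A sketch of that downgrade, assuming the same release name (`nvgfd`), chart repo alias, namespace, and `migStrategy` value as in the original install command, with only the chart version changed. The helper `gfd_helm_cmd` is hypothetical; it only prints the command so it can be reviewed before being run by hand.

```shell
# Hypothetical helper (not part of GFD or Helm): print the install
# command from the report above with the chart version swapped in.
gfd_helm_cmd() {
  printf 'helm upgrade -i nvgfd --version=%s --set migStrategy=single nvgfd/gpu-feature-discovery --namespace kube-system\n' "$1"
}

# Print the command for chart version 0.5.0; nothing is executed
# against the cluster here.
gfd_helm_cmd 0.5.0
```

Keeping every flag identical to the failing install means the version is the only variable being tested.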

chi2liu commented 2 years ago

Yes, it works with version 0.5.0. Thanks for your support.

klueska commented 2 years ago

We have published an update to GFD as 0.6.1. Please let us know if it resolves your issue.

elezar commented 2 years ago

@chi2liu were you able to confirm that this resolved your issue?

elezar commented 1 year ago

@chi2liu I am closing this issue. If the latest release still shows this behaviour, please reopen.