Closed chi2liu closed 1 year ago
Can you give us a bit more detail on your setup? What type of GPU do you have? What is your driver version? How are you deploying and running gpu-feature-discovery?
We have a k8s cluster (v1.21.1) with a node that has two A100 GPUs. I enabled MIG mode and configured the node as 4 MIG devices:
NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6
sudo nvidia-smi -mig 1;
nvidia-smi --query-gpu=pci.bus_id,mig.mode.current --format=csv;
nvidia-smi;
sudo reboot
sudo nvidia-smi mig --list-gpu-instance-profiles
sudo nvidia-smi mig -cgi 9,3g.20gb -C
sudo nvidia-smi mig -lgi
I installed k8s-device-plugin version 0.12.1 with migStrategy=single.
Then I installed gpu-feature-discovery with helm, and the pod ends up in CrashLoopBackOff:
nvgfd-gpu-feature-discovery-jxsk8 0/1 CrashLoopBackOff 41 3h6m
helm upgrade -i nvgfd --version=0.6.0 --set migStrategy=single nvgfd/gpu-feature-discovery --namespace kube-system
Logs of the pod:
2022/06/15 06:25:03 Starting OS watcher.
2022/06/15 06:25:03 Loading configuration.
gpu-feature-discovery: 2022/06/15 06:25:03 Running gpu-feature-discovery in version 416d12ce
gpu-feature-discovery: 2022/06/15 06:25:03
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "gfd": {
      "oneshot": false,
      "noTimestamp": false,
      "sleepInterval": "1m0s",
      "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd"
    }
  },
  "resources": {
    "gpus": null
  },
  "sharing": {
    "timeSlicing": {}
  }
}
gpu-feature-discovery: 2022/06/15 06:25:03 Start running
gpu-feature-discovery: 2022/06/15 06:25:03 Warning: Error removing output file: failed to remove output file: remove /etc/kubernetes/node-feature-discovery/features.d/gfd: no such file or directory
gpu-feature-discovery: 2022/06/15 06:25:03 Exiting
gpu-feature-discovery: 2022/06/15 06:25:03 Error: error creating NVML labeler: error creating resource labeler: failed to construct MIG resource labeler: failed to create labeler for mig-strategy=single: failed to check for empty MIG-enabled devices: NVML error: Not Found
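The final error line above wraps several causes into one long string, which makes the root cause (`NVML error: Not Found`) easy to miss. As a rough sketch, a small filter like the one below could split that chain into one cause per line. The function name `gfd_error_chain` is made up for illustration, and it assumes the log format shown in this thread, where the fatal line contains ` Error: ` and causes are joined by `: `.

```shell
#!/bin/sh
# Hypothetical helper (not part of gpu-feature-discovery): reads GFD logs on
# stdin and prints each wrapped cause of the fatal "Error:" line on its own line.
# Assumes the log format seen in this thread.
gfd_error_chain() {
  awk -F'Error: ' '/ Error: /{
    # $2 is everything after "Error: "; causes are separated by ": "
    n = split($2, parts, ": ")
    for (i = 1; i <= n; i++) print parts[i]
  }'
}

# Example use against the crashing pod from this thread (needs cluster access):
#   kubectl logs -n kube-system nvgfd-gpu-feature-discovery-jxsk8 --previous | gfd_error_chain
```

The `Warning: Error removing output file` line is skipped because it does not contain ` Error: `. Note that the innermost cause, `NVML error: Not Found`, is also split on `: ` and so ends up on two lines.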
We’ve found some bugs in the 0.6.0 release that we are currently working to fix, but this error is slightly different. Can you try installing the 0.5.0 version of GFD and reporting if you see similar errors?
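For anyone following along, trying another chart version should only require changing `--version` in the helm command posted earlier in this thread. A minimal sketch, assuming the `nvgfd` repo still serves the requested version and the release name and namespace are unchanged; the helper `gfd_helm_cmd` is hypothetical and only prints the command so it can be reviewed before running:

```shell
#!/bin/sh
# Hypothetical helper: prints the helm command from this thread with the
# chart version swapped in. It does not run helm itself.
gfd_helm_cmd() {
  version="${1:?usage: gfd_helm_cmd <chart-version>}"
  printf 'helm upgrade -i nvgfd --version=%s --set migStrategy=single nvgfd/gpu-feature-discovery --namespace kube-system\n' "$version"
}

# Print the command for the 0.5.0 chart
gfd_helm_cmd 0.5.0

# To actually apply it (needs helm and cluster access):
#   eval "$(gfd_helm_cmd 0.5.0)"
```

Printing first and applying with `eval` afterwards is just a precaution against a mistyped version on a live cluster.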
Yes, it works with version 0.5.0. Thanks for your support.
We have published an update to GFD as 0.6.1. Please let us know if it resolves your issue.
@chi2liu were you able to confirm that this resolved your issue?
@chi2liu I am closing this issue. If the latest release still shows this behaviour, please reopen.
I deployed gpu-feature-discovery with helm, but the pod logs show the following error:
gpu-feature-discovery: 2022/06/15 03:25:50 Start running
gpu-feature-discovery: 2022/06/15 03:25:50 Warning: Error removing output file: failed to remove output file: remove /etc/kubernetes/node-feature-discovery/features.d/gfd: no such file or directory
gpu-feature-discovery: 2022/06/15 03:25:50 Exiting
gpu-feature-discovery: 2022/06/15 03:25:50 Error: error creating NVML labeler: error creating resource labeler: failed to construct MIG resource labeler: failed to create labeler for mig-strategy=single: failed to check for empty MIG-enabled devices: NVML error: Not Found