edoyon90 opened 2 months ago
Update: After accessing the node through a privileged pod, I was able to run `sudo nvidia-ctk runtime configure --runtime=containerd` and then `sudo systemctl restart containerd`.
After all the pods started up again, the nvidia-device-plugin was able to detect the GPU and the TensorFlow example ran successfully.
This isn't a viable workaround, though, because the config reverts to the default whenever the node is restarted for any reason.
AKS will need to change the containerd runtime configuration on GPU-enabled nodes, or allow users to configure it at the node pool level.
Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure
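For anyone else hitting this, here is roughly what I did, as a sketch. The `kubectl debug` approach and the debug image are just one way to get a privileged shell on the node; the node name is a placeholder:

```sh
# Get a shell on the affected GPU node (node name and image are placeholders).
kubectl debug node/<gpu-node-name> -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0

# Inside the debug pod, switch into the host filesystem.
chroot /host

# Regenerate the containerd config with the NVIDIA runtime and restart containerd
# (the two commands from the update above).
nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd

# Sanity check: the nvidia runtime should now appear in the containerd config.
grep -i nvidia /etc/containerd/config.toml
```

As noted above, this only lasts until the node is restarted or reimaged, at which point the config reverts to the AKS default.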
**Describe the bug**
When following this guide, https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool, the nvidia-device-plugin fails to detect the GPU on the Ubuntu Linux OS. When I `kubectl exec` into the device plugin pod and manually run the startup script, `nvidia-device-plugin`, I get the following error: `NVML: Unknown Error`.
Additionally, the GPU-enabled workload meant to test the GPU nodes does not work in either the UbuntuLinux or the AzureLinux OS SKUs.

**To Reproduce**
Steps to reproduce the behavior:
```
Name:
Roles:      agent
Labels:     accelerator=nvidia
            [...]
Capacity:
  [...]
  nvidia.com/gpu:  1
  [...]
```
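For completeness, this is roughly how the failure shows up from the cluster side. The namespace and label below assume the stock NVIDIA device plugin DaemonSet, and pod names are placeholders; adjust to your deployment:

```sh
# GPU capacity/allocatable as reported per node.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

# Find the device plugin pod on the affected node and check its logs
# (namespace/label assume the stock NVIDIA device plugin DaemonSet).
kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin
kubectl logs -n kube-system <nvidia-device-plugin-pod>

# Re-running the plugin binary inside the pod is what prints "NVML: Unknown Error".
kubectl exec -n kube-system -it <nvidia-device-plugin-pod> -- nvidia-device-plugin
```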