Open wangzheyuan opened 3 months ago
It seems your NVIDIA driver may not be installed correctly. You can try installing nvidia-device-plugin v0.14 and see whether it launches correctly.
The NVIDIA GPU Operator works fine, but nvidia-device-plugin v0.14.5 fails with the same error:
I0822 08:51:42.921468 1 main.go:154] Starting FS watcher.
I0822 08:51:42.921503 1 main.go:161] Starting OS watcher.
I0822 08:51:42.921566 1 main.go:176] Starting Plugins.
I0822 08:51:42.921574 1 main.go:234] Loading configuration.
I0822 08:51:42.921623 1 main.go:242] Updating config with default resource matching patterns.
I0822 08:51:42.921704 1 main.go:253]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0822 08:51:42.921708 1 main.go:256] Retreiving plugins.
I0822 08:51:42.921955 1 factory.go:107] Detected NVML platform: found NVML library
I0822 08:51:42.921968 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0822 08:51:42.925620 1 factory.go:77] Failed to initialize NVML: ERROR_UNKNOWN.
E0822 08:51:42.925629 1 factory.go:78] If this is a GPU node, did you set the docker default runtime to `nvidia`?
E0822 08:51:42.925630 1 factory.go:79] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0822 08:51:42.925632 1 factory.go:80] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0822 08:51:42.925634 1 factory.go:81] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0822 08:51:42.925723 1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: nvml init failed: ERROR_UNKNOWN
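For anyone triaging a similar log, here is a minimal Python sketch that pulls out only the error-level lines from klog-style output like the one above. The function names `parse_klog_line` and `error_lines` are hypothetical helpers written for this illustration and are not part of the device plugin or HAMi. The regular expression assumes the line layout seen in this log (severity letter, month and day, time with microseconds, thread ID, `file:line]`, message) and may need adjusting for other output.

```python
import re

# Matches lines shaped like:
#   E0822 08:51:42.925620 1 factory.go:77] Failed to initialize NVML: ERROR_UNKNOWN.
# Assumption: the layout is inferred from the log pasted in this thread.
KLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[^\s:]+:\d+)\] ?(?P<message>.*)$"
)


def parse_klog_line(line):
    """Return a dict of fields for a klog-style line, or None if it does not match."""
    match = KLOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def error_lines(log_text):
    """Return parsed records for lines logged at error (E) or fatal (F) severity."""
    records = []
    for line in log_text.splitlines():
        record = parse_klog_line(line)
        if record is not None and record["level"] in ("E", "F"):
            records.append(record)
    return records


# Two lines copied from the log above: one info line, one error line.
sample = (
    "I0822 08:51:42.921955 1 factory.go:107] Detected NVML platform: found NVML library\n"
    "E0822 08:51:42.925620 1 factory.go:77] Failed to initialize NVML: ERROR_UNKNOWN.\n"
)

for record in error_lines(sample):
    print(record["source"], record["message"])
```

Lines that do not match the pattern, such as the pretty-printed JSON config, are skipped instead of raising an error, so the whole pod log can be passed in as-is.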
You can check the toolkit pod's log.
Do you mean the NVIDIA Container Toolkit?
If I install HAMi without privileged=true in daemonsetnvidia.yaml, the device-plugin pod goes into CrashLoopBackOff. Here is the device-plugin's log:
If I install HAMi with privileged=true in daemonsetnvidia.yaml, the device-plugin works well. However, containers that request a vGPU encounter the following error:
Here is the vgpu-scheduler-extender's log:
Ubuntu: 22.04.4
Kubernetes: RKE2 1.28.12
Containerd: v1.7.17-k3s1
NVIDIA Container Toolkit: 1.15.0