xiaoxiaoboyyds opened this issue 5 months ago
First, the value of default_runtime_name in the containerd config should be nvidia. After setting it, follow the documentation to enable GPU support in Kubernetes; it takes just one command:
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml
Remember to restart containerd and kubelet.
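For reference, the relevant part of /etc/containerd/config.toml should end up looking roughly like this (a sketch assuming containerd config version 2 and the default nvidia-container-runtime install path; adjust for your system):
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
Then restart both services:
$ sudo systemctl restart containerd
$ sudo systemctl restart kubelet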
Same error with a Tesla card running a Rancher deployment:
root@nvidia-device-plugin-daemonset-nvbr7:/# nvidia-device-plugin
2024/07/21 15:56:05 Starting FS watcher.
2024/07/21 15:56:05 Starting OS watcher.
2024/07/21 15:56:05 Starting Plugins.
2024/07/21 15:56:05 Loading configuration.
2024/07/21 15:56:05 Updating config with default resource matching patterns.
2024/07/21 15:56:05
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2024/07/21 15:56:05 Retreiving plugins.
2024/07/21 15:56:05 Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
2024/07/21 15:56:05 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2024/07/21 15:56:05 Incompatible platform detected
2024/07/21 15:56:05 If this is a GPU node, did you configure the NVIDIA Container Toolkit?
2024/07/21 15:56:05 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2024/07/21 15:56:05 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2024/07/21 15:56:05 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
2024/07/21 15:56:05 No devices found. Waiting indefinitely.
nvidia-smi works as expected
I am using the above containerd toml file with the default_runtime_name set to nvidia
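(The NVML error above usually means libnvidia-ml.so.1 was never injected into the plugin container, i.e. the pod was started with plain runc rather than the nvidia runtime. A quick check is to exec into the plugin pod; the pod name and namespace below are just the ones from the log above and from the static deployment, so adjust as needed:
$ kubectl -n kube-system exec nvidia-device-plugin-daemonset-nvbr7 -- nvidia-smi
$ kubectl -n kube-system exec nvidia-device-plugin-daemonset-nvbr7 -- sh -c 'ldconfig -p | grep libnvidia-ml'
If both fail while nvidia-smi works on the host, containerd is not launching the pod with the nvidia runtime.)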
EDIT:
After reviewing more documentation and fixing some issues in my config file, I get a different error:
}
I0721 17:00:46.190858 46 main.go:317] Retrieving plugins.
E0721 17:00:46.191062 46 factory.go:87] Incompatible strategy detected auto
E0721 17:00:46.191076 46 factory.go:88] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0721 17:00:46.191084 46 factory.go:89] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0721 17:00:46.191093 46 factory.go:90] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0721 17:00:46.191101 46 factory.go:91] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0721 17:00:46.191112 46 main.go:346] No devices found. Waiting indefinitely.
May I ask whether this problem has been solved?
Can someone share any findings on this issue? I spent the entire last weekend trying to get this working, but can't seem to make it work.
I patched the nvdp-nvidia-device-plugin daemonset with the following command:
kubectl -n nvidia-device-plugin patch ds nvdp-nvidia-device-plugin \
--type='json' \
-p='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--device-discovery-strategy=tegra"]}]'
This is equivalent to manually setting the device discovery strategy to tegra. My GPU is a 4090.
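After the daemonset pods restart, you can confirm that the node now advertises the GPU resource (the node name below is a placeholder):
$ kubectl describe node <your-node-name> | grep -i 'nvidia.com/gpu'
nvidia.com/gpu should show up under both Capacity and Allocatable.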
Thanks @MasonXon! It worked.
I had to set default_runtime_name to nvidia, like @ZYWNB666 recommended. nvidia-ctk runtime configure --runtime=containerd added all the runtime configs for the ctk, but did not change that line.
After manually editing /etc/containerd/config.toml, restarting containerd via systemctl, and restarting the daemonset pod, it worked!
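After that, a minimal end-to-end check is a test pod along these lines (pod name and image tag are only examples; any CUDA base image works):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
If the pod completes and kubectl logs gpu-test shows the usual nvidia-smi table, the whole chain (driver, container toolkit, containerd runtime, device plugin) is in place.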
Adding --set-as-default would have given you what you want:
nvidia-ctk runtime configure --runtime=containerd --set-as-default
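For anyone following along, the full sequence is then roughly (daemonset name and namespace are the ones used earlier in this thread; adjust to your install):
$ sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
$ sudo systemctl restart containerd
$ kubectl -n nvidia-device-plugin rollout restart ds nvdp-nvidia-device-plugin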
1. Quick Debug Information
2. Issue or feature description
Why doesn't my Kubernetes node recognize the GPU after successfully installing my drivers and containerd?
(The original issue attached the contents of /etc/containerd/config.toml and the nvidia-smi output as screenshots here.)