AliyunContainerService / gpushare-device-plugin

GPU Sharing Device Plugin for Kubernetes Cluster
Apache License 2.0

containerd and nvidia-container-runtime instead of nvidia-docker2 #51

Open Frank-17 opened 2 years ago

Frank-17 commented 2 years ago

Any chance to have the device plugin working on containerd without nvidia-docker2?

I have rebuilt my cluster with containerd, and the following packages are installed on my worker nodes: libnvidia-container, nvidia-container-toolkit, nvidia-container-runtime

but the device plugin raises the error:

0425 10:34:29.375414 1 main.go:18] Start gpushare device plugin
I0425 10:34:29.382160 1 gpumanager.go:28] Loading NVML
I0425 10:34:29.382601 1 gpumanager.go:31] Failed to initialize NVML: could not load NVML library.
I0425 10:34:29.382616 1 gpumanager.go:32] If this is a GPU node, did you set the docker default runtime to nvidia?
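The "could not load NVML library" message suggests the driver's NVML shared library is not visible inside the plugin container. A minimal sketch for checking this from inside any container on the node, assuming the library's soname is `libnvidia-ml.so.1` (the usual name on Linux, but an assumption here) and that Python is available in the image; `can_load_library` is a hypothetical helper, not part of the device plugin:

```python
import ctypes

# Assumption: libnvidia-ml.so.1 is the usual soname of the NVML shared library
# on Linux; adjust it if your driver installs the library under another name.
NVML_LIBRARY = "libnvidia-ml.so.1"


def can_load_library(name: str = NVML_LIBRARY) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    if can_load_library():
        print(f"{NVML_LIBRARY} can be loaded in this container")
    else:
        print(f"{NVML_LIBRARY} cannot be loaded; the driver libraries "
              "may not have been injected into this container")
```

If the check fails in a container that should see the GPU, the container was probably not started through the NVIDIA runtime, which would match the hint in the last log line.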

The default runtime has been set to nvidia-container-runtime:

[plugins."io.containerd.runtime.v1.linux"]
  no_shim = false
  runtime = "nvidia-container-runtime"
  runtime_root = ""
  shim = "containerd-shim"
  shim_debug = false

Has anyone found a workaround? Is there any plan to replace nvidia-docker2 with nvidia-container-runtime?

Thanks

vio-f commented 2 years ago

Yes, I followed this guide to get it running on containerd, but I still have issues: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#containerd

has-avila commented 5 months ago

This is a great topic, now that Kubernetes has removed support for Docker as a container runtime. Has anyone found a workaround to implement GPU sharing with containerd without issues? Thanks