Open Dragoncell opened 9 months ago
@Dragoncell I am not too familiar with COS. Is there anything specific to the OS that means that the toolkit-container
(with some modification) cannot be used to install the components of the NVIDIA Container Toolkit? What it essentially does is:
One thing that comes to mind is that the COS filesystem is read-only (https://cloud.google.com/container-optimized-os/docs/concepts/disks-and-filesystem), so only certain paths are mounted with exec. So I think the path on the host should be configurable, i.e. we can't use the default of /usr/bin since it's read-only.
The install path for NVIDIA Container Toolkit is already configurable today with the operator: https://github.com/NVIDIA/gpu-operator/blob/v23.9.1/api/v1/clusterpolicy_types.go#L669-675. Here is the setting in the helm chart: https://github.com/NVIDIA/gpu-operator/blob/v23.9.1/deployments/gpu-operator/values.yaml#L229
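For reference, a minimal sketch of overriding the install location at deploy time, assuming the toolkit.installDir value from the values.yaml linked above (the release name and chart reference here are illustrative):
helm upgrade -i gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set toolkit.installDir=/home/kubernetes/bin/nvidia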
Thanks for the pointer to the config.
From my understanding, to use the toolkit in the GPU Operator in our case, we require two changes. Below is my test setup for trying it out, based on https://github.com/NVIDIA/gpu-operator/tree/release-23.9.
Changes in assets/state-container-toolkit/0500_daemonset.yaml: a) disabled the driver-validation until we support it, and b) updated the driver install path and the env:
env:
  - name: NVIDIA_DRIVER_ROOT
    value: "/home/kubernetes/bin/nvidia"
  - name: DRIVER_ROOT
    value: "/home/kubernetes/bin/nvidia"
  - name: DRIVER_ROOT_CTR_PATH
    value: "/home/kubernetes/bin/nvidia"
volumes:
  - name: driver-install-path
    hostPath:
      path: /home/kubernetes/bin/nvidia
Update: it seems the env does not necessarily need to be set in the toolkit's daemonset; in the configmap, it can be updated to:
driver_root=/home/kubernetes/bin/nvidia
changes in values.yaml:
installDir: "/home/kubernetes/bin/nvidia"
a) It installs the binaries in the correct path:
/home/kubernetes/bin/nvidia/toolkit$ ls
libnvidia-container-go.so.1 libnvidia-container.so.1.14.2 nvidia-container-runtime nvidia-container-runtime.cdi nvidia-container-runtime.legacy.real nvidia-ctk
libnvidia-container-go.so.1.14.2 nvidia-container-cli nvidia-container-runtime-hook nvidia-container-runtime.cdi.real nvidia-container-runtime.real nvidia-ctk.real
libnvidia-container.so.1 nvidia-container-cli.real nvidia-container-runtime-hook.real nvidia-container-runtime.legacy nvidia-container-toolkit
b) The pod is failing in a crash loop. Pod status:
nvidia-container-toolkit-daemonset-nlxdv 0/1 CrashLoopBackOff 4 (72s ago) 3m10s
pod logs:
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
[nvidia-container-cli]
environment = []
ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real"
load-kmods = true
path = "/home/kubernetes/bin/nvidia/toolkit/nvidia-container-cli"
root = "/home/kubernetes/bin/nvidia"
[nvidia-container-runtime]
log-level = "info"
mode = "cdi"
runtimes = ["docker-runc", "runc"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["nvidia.cdi.k8s.io/"]
default-kind = "management.nvidia.com/gpu"
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime-hook]
path = "/home/kubernetes/bin/nvidia/toolkit/nvidia-container-runtime-hook"
skip-mode-detection = true
[nvidia-ctk]
path = "/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk"
time="2024-02-06T06:24:07Z" level=info msg="Creating control device nodes at /home/kubernetes/bin/nvidia"
time="2024-02-06T06:24:07Z" level=fatal msg="error: failed to create control device nodes: failed to create device node nvidiactl: no such file or directory"
time="2024-02-06T06:24:07Z" level=info msg="Shutting Down"
time="2024-02-06T06:24:07Z" level=error msg="error running nvidia-toolkit: unable to install toolkit: error running [toolkit install --toolkit-root /home/kubernetes/bin/nvidia/toolkit] command: exit status 1"
I'm wondering why it failed to find a file or directory under /home/kubernetes/bin/nvidia?
Next steps:
a) Support both NVIDIA_DRIVER_ROOT and DRIVER_ROOT env in the toolkit daemonset, similar to installDir, e.g. via a parameter called InstallDriverRoot in values.yaml, so that the operator code can update the daemonset's env accordingly.
b) What is the recommended way to update the driver root in the toolkit? Does using DRIVER_ROOT, NVIDIA_DRIVER_ROOT, or DRIVER_ROOT_CTR_PATH in the env look right to you for passing the custom path to the NVIDIA toolkit, or should it go through the configmap?
Does the above make sense? Let me know your thoughts, thanks.
@Dragoncell the operands of the GPU Operator have some logic included to automatically detect the location where the driver is available (the driver root).
For example, in the case of the container toolkit we construct an entrypoint.sh here: https://github.com/NVIDIA/gpu-operator/blob/5f36d3600da50e6a0239996a7b12f677eb66a671/assets/state-container-toolkit/0400_configmap.yaml#L10-L22 that checks the output of the driver validator and sets the NVIDIA_DRIVER_ROOT envvar.
This logic would have to be updated to allow for a custom driver root to be specified. There is a merge request outstanding (see https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/960) where this has been proposed.
One of the issues that has prevented us from merging the MR as-is is that we currently assume that a driver root (e.g. the path defined by NVIDIA_DRIVER_ROOT) is a full filesystem that can be chrooted to. The fact that your /home/kubernetes/bin/nvidia location is not such a root is the reason for the errors you are seeing when creating the control device nodes in the NVIDIA Container Toolkit (I have created https://github.com/NVIDIA/nvidia-container-toolkit/issues/344 to at least provide an option to disable this). We have started separating the notion of a driver root (in the context of libraries) and a device root for the creation of device nodes.
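As a rough illustration of the difference (paths taken from this thread; exact contents will vary):
# a driver-container root is typically a full filesystem layout that can be chroot-ed into
ls /run/nvidia/driver            # bin/ dev/ etc/ lib64/ sbin/ usr/ ...
# the COS driver install location is a flat directory of driver artifacts, not a chroot-able root
ls /home/kubernetes/bin/nvidia   # bin/ drivers/ firmware/ lib64/ toolkit/ vulkan/ ...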
With regards to your next steps:
a) Support both NVIDIA_DRIVER_ROOT and DRIVER_ROOT env in toolkit daemonset similar to the installDir. e.g a parameter called InstallDriverRoot in values.yaml and the operator code can update the daemonset's env accordingly
I think that this in general makes sense. A user should be able to specify both where their driver is rooted and where their device nodes are rooted. If these are not specified, we will revert to logic to autodetect them as we currently do.
Maybe extending the operator values.yaml as follows:
diff --git a/deployments/gpu-operator/values.yaml b/deployments/gpu-operator/values.yaml
index 359b73c2..957ac1fb 100644
--- a/deployments/gpu-operator/values.yaml
+++ b/deployments/gpu-operator/values.yaml
@@ -123,6 +123,15 @@ mig:
strategy: single
driver:
+ # libraryRoot specifies the root at which the driver libraries are available.
+ libraryRoot: "auto"
+ # deviceRoot specifies the root at which NVIDIA device nodes are available.
+ # If this is unspecified or empty, the value of the libraryRoot will be used.
+ # Note that if driver.libraryRoot is set to auto, the resolved value is used.
+ # For a value of 'auto', the device root is detected. Here, if the resolved
+ # libraryRoot is a full filesystem such as '/' or '/run/nvidia/driver' when
+ # managed by the driver container this path will be used.
+ deviceRoot: ""
enabled: true
nvidiaDriverCRD:
enabled: false
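If something along these lines were adopted, a user in this situation could then set the values at install time. A hypothetical example (these values do not exist in a released chart yet):
helm upgrade -i noperator deployments/gpu-operator \
  --set driver.libraryRoot=/home/kubernetes/bin/nvidia \
  --set driver.deviceRoot=/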
With regards to:
b) What's the recommend way for updating the driver root in toolkit ? and is using the DRIVER_ROOT, NVIDIA_DRIVER_ROOT or DRIVER_ROOT_CTR_PATH in env looks right to you to pass the custom path to the nvidia toolkit or through the configmap ?
The short answer is NVIDIA_DRIVER_ROOT. In general, DRIVER_ROOT and NVIDIA_DRIVER_ROOT can be considered aliases of each other, although this is not consistently applied in the container toolkit. In the context of the usage in the GPU Operator, only NVIDIA_DRIVER_ROOT is considered (as defined here: https://github.com/NVIDIA/nvidia-container-toolkit/blob/15d905def056f37da6fa67be25b363095cdab79a/tools/container/toolkit/toolkit.go#L124); I have also created #343 to track making this consistent.
Note that DRIVER_ROOT_CTR_PATH is the path at which the driver root is mounted into the container where the toolkit is running. This is important in the case of containerized CDI spec generation, since the generated spec will locate libraries relative to this path and these need to be transformed to host paths. In the case of our other components, such as the Device Plugin, we generally hardcode this path to /driver-root in the container to make it easier to reason about.
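To spell out how these two relate (a sketch using the values from this thread; only a convention, not operator-generated config):
# host location of the driver files; generated CDI specs must resolve to paths under this root
NVIDIA_DRIVER_ROOT=/home/kubernetes/bin/nvidia
# where that same directory is mounted inside the container doing the discovery
# (the toolkit daemonset above mounts it at the identical path; the device plugin typically uses /driver-root)
DRIVER_ROOT_CTR_PATH=/home/kubernetes/bin/nvidia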
Thanks for the above changes. I cherry-picked the changes and tested them out using the GPU Operator with a modified env for the container toolkit:
- name: CREATE_DEVICE_NODES
  value: ""
I saw that the create-device-nodes error is gone, but encountered a new error like the one below:
Using config:
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
[nvidia-container-cli]
environment = []
ldconfig = "@/home/kubernetes/bin/nvidia/sbin/ldconfig"
load-kmods = true
path = "/home/kubernetes/bin/nvidia/toolkit/nvidia-container-cli"
root = "/home/kubernetes/bin/nvidia"
[nvidia-container-runtime]
log-level = "info"
mode = "cdi"
runtimes = ["docker-runc", "runc", "crun"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["nvidia.cdi.k8s.io/"]
time="2024-02-13T23:34:47Z" level=info msg="Generating CDI spec for management containers"
default-kind = "management.nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime-hook]
path = "/home/kubernetes/bin/nvidia/toolkit/nvidia-container-runtime-hook"
skip-mode-detection = true
[nvidia-ctk]
path = "/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk"
time="2024-02-13T23:34:47Z" level=warning msg="Could not locate /dev/nvidia*: pattern /dev/nvidia* not found"
time="2024-02-13T23:34:47Z" level=warning msg="Could not locate /dev/nvidia-caps/nvidia-cap*: pattern /dev/nvidia-caps/nvidia-cap* not found"
time="2024-02-13T23:34:47Z" level=warning msg="Could not locate /dev/nvidia-modeset: pattern /dev/nvidia-modeset not found"
time="2024-02-13T23:34:47Z" level=warning msg="Could not locate /dev/nvidia-uvm-tools: pattern /dev/nvidia-uvm-tools not found"
time="2024-02-13T23:34:47Z" level=warning msg="Could not locate /dev/nvidia-uvm: pattern /dev/nvidia-uvm not found"
time="2024-02-13T23:34:47Z" level=warning msg="Could not locate /dev/nvidiactl: pattern /dev/nvidiactl not found"
time="2024-02-13T23:34:47Z" level=warning msg="Could not locate /dev/nvidia*: pattern /dev/nvidia* not found"
time="2024-02-13T23:34:47Z" level=warning msg="Could not locate /dev/nvidia-caps/nvidia-cap*: pattern /dev/nvidia-caps/nvidia-cap* not found"
time="2024-02-13T23:34:47Z" level=warning msg="Could not locate /dev/nvidia-modeset: pattern /dev/nvidia-modeset not found"
time="2024-02-13T23:34:47Z" level=warning msg="Could not locate /dev/nvidia-uvm-tools: pattern /dev/nvidia-uvm-tools not found"
time="2024-02-13T23:34:47Z" level=warning msg="Could not locate /dev/nvidia-uvm: pattern /dev/nvidia-uvm not found"
time="2024-02-13T23:34:47Z" level=warning msg="Could not locate /dev/nvidiactl: pattern /dev/nvidiactl not found"
time="2024-02-13T23:34:47Z" level=fatal msg="error: error generating CDI specification: failed to genereate CDI spec for management containers: no NVIDIA device nodes found"
time="2024-02-13T23:34:47Z" level=info msg="Shutting Down"
time="2024-02-13T23:34:47Z" level=error msg="error running nvidia-toolkit: unable to install toolkit: error running [toolkit install --toolkit-root /home/kubernetes/bin/nvidia/toolkit] command: exit status 1"
From the container toolkit's assumptions, I guess it expects /dev to be under the driverRoot directory. However, in our case:
Under the host root, /dev (ls | grep nvidia) shows:
nvidia-caps
nvidia-modeset
nvidia-uvm
nvidia-uvm-tools
nvidia0
nvidiactl
Under the host driver root /home/kubernetes/bin/nvidia:
NVIDIA-Linux-x86_64-535.104.12.run bin bin-workdir drivers drivers-workdir firmware lib64 lib64-workdir nvidia-drivers-535.104.12.tgz nvidia-installer.log share toolkit vulkan
I also tried specifying the root as "/", and it failed with another error:
Using config:
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
[nvidia-container-cli]
environment = []
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
path = "/home/kubernetes/bin/nvidia/toolkit/nvidia-container-cli"
root = "/"
[nvidia-container-runtime]
log-level = "info"
mode = "cdi"
runtimes = ["docker-runc", "runc", "crun"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["nvidia.cdi.k8s.io/"]
default-kind = "management.nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime-hook]
path = "/home/kubernetes/bin/nvidia/toolkit/nvidia-container-runtime-hook"
skip-mode-detection = true
[nvidia-ctk]
path = "/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk"
time="2024-02-13T23:51:46Z" level=info msg="Generating CDI spec for management containers"
time="2024-02-13T23:51:46Z" level=info msg="Selecting /host/dev/nvidia-modeset as /dev/nvidia-modeset"
time="2024-02-13T23:51:46Z" level=info msg="Selecting /host/dev/nvidia-uvm as /dev/nvidia-uvm"
time="2024-02-13T23:51:46Z" level=info msg="Selecting /host/dev/nvidia-uvm-tools as /dev/nvidia-uvm-tools"
time="2024-02-13T23:51:46Z" level=info msg="Selecting /host/dev/nvidia0 as /dev/nvidia0"
time="2024-02-13T23:51:46Z" level=info msg="Selecting /host/dev/nvidiactl as /dev/nvidiactl"
time="2024-02-13T23:51:46Z" level=info msg="Selecting /host/dev/nvidia-caps/nvidia-cap1 as /dev/nvidia-caps/nvidia-cap1"
time="2024-02-13T23:51:46Z" level=info msg="Selecting /host/dev/nvidia-caps/nvidia-cap2 as /dev/nvidia-caps/nvidia-cap2"
time="2024-02-13T23:51:46Z" level=fatal msg="error: error generating CDI specification: failed to genereate CDI spec for management containers: failed to get CUDA version: failed to locate libcuda.so: pattern libcuda.so.*.* not found\n64-bit library libcuda.so.*.*: not found"
time="2024-02-13T23:51:46Z" level=info msg="Shutting Down"
time="2024-02-13T23:51:46Z" level=error msg="error running nvidia-toolkit: unable to install toolkit: error running [toolkit install --toolkit-root /home/kubernetes/bin/nvidia/toolkit] command: exit status 1"
Next step:
As you mentioned, "We have started separating the notion of a driver root (in the context of libraries) and a device root for the creation of device nodes."
I'm wondering what the progress is there. I searched the code, and it seems there are variables called librarySearchPaths and devRoot; will they help in this case?
Thanks for reporting this. The issue here is that the definition of what a "driverRoot" is, from the perspective of the toolkit container, is quite rigid. It currently means that this folder is a chroot-able filesystem and that BOTH the libraries AND the device nodes are rooted there.
I have created #360 to add the option to specify these separately. Since you're using /home/kubernetes/bin/nvidia for both NVIDIA_DRIVER_ROOT and DRIVER_ROOT_CTR_PATH, that should still work as expected; it should only be required to mount / to /host in the container (this should already be done by the operator) and to set NVIDIA_DEV_ROOT and DEV_ROOT_CTR_PATH accordingly.
Thanks for the proposed dev-root option. I cherry-picked the commit on top of the change disabling device-node creation, and tested it out with the envs below:
export NVIDIA_DRIVER_ROOT=/home/kubernetes/bin/nvidia
export DRIVER_ROOT_CTR_PATH=/home/kubernetes/bin/nvidia
export NVIDIA_DEV_ROOT=/
export DEV_ROOT_CTR_PATH=/host
and it works for the container toolkit; the log looks good too:
$ kubectl get pods -n gpu-operator
nvidia-container-toolkit-daemonset-dlnb2 1/1 Running 0 8m17s
Then I manually ssh'd into the node and executed the following for the NVIDIA device plugin:
sudo touch /run/nvidia/validations/toolkit-ready
sudo touch /run/nvidia/validations/host-driver-ready
With the below env in the configmap of the device plugin:
driver_root=/home/kubernetes/bin/nvidia
container_driver_root=$driver_root
export NVIDIA_DRIVER_ROOT=$driver_root
export CONTAINER_DRIVER_ROOT=$container_driver_root
export NVIDIA_CTK_PATH=/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk
export PATH="$PATH:/home/kubernetes/bin/nvidia/bin";
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/kubernetes/bin/nvidia/lib64;
I saw the pod fail due to the error below:
time="2024-02-14T20:58:06Z" level=info msg="Generating CDI spec for resource: k8s.device-plugin.nvidia.com/gpu"
time="2024-02-14T20:58:06Z" level=warning msg="Could not locate /dev/nvidia0: pattern /dev/nvidia0 not found"
time="2024-02-14T20:58:06Z" level=warning msg="Could not locate /dev/nvidia0: pattern /dev/nvidia0 not found"
time="2024-02-14T20:58:06Z" level=warning msg="Could not locate /dev/nvidia0: pattern /dev/nvidia0 not found"
E0214 20:58:06.031396 1 main.go:123] error starting plugins: error creating plugin manager: unable to create cdi spec file: failed to get CDI spec: failed to create discoverer for common entities: error constructing discoverer for graphics mounts: failed to construct library locator: error loading ldcache: open /home/kubernetes/bin/nvidia/etc/ld.so.cache: no such file or directory
Then I also tried the below env for the device plugin:
driver_root=/
container_driver_root=/host
export NVIDIA_DRIVER_ROOT=$driver_root
export CONTAINER_DRIVER_ROOT=$container_driver_root
export NVIDIA_CTK_PATH=/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk
export PATH="$PATH:/home/kubernetes/bin/nvidia/bin";
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/kubernetes/bin/nvidia/lib64;
It seems like the device plugin is in a working state:
$ kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-ktg68 1/1 Running 0 2m48s
gpu-operator-f58c4c94-lv9lk 1/1 Running 0 3m10s
noperator-node-feature-discovery-master-79487579c6-gxgxn 1/1 Running 0 3m10s
noperator-node-feature-discovery-worker-sxwfz 1/1 Running 0 3m10s
nvidia-container-toolkit-daemonset-mp7bd 1/1 Running 0 2m49s
nvidia-dcgm-exporter-8k8sx 1/1 Running 0 2m48s
nvidia-device-plugin-daemonset-vspsp 1/1 Running 0 2m48s
and the log looks good too:
I0214 21:16:21.077283 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0214 21:16:21.078023 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0214 21:16:21.081576 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
Then I deployed a GPU workload to try it out:
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1
and the pod is running fine too:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
my-gpu-pod 1/1 Running 0 48s
Questions:
a) Setting the driver root to / seems to work for the device plugin; however, it differs from the value set in the container toolkit. To set the driver root to /home/kubernetes/bin/nvidia instead, I didn't find a similar device-root setting in the k8s-device-plugin repo: https://github.com/NVIDIA/k8s-device-plugin/blob/31f01c2e0c291443c1ddbefc8cdba55768c11275/cmd/nvidia-device-plugin/main.go#L68. In this case, what is the recommended setup for the device plugin? Thanks.
b) I started the GPU Operator with the CDI config as below. Besides the GPU pod running fine, what other signals or logs can we look at to verify that it is indeed using the CDI spec through the device plugin as we wanted? Thanks.
helm upgrade -i --create-namespace --namespace gpu-operator noperator deployments/gpu-operator --set driver.enabled=false --set cdi.enabled=true --set cdi.default=true --set operator.runtimeClass=nvidia-cdi
@Dragoncell thanks for the update. I will have to dig a bit further into what is happening here. What I assume is happening is that the device plugin is being started as a management container, and since the nvidia-cdi runtime is being used, the driver files and devices are being mounted as expected into the device plugin container. This means that the device detection is working as expected, but it may mean that the generated CDI specs for the devices are not as they should be.
Would you be able to confirm that running nvidia-smi in the workload container shows the expected results (since you seem to just be running a sleep)? You could also check the generated CDI specs at /var/run/cdi/ on the host.
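For example (a minimal sketch, assuming SSH access to the node and the sleep-based my-gpu-pod from above):
kubectl exec my-gpu-pod -- nvidia-smi   # should list the allocated GPU if the injected driver files are usable
ls /var/run/cdi/                        # on the host: the CDI spec files generated by the toolkit and device plugin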
In general, I think the device plugin is going to need a similar change to properly handle the "split" driver and device root. I may have some time to look into it tomorrow, but I would assume that the specs need to be transformed in some way.
/cc @cdesiniotis
@elezar Thanks for the suggestion.
With the current working configuration of the device plugin:
driver_root=/
container_driver_root=/host
export NVIDIA_DRIVER_ROOT=$driver_root
export CONTAINER_DRIVER_ROOT=$container_driver_root
export NVIDIA_CTK_PATH=/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk
export PATH="$PATH:/home/kubernetes/bin/nvidia/bin";
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/kubernetes/bin/nvidia/lib64;
a) I tried running nvidia-smi in a workload container as below. Simply running nvidia-smi without a GPU request works as expected:
kubectl run nvidia-smi --restart=Never --rm -i --tty --image nvidia/cuda:11.0.3-base-ubuntu20.04 -- nvidia-smi
Tue Feb 20 21:55:12 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |
| N/A 36C P8 17W / 72W | 4MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
pod "nvidia-smi" deleted
However, if I run the below pod without the PATH and LD_LIBRARY_PATH exports, it fails with an error like: Warning Failed 11s (x2 over 12s) kubelet Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["bash", "-c"]
    args:
    - |-
      # export PATH="$PATH:/home/kubernetes/bin/nvidia/bin";
      # export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/kubernetes/bin/nvidia/lib64;
      nvidia-smi;
    resources:
      limits:
        nvidia.com/gpu: "1"
I looked at the OCI spec of the container; the PATH looks like PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
b) On the node, I did see two files present:
/var/run/cdi $ ls
k8s.device-plugin.nvidia.com-gpu.json management.nvidia.com-gpu.yaml
The config looks good to me:
{"cdiVersion":"0.5.0","kind":"k8s.device-plugin.nvidia.com/gpu","devices":[{"name":"GPU-0b182573-6996-0f5d-ad7d-96241c70d91c","containerEdits":{"deviceNodes":[{"path":"/dev/nvidia0","hostPath":"/dev/nvidia0"}]}}],"containerEdits":{"deviceNodes":[{"path":"/dev/nvidia-modeset","hostPath":"/dev/nvidia-modeset"},{"path":"/dev/nvidia-uvm-tools","hostPath":"/dev/nvidia-uvm-tools"},{"path":"/dev/nvidia-uvm","hostPath":"/dev/nvidia-uvm"},{"path":"/dev/nvidiactl","hostPath":"/dev/nvidiactl"}],"hooks":[{"hookName":"createContainer","path":"/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk","args":["nvidia-ctk","hook","update-ldcache","--folder","/home/kubernetes/bin/nvidia/lib64"]}],"mounts":[{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-egl-gbm.so.1.1.0","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-egl-gbm.so.1.1.0","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libcuda.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libcuda.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-fbc.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-fbc.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-gtk3.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-gtk3.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-opencl.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-opencl.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libGLESv2_nvidia.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libGLESv2_nvidia.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-opticalflow.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-opticalflow.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-pkcs11-openssl3.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-pkcs11-openssl3.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-rtcore.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-rtcore.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-encode.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-encode.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-ml.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-ml.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-wayland-client.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-wayland-client.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libEGL_nvidia.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libEGL_nvidia.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-ngx.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-ngx.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kub
ernetes/bin/nvidia/lib64/libnvidia-ptxjitcompiler.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-ptxjitcompiler.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libGLESv1_CM_nvidia.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libGLESv1_CM_nvidia.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvcuvid.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvcuvid.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-glsi.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-glsi.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-pkcs11.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-pkcs11.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvoptix.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvoptix.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-glvkspirv.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-glvkspirv.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-nvvm.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-nvvm.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-tls.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-tls.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-vulkan-producer.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-vulkan-producer.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libGLX_nvidia.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libGLX_nvidia.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libcudadebugger.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libcudadebugger.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-allocator.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-allocator.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-eglcore.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-eglcore.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-gtk2.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-gtk2.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-cfg.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-cfg.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-glcore.so.535.104.12","containerPath":"/home/kubernetes/bin/nvidia/lib64/libnvidia-glcore.so.535.104.12","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/bin/nvidia-persistenced","containerPath":"/home/kubernetes/bin/nvidia/
bin/nvidia-persistenced","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/bin/nvidia-cuda-mps-control","containerPath":"/home/kubernetes/bin/nvidia/bin/nvidia-cuda-mps-control","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/bin/nvidia-cuda-mps-server","containerPath":"/home/kubernetes/bin/nvidia/bin/nvidia-cuda-mps-server","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/bin/nvidia-smi","containerPath":"/home/kubernetes/bin/nvidia/bin/nvidia-smi","options":["ro","nosuid","nodev","bind"]},{"hostPath":"/home/kubernetes/bin/nvidia/bin/nvidia-debugdump","containerPath":"/home/kubernetes/bin/nvidia/bin/nvidia-debugdump","options":["ro","nosuid","nodev","bind"]}]}}
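One thing worth noting from the spec above (a sketch, using the sleep-based my-gpu-pod so that exec works): the binaries are mounted at the custom path, so they can be invoked by their full path even though they are not on the default PATH:
kubectl exec my-gpu-pod -- ls /home/kubernetes/bin/nvidia/bin
# works when invoked by full path; the update-ldcache hook in the spec should make lib64 resolvable,
# otherwise LD_LIBRARY_PATH may also need to point at /home/kubernetes/bin/nvidia/lib64
kubectl exec my-gpu-pod -- /home/kubernetes/bin/nvidia/bin/nvidia-smi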
Questions:
a) Is it expected that /home/kubernetes/bin/nvidia/bin is not added to the PATH in the OCI spec of the GPU container?
b) Would adding /home/kubernetes/bin/nvidia/bin to the container's PATH be able to help with this path issue? Thanks
For pods managed by the GPU Operator, after driver installation has finished, they rely on the container toolkit starting on the node to set up the NVIDIA container runtime:
1. Download the NVIDIA container runtime, hooks, and nvidia-ctk, and copy them from the container to the host at /run/nvidia/toolkit. [link]
2. Update the containerd config file based on the container runtime (nvidia or nvidia-cdi).
3. Generate the CDI spec for management containers if the runtime is nvidia-cdi.
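For example, on a node where the toolkit container has run with the default install path, one would expect something like the following (a sketch; the paths differ when installDir is overridden as in this thread, and the containerd config location can differ per distribution):
ls /run/nvidia/toolkit                        # nvidia-container-runtime*, nvidia-container-runtime-hook, nvidia-ctk, libnvidia-container*.so.*
grep -n nvidia /etc/containerd/config.toml    # runtime entries added by the toolkit container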
The toolkit is a necessary component of the GPU Operator. To make it work on COS, we need to:
Support the necessary binaries from the container toolkit on the COS platform. So far, the container runtime, hooks, and nvidia-ctk are not yet supported there (supported platform lists).
Starting from COS 109, nvidia-ctk is pre-built into COS. However, in the current state (intermediate CDI mode), the NVIDIA container runtime (nvidia-cdi) binary is still required. For legacy mode support, the NVIDIA container runtime (nvidia-legacy) and its hooks are also required.
The goal is to achieve the same functionality as the container toolkit on the COS platform, with a custom install path for the driver and the container runtime binaries.