NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Documentation clarification about containerd tweaks #519

Open · aavbsouza opened this issue 1 year ago

aavbsouza commented 1 year ago

Hello everyone, I am reading the documentation, and it appears to be necessary to change the config.toml of the containerd runtime only if I use a host-installed nvidia-toolkit. Is that interpretation correct? The documentation also states that the toolkit option of the GPU Operator uses nvidia-docker2, but I thought that this software was deprecated some time ago.

Thanks 👍

mikemckiernan commented 1 year ago

Thanks @aavbsouza for using the docs. Were you reading in the same section as the following link?

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#step-1-install-containerd

aavbsouza commented 1 year ago

Hello @mikemckiernan , I was looking into these sections: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#bare-metal-passthrough-with-default-configurations-on-red-hat-enterprise-linux

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#bare-metal-passthrough-with-pre-installed-nvidia-drivers

The customization of the containerd runtime only appears when using the host nvidia-toolkit. In the past I was able to install the GPU Operator using both the host NVIDIA driver and toolkit, and I was wondering whether it would be simpler to install using only the GPU Operator, without tinkering with these configurations.

elezar commented 1 year ago

@aavbsouza using the NVIDIA Container Toolkit on the host assumes that it has already been set up for the container engine being used. In the case of Containerd, this implies that the nvidia runtime class has been added to the config.toml for Containerd, with the binary set to the nvidia-container-runtime installed on the host.
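
For reference, a minimal sketch of such a config.toml entry is shown below; the exact plugin section and binary path depend on your Containerd version and install location, so treat it as illustrative rather than authoritative:

```toml
# Sketch: registering the nvidia runtime class with Containerd's CRI plugin.
# The BinaryName path assumes the toolkit was installed under /usr/bin on the host.
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  # Making nvidia the default runtime is optional.
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```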

Furthermore, in the case where the driver container is used, this also assumes that the root option in the NVIDIA Container Toolkit config (usually /etc/nvidia-container-runtime/config.toml) has been set accordingly.
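
A rough sketch of that setting, assuming the default mount point used by the driver container:

```toml
# Sketch of /etc/nvidia-container-runtime/config.toml when the driver container is used.
# With host-installed drivers this would normally remain "/".
[nvidia-container-cli]
root = "/run/nvidia/driver"
```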

When the operator is used to manage the NVIDIA Container Toolkit, the toolkit is installed on the host and the config for Containerd (or CRI-O, or Docker) is updated to register the nvidia runtime class. This does modify Containerd's config.toml on the host, but does not require the user to modify this file. The config.toml file for the toolkit is also updated to refer to the correct root -- depending on whether the driver container or host-installed drivers are used.

With regards to:

The customization of the containerd runtime only appears when using the host nvidia-toolkit. In the past I was able to install the GPU Operator using both the host NVIDIA driver and toolkit, and I was wondering whether it would be simpler to install using only the GPU Operator, without tinkering with these configurations.

Are you asking whether it would be simpler to let the GPU Operator manage both the driver and the Container Toolkit instead of managing these yourself? In general, the answer is yes, but it may depend on your exact use cases. Note that, as stated above, this would still update the container engine config on the host, but would not require any user intervention. @shivamerla could provide more information on which use cases the GPU Operator may not be able to handle, to help determine whether there is an argument for keeping the host drivers and/or toolkit.
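
For example, a Helm install along these lines would let the operator manage both components (a sketch; it assumes the nvidia Helm repo has already been added, and both options default to true anyway):

```bash
# Sketch: deploy the GPU Operator and let it manage both the driver and the toolkit.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true
```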

Finally, with regards to:

The documentation also states that the toolkit option of the GPU Operator uses nvidia-docker2, but I thought that this software was deprecated some time ago.

Yes, that package is on the deprecation path and we need to improve our documentation surrounding it. References to nvidia-docker2 have been removed from the Toolkit Container documentation, but we may have missed the places in the GPU Operator Documentation where this is discussed. In the case of Containerd specifically, only the nvidia-container-toolkit package and its dependencies (nvidia-container-toolkit-base, libnvidia-container-tools, and libnvidia-container1) are required. Also note that the next release of the nvidia-container-toolkit package will include CLI support for updating the Containerd config on the host to register the nvidia runtime.
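
As a rough illustration of the host-managed flow with that CLI (the exact invocation may differ in the released version):

```bash
# Sketch, assuming the nvidia-ctk CLI shipped with the nvidia-container-toolkit package:
# register the nvidia runtime in Containerd's config on the host, then restart Containerd.
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
```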

aavbsouza commented 1 year ago

Hello @elezar, thanks for the detailed answer. The GPU Operator changing the containerd config files (config.toml) makes more sense and, in hindsight, explains why it is not necessary to manually change these files when the operator configures both the driver and the NVIDIA toolkit. The location of these files would be given to the GPU Operator using these environment variables (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#custom-configuration-for-runtime-containerd), is that correct? I believe that adding these details to the documentation would be useful to others. Thanks!
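
Something along these lines is what I had in mind (just a sketch based on the defaults listed on that page, not verified on my cluster):

```bash
# Sketch: pointing the operator-managed toolkit at a custom Containerd setup via toolkit.env.
# The variable names come from the page linked above; the values are only the documented defaults.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set "toolkit.env[0].name=CONTAINERD_CONFIG" \
  --set "toolkit.env[0].value=/etc/containerd/config.toml" \
  --set "toolkit.env[1].name=CONTAINERD_SOCKET" \
  --set "toolkit.env[1].value=/run/containerd/containerd.sock" \
  --set "toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS" \
  --set "toolkit.env[2].value=nvidia"
```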

aavbsouza commented 1 year ago

Hello, I've installed the gpu-operator with the operator managing both the driver and the nvidia-toolkit. It was much easier than using the host-installed ones. The installation was a little rough because of what I believe to be this bug: https://github.com/containerd/containerd/issues/7843. The installation worked without issues after upgrading.

thanks