Open aavbsouza opened 1 year ago
Thanks @aavbsouza for using the docs. Were you reading the section at the following link?
Hello @mikemckiernan , I was looking into these sections: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#bare-metal-passthrough-with-default-configurations-on-red-hat-enterprise-linux
The customization of the containerd runtime only appears if using the host nvidia-toolkit. In the past I was able to install the GPU Operator using both the host NVIDIA driver and toolkit, and I was wondering if it would be simpler to install using only the GPU Operator, without tinkering with these configurations.
@aavbsouza using the NVIDIA Container Toolkit on the host assumes that it has already been set up for the container engine being used. In the case of Containerd, this implies that the `nvidia` runtime class has been added to the `config.toml` for Containerd, with the binary set to the `nvidia-container-runtime` installed on the host.
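For readers following along, a host-managed setup typically means a runtime entry like the following in Containerd's `config.toml` (a minimal sketch; the exact plugin path and binary location depend on your Containerd version and install prefix):

```toml
# Registers the "nvidia" runtime class with Containerd's CRI plugin.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    # Path to the NVIDIA runtime shim installed on the host.
    BinaryName = "/usr/bin/nvidia-container-runtime"
```

Containerd must be restarted after editing this file for the runtime class to take effect.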
Furthermore, in the case where the driver container is used, this also assumes that the `root` option in the NVIDIA Container Toolkit config (usually `/etc/nvidia-container-runtime/config.toml`) has been set accordingly.
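As an illustration of that `root` option, the toolkit config would point at the driver container's mount location rather than the host filesystem (a sketch; the `/run/nvidia/driver` path is the conventional driver-container mount point and may differ in your deployment):

```toml
# /etc/nvidia-container-runtime/config.toml
[nvidia-container-cli]
# Where the driver's libraries and binaries are rooted.
# For host-installed drivers this is typically left unset (i.e. "/").
root = "/run/nvidia/driver"
```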
When the operator is used to manage the NVIDIA Container Toolkit, the toolkit is installed on the host and the config for Containerd (or CRI-O, or Docker) is updated to register the `nvidia` runtime class. This does modify Containerd's `config.toml` on the host, but does not require the user to edit the file themselves. The `config.toml` file for the toolkit is also updated to refer to the correct `root`, depending on whether the driver container or host-installed drivers are used.
With regards to:
The customization of the containerd runtime only appears if using the host nvidia-toolkit. In the past I was able to install the GPU Operator using both the host NVIDIA driver and toolkit, and I was wondering if it would be simpler to install using only the GPU Operator, without tinkering with these configurations.
Are you asking whether it would be simpler to let the GPU Operator manage both the driver and Container Toolkit instead of managing these yourself? In general, the answer is yes, but it may depend on your exact use cases. Note that, as stated above, this would still update the container engine config on the host, but would not require any user intervention. @shivamerla could provide more information on which use cases the GPU Operator may not be able to handle, to help determine whether there is an argument for keeping the host drivers and / or toolkit.
Finally, with regards to:
The documentation also states that the toolkit option of the GPU Operator uses `nvidia-docker2`, but I thought that this software was deprecated some time ago.
Yes, that package is on the deprecation path and we need to improve our documentation around it. References to `nvidia-docker2` have been removed from the Toolkit Container documentation, but we may have missed places in the GPU Operator documentation where it is discussed. In the case of Containerd specifically, only the `nvidia-container-toolkit` package and its dependencies (`nvidia-container-toolkit-base`, `libnvidia-container-tools`, and `libnvidia-container1`) are required. Also note that the next release of the `nvidia-container-toolkit` package will include CLI support for updating the Containerd config on the host to register the `nvidia` runtime.
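For context, that CLI support later shipped as the `nvidia-ctk` tool. A sketch of the invocation (exact flags may vary between toolkit releases; check `nvidia-ctk runtime configure --help` for your version):

```shell
# Add the "nvidia" runtime class to Containerd's config.toml
# and optionally make it the default runtime.
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default

# Restart Containerd so the new runtime class is picked up.
sudo systemctl restart containerd
```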
Hello @elezar, thanks for the detailed answer. The GPU Operator changing the containerd config files (`config.toml`) makes more sense, and in hindsight explains why these files do not need to be changed manually when the operator configures both the driver and the NVIDIA toolkit. The location of these files would be given to the GPU Operator using these environment variables (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#custom-configuration-for-runtime-containerd), is that correct? I believe adding these details to the documentation would be useful to others. Thanks!
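For anyone landing here, the environment variables from that doc section are passed to the toolkit container via Helm values. A sketch of a non-default Containerd layout (paths and values here are illustrative, not required; the variable names come from the linked documentation):

```shell
# Point the operator-managed toolkit at a custom Containerd config/socket.
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/etc/containerd/config.toml \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/run/containerd/containerd.sock \
  --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[2].value=nvidia \
  --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
  --set-string toolkit.env[3].value=true
```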
Hello, I've installed the GPU Operator with the operator managing both the driver and the NVIDIA toolkit. It was much easier than using the host-installed ones. The installation was a little rough because of what I believe to be this bug: https://github.com/containerd/containerd/issues/7843. After upgrading containerd, the installation worked without issues.
thanks
Hello everyone, I am reading the documentation. It appears to be necessary to change the `config.toml` of the containerd runtime only if I use a host-installed nvidia-toolkit. Is that interpretation correct? The documentation also states that the toolkit option of the GPU Operator uses `nvidia-docker2`, but I thought that this software was deprecated some time ago. Thanks 👍