NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Could nvidia-container-runtime work without ld.so.cache #71

Open kochjr opened 1 year ago

kochjr commented 1 year ago

Would it be possible for nvidia-container-runtime to work without the use of ld.so.cache or ldconfig? I got this to work in Buildroot, but it requires enabling BR2_PACKAGE_GLIBC_UTILS to get ldconfig and then generating the cache file. It is common for embedded Linux systems not to include these, because libraries are kept in a single flat directory (e.g. /usr/lib).
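
For reference, the workaround I used boils down to roughly the following (a sketch of my setup, not a polished recipe): enable the option below in the Buildroot config, then run ldconfig once on the booted target to produce /etc/ld.so.cache.

# Buildroot config fragment -- installs ldconfig on the target
BR2_PACKAGE_GLIBC_UTILS=y

# then, on the booted target (one time):
ldconfig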

elezar commented 1 year ago

@kochjr we are working on expanding the CDI support in the NVIDIA Container Toolkit, and this should allow for the functionality that you're proposing.

As soon as v1.12.0-rc.2 is released I will update this issue with some instructions for early testing.

As a matter of interest, which container engine (docker, podman, containerd) are you using in this case?

kochjr commented 1 year ago

Thanks for the quick reply; I look forward to trying this when it is ready. With my current Buildroot version (2022.02 LTS) I am using Docker 20.10.14. More specific version info, in case it matters:

Client:
 Version:           20.10.14
 API version:       1.41
 Go version:        go1.17.11
 Git commit:        20.10.14
 Built:             unknown-buildtime
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.14
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.11
  Git commit:       buildroot
  Built:            
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.5.11
  GitCommit:        
 runc:
  Version:          1.1.2
  GitCommit:   

and the runtime version:

NVIDIA Container Runtime version 1.11.0
commit: d9de4a0
spec: 1.0.2-dev

runc version 1.1.2
spec: 1.0.2-dev
go: go1.17.11
libseccomp: 2.5.3

elezar commented 1 year ago

Sorry for the delay in getting back to this.

A follow-up question would be -- what would the expectation be once the libraries are mounted into the container? Would the user have to update the ldcache in the container manually?

Note that it is possible to set the ldconfig path in the /etc/nvidia-container-runtime/config.toml file. If this is replaced with a no-op wrapper that can still be called, could that be used as a workaround?
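
As a rough illustration (untested), the relevant key sits under the nvidia-container-cli section, and a leading @ means the path is resolved on the host rather than in the container:

# /etc/nvidia-container-runtime/config.toml (excerpt)
[nvidia-container-cli]
# Point this at a wrapper that simply exits successfully instead of the real ldconfig
ldconfig = "@/usr/local/bin/ldconfig-noop"

where /usr/local/bin/ldconfig-noop is just a placeholder script along the lines of:

#!/bin/sh
# ldconfig-noop: hypothetical stand-in that accepts any arguments and does nothing
exit 0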

kochjr commented 12 months ago

Sorry for the delay. Ideally the user wouldn't have to update the ldcache within any container leveraging the GPU. Instead, any required dependencies from the host would be passed in or made accessible as needed, or the paths could be arguments to the runtime setup. When I previously tried to make this work with Buildroot, I had to boot the OS for the first time (after building from scratch) and run ldconfig to generate the ld.so.cache; after that, containers would run just fine. However, having ldconfig and ld.so.cache installed on embedded systems isn't common. If there were a way for this to be a no-op and/or to be handled at compilation time of the OS and applications, that would be beneficial. I can take another look at this and provide more details.
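
For example, on the container side it would already be enough if the injected driver libraries landed somewhere the dynamic loader searches by default, or if the loader were pointed at them explicitly -- something like the following (illustrative only; the image name my-cuda-image and the path /usr/lib/nvidia are made up):

# Tell the loader where the injected driver libraries live instead of relying on ld.so.cache
docker run --rm --runtime=nvidia \
    -e LD_LIBRARY_PATH=/usr/lib/nvidia \
    my-cuda-image nvidia-smi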

elezar commented 12 months ago

@kochjr I think what you're describing is doable with the Container Device Interface (CDI). In more recent versions of the NVIDIA Container Toolkit we have the nvidia-ctk cdi generate command, which generates a CDI specification for the devices available on the system. This can be updated to suit your configuration -- for example, the location of the drivers on the host and the desired paths in the container.
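
Rough usage looks like this (device names and output paths will differ per system, and the Docker side still depends on CDI support landing there):

# Generate a CDI specification describing the GPUs and driver files on this host
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# A CDI-aware engine (e.g. recent podman) can then inject the devices by name
podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi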

kochjr commented 11 months ago

I can give that a shot, but does Docker support CDI yet? I have seen threads where you have been communicating with the Docker developers about adding that capability so it is more compatible with other container engines. Any details or help there would be appreciated.

I have since upgraded my Buildroot version to 2023.02 LTS and to version 1.13.1 of the NVIDIA toolkit (libnvidia-container and nvidia-container-toolkit). Currently I am unable to easily upgrade past that, because starting with version 1.13.2 I believe Go >= 1.20.x is required, and I am using the latest version of Go available for Buildroot (1.19.7).

Without /etc/ld.so.cache present, the error I get when trying to run a GPU-enabled container is:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: detection error: open failed: /etc/ld.so.cache: no such file or directory: unknown.

If I install ldconfig (not desirable) and then generate /etc/ld.so.cache, the container works. I think it also works if I generate /etc/ld.so.cache and then remove ldconfig. Ultimately I would like the cache to be a file in my target image that is generated in my host (compiler) environment. That target image should be installable onto computers with the same configuration and able to run Docker containers without the manual steps of generating the ld.so.cache on a booted system, extracting the file, and copying it into my build environment to build into the image.
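
One approach I still want to try (untested, and it assumes the host and target are both x86_64 glibc so the cache format matches) is pre-generating the cache from a Buildroot post-build script so it ships inside the image:

#!/bin/sh
# Hypothetical post-build script, hooked up via BR2_ROOTFS_POST_BUILD_SCRIPT.
# Buildroot passes the target rootfs directory as the first argument.
TARGET_DIR="$1"

# Run the host's ldconfig rooted at the target tree; this writes
# ${TARGET_DIR}/etc/ld.so.cache without needing ldconfig on the target.
/sbin/ldconfig -r "${TARGET_DIR}"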

Here is my current configuration: docker info

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 23.0.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 nvidia
 Default Runtime: runc
 Init Binary: /usr/bin/docker-init
 containerd version: 
 runc version: 
 init version: 
 Security Options:
  seccomp
   Profile: builtin
 Kernel Version: 6.1.14
 Operating System: Buildroot 2023.02
 OSType: linux
 Architecture: x86_64
 CPUs: 20
 Total Memory: 31.2GiB
 Name: buildroot
 ID: 13326abf-792b-4954-b926-7c21d03baa93
 Docker Root Dir: /internal/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

nvidia-ctk --version

NVIDIA Container Toolkit CLI version 1.13.1
commit: 28b7066

nvidia-container-runtime --version

NVIDIA Container Runtime version 1.13.1
commit: 28b7066
spec: 1.0.2-dev

runc version 1.1.4
spec: 1.0.2-dev
go: go1.19.7
libseccomp: 2.5.4

nvidia-container-cli -V

cli-version: 1.13.1
lib-version: 1.13.1
build date: 2023-07-20T19:42+00:00
build revision: 6f4aea0fca16aaff01bab2567adb34ec30847a0e
build compiler: toolchain-wrapper 11.3.0
build platform: x86_64

jerrykuku commented 10 months ago

Hi, can you share how you built nvidia-container-runtime in Buildroot? My build failed, and it would be awesome if you could share your setup.

kochjr commented 10 months ago

Hey @jerrykuku, this probably belongs in a different thread, but I will give you a quick high-level rundown of what I did to get the nvidia-container-toolkit working in Buildroot. Please note that I am using the 2023.02 LTS release and there are probably many things that could be improved (I'm not a Buildroot expert). Also, since Buildroot 2023.02 LTS currently only supports Go 1.19, I am staying on version 1.13.1 of libnvidia-container and nvidia-container-toolkit, because later versions require Go >= 1.20 for the new error-wrapping support.

1. Create a libnvidia-container package.
2. Patch the mk/common.mk file to replace the REVISION variable with the hardcoded hash, because I download the tar file (versus cloning the git repo).
3. Create a nvidia-container-toolkit package.
4. Create a new (or update the existing) nvidia-driver package to get something more modern than what Buildroot has out of the box.

The important part is probably the package makefile. Here is what I did for my nvidia-container-toolkit.mk file (which could probably be improved a lot):

################################################################################
#
# nvidia-container-toolkit
#
################################################################################

NVIDIA_CONTAINER_TOOLKIT_VERSION = 1.13.1
NVIDIA_CONTAINER_TOOLKIT_GIT_COMMIT = 28b7066
NVIDIA_CONTAINER_TOOLKIT_SITE = $(call github,NVIDIA,nvidia-container-toolkit,v$(NVIDIA_CONTAINER_TOOLKIT_VERSION))

NVIDIA_CONTAINER_TOOLKIT_LICENSE = Apache-2.0
NVIDIA_CONTAINER_TOOLKIT_LICENSE_FILES = LICENSE

NVIDIA_CONTAINER_TOOLKIT_CLI_VERSION_PACKAGE = github.com/NVIDIA/nvidia-container-toolkit/internal/info
NVIDIA_CONTAINER_TOOLKIT_BUILD_TARGETS = cmd/nvidia-container-runtime cmd/nvidia-container-runtime-hook cmd/nvidia-ctk
NVIDIA_CONTAINER_TOOLKIT_LDFLAGS = -v -s -w -X '$(NVIDIA_CONTAINER_TOOLKIT_CLI_VERSION_PACKAGE).gitCommit=$(NVIDIA_CONTAINER_TOOLKIT_GIT_COMMIT)' \
    -X '$(NVIDIA_CONTAINER_TOOLKIT_CLI_VERSION_PACKAGE).version=$(NVIDIA_CONTAINER_TOOLKIT_VERSION)' \
    -extldflags=-Wl,-z,lazy
NVIDIA_CONTAINER_TOOLKIT_TAGS = cgo static_build
NVIDIA_CONTAINER_TOOLKIT_INSTALL_BINS = nvidia-container-runtime nvidia-container-runtime-hook nvidia-ctk

define NVIDIA_CONTAINER_TOOLKIT_INSTALL_SUPPORT
    ln -fs /usr/bin/nvidia-container-runtime-hook \
        $(TARGET_DIR)/usr/bin/nvidia-container-toolkit
    $(INSTALL) -D -m 644 $(@D)/oci-nvidia-hook.json \
        $(TARGET_DIR)/usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
    $(INSTALL) -D -m 755 $(@D)/oci-nvidia-hook \
        $(TARGET_DIR)/usr/libexec/oci/hooks.d/oci-nvidia-hook
endef

NVIDIA_CONTAINER_TOOLKIT_POST_INSTALL_TARGET_HOOKS += NVIDIA_CONTAINER_TOOLKIT_INSTALL_SUPPORT

$(eval $(golang-package))

Here is my Config.in file:

config BR2_PACKAGE_NVIDIA_CONTAINER_TOOLKIT
        bool "nvidia-container-toolkit"
        depends on BR2_PACKAGE_HOST_GO_TARGET_ARCH_SUPPORTS
        depends on BR2_PACKAGE_HOST_GO_TARGET_CGO_LINKING_SUPPORTS
        depends on BR2_TOOLCHAIN_HAS_THREADS
        depends on BR2_TOOLCHAIN_USES_GLIBC # fexecve
        select BR2_PACKAGE_LIBNVIDIA_CONTAINER
        select BR2_PACKAGE_NVIDIA_CONTAINER_RUNTIME
        help
          NVIDIA Container Toolkit is an OCI-spec hook that adds
          support for mounting GPUs into containers.

          https://github.com/NVIDIA/nvidia-container-toolkit

comment "nvidia-container-toolkit needs a glibc toolchain w/ threads"
        depends on BR2_PACKAGE_HOST_GO_TARGET_ARCH_SUPPORTS && \
            BR2_PACKAGE_HOST_GO_TARGET_CGO_LINKING_SUPPORTS
        depends on !BR2_TOOLCHAIN_HAS_THREADS || \
            !BR2_TOOLCHAIN_USES_GLIBC
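
If you carry these as a BR2_EXTERNAL tree (probably the easiest way to keep them outside the Buildroot source), the plumbing is roughly as follows (the tree name MYTREE is just a placeholder):

# external.desc
name: MYTREE
desc: Out-of-tree packages (libnvidia-container, nvidia-container-toolkit, ...)

# external.mk
include $(sort $(wildcard $(BR2_EXTERNAL_MYTREE_PATH)/package/*/*.mk))

# Config.in at the top of the external tree
source "$BR2_EXTERNAL_MYTREE_PATH/package/libnvidia-container/Config.in"
source "$BR2_EXTERNAL_MYTREE_PATH/package/nvidia-container-toolkit/Config.in"

and then the defconfig just needs the following (BR2_PACKAGE_LIBNVIDIA_CONTAINER gets selected automatically via Config.in):

BR2_PACKAGE_NVIDIA_CONTAINER_TOOLKIT=y
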
jerrykuku commented 10 months ago

Thank you very much for sharing. I will test it the way you did and give you feedback if it works.