NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Jetson: `libcudnn_adv_infer_static_v8.a: file exists: unknown` error #274

Open ben-xD opened 2 years ago

ben-xD commented 2 years ago

Problem

Support for Jetson platforms has been in beta for more than a year. Unfortunately, the following simple command does not work on my Jetson. Fortunately, it is very easy to reproduce; just run:

```
docker run --runtime nvidia -it nvcr.io/nvidia/tensorrt:21.12-py3
```

Error

You will get:

```
$ docker run --runtime nvidia -it nvcr.io/nvidia/tensorrt:21.12-py3
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/63a5dc0a46e4f12d052b60007e09f18b5bb773903c054915cf6f843392531b40/merged/usr/lib/aarch64-linux-gnu/libcudnn_adv_infer_static_v8.a: file exists: unknown.
ERRO[0006] error waiting for container: context canceled
```

And let me pick out the juiciest part: `nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/id/merged/usr/lib/aarch64-linux-gnu/libcudnn_adv_infer_static_v8.a: file exists: unknown`
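To confirm that the conflicting file really is baked into the image (rather than created by the NVIDIA hook), one can list it while bypassing the NVIDIA runtime entirely. A minimal diagnostic sketch, assuming `runc` is available as a runtime name on your Docker install:

```
# Run the image with the default runc runtime so that no host files are
# injected, then check for the library named in the mount error.
docker run --rm --runtime runc nvcr.io/nvidia/tensorrt:21.12-py3 \
  ls -l /usr/lib/aarch64-linux-gnu/libcudnn_adv_infer_static_v8.a
```

If that prints the file, the image bundles its own copy, and the hook's attempt to mount the host copy over it is what fails.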

My details:

Running `nvidia-container-cli -k -d /dev/tty info`:

```
-- WARNING, the following logs are for debugging purposes only --

I0128 16:30:01.821611 9072 nvc.c:281] initializing library context (version=0.10.0+jetpack, build=61f57bcdf7aa6e73d9a348a7e36ec9fd94128fb2)
I0128 16:30:01.821757 9072 nvc.c:255] using root /
I0128 16:30:01.821803 9072 nvc.c:256] using ldcache /etc/ld.so.cache
I0128 16:30:01.821874 9072 nvc.c:257] using unprivileged user 1002:1002
I0128 16:30:01.822601 9073 driver.c:134] starting driver service
I0128 16:30:01.831030 9072 driver.c:231] driver service terminated with signal 15
nvidia-container-cli: initialization error: cuda error: no cuda-capable device is detected
```
`jetson_release -v`:

```
 - NVIDIA Jetson AGX Xavier [16GB]
   * Jetpack 4.5.1 [L4T 32.5.2]
   * NV Power Mode: MAXN - Type: 0
   * jetson_stats.service: active
 - Board info:
   * Type: AGX Xavier [16GB]
   * SOC Family: tegra194 - ID:25
   * Module: P2888-0001 - Board: P2822-0000
   * Code Name: galen
   * CUDA GPU architecture (ARCH_BIN): 7.2
   * Serial Number: 1421021087906
 - Libraries:
   * CUDA: 10.2.89
   * cuDNN: 8.0.0.180
   * TensorRT: 7.1.3.0
   * Visionworks: 1.6.0.501
   * OpenCV: NOT_INSTALLED compiled CUDA: NO
   * VPI: ii libnvvpi1 1.0.15 arm64 NVIDIA Vision Programming Interface library
   * Vulkan: 1.2.70
 - jetson-stats:
   * Version 3.1.2
   * Works on Python 3.6.9
```

My `uname -a`:

```
Linux desktopPC 4.9.201-tegra #1 SMP PREEMPT Wed May 5 09:32:25 PDT 2021 aarch64 aarch64 aarch64 GNU/Linux
```
`docker version`:

```
Client:
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.8
 Git commit:        20.10.7-0ubuntu5~18.04.3
 Built:             Mon Nov 1 01:04:31 2021
 OS/Arch:           linux/arm64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.8
  Git commit:       20.10.7-0ubuntu5~18.04.3
  Built:            Fri Oct 22 00:57:37 2021
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.5.5-0ubuntu3~18.04.1
  GitCommit:
 runc:
  Version:          1.0.1-0ubuntu2~18.04.1
  GitCommit:
 docker-init:
  Version:          0.19.0
  GitCommit:
```
My `dpkg -l '*nvidia*'`:

```
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                 Version                 Architecture            Description
+++-====================================-=======================-=======================-=============================================================================
un  libgldispatch0-nvidia                                                                (no description available)
ii  libnvidia-container-tools            1.7.0-1                 arm64                   NVIDIA container runtime library (command-line tools)
ii  libnvidia-container0:arm64           0.10.0+jetpack          arm64                   NVIDIA container runtime library
ii  libnvidia-container1:arm64           1.7.0-1                 arm64                   NVIDIA container runtime library
un  nvidia-304                                                                           (no description available)
un  nvidia-340                                                                           (no description available)
un  nvidia-384                                                                           (no description available)
un  nvidia-common                                                                        (no description available)
ii  nvidia-container-csv-cuda            10.2.89-1               arm64                   Jetpack CUDA CSV file
ii  nvidia-container-csv-cudnn           8.0.0.180-1+cuda10.2    arm64                   Jetpack CUDNN CSV file
ii  nvidia-container-csv-tensorrt        7.1.3.0-1+cuda10.2      arm64                   Jetpack TensorRT CSV file
ii  nvidia-container-csv-visionworks     1.6.0.501               arm64                   Jetpack VisionWorks CSV file
un  nvidia-container-runtime                                                             (no description available)
un  nvidia-container-runtime-hook                                                        (no description available)
ii  nvidia-container-toolkit             1.7.0-1                 arm64                   NVIDIA container runtime hook
un  nvidia-cuda-dev                                                                      (no description available)
un  nvidia-docker                                                                        (no description available)
ii  nvidia-docker2                       2.8.0-1                 all                     nvidia-docker CLI wrapper
ii  nvidia-l4t-3d-core                   32.5.2-20210709090156   arm64                   NVIDIA GL EGL Package
ii  nvidia-l4t-apt-source                32.5.2-20210709090156   arm64                   NVIDIA L4T apt source list debian package
ii  nvidia-l4t-bootloader                32.5.2-20210709090156   arm64                   NVIDIA Bootloader Package
ii  nvidia-l4t-camera                    32.5.2-20210709090156   arm64                   NVIDIA Camera Package
un  nvidia-l4t-ccp-t186ref                                                               (no description available)
ii  nvidia-l4t-configs                   32.5.2-20210709090156   arm64                   NVIDIA configs debian package
ii  nvidia-l4t-core                      32.5.2-20210709090156   arm64                   NVIDIA Core Package
ii  nvidia-l4t-cuda                      32.5.2-20210709090156   arm64                   NVIDIA CUDA Package
ii  nvidia-l4t-firmware                  32.5.2-20210709090156   arm64                   NVIDIA Firmware Package
ii  nvidia-l4t-graphics-demos            32.5.2-20210709090156   arm64                   NVIDIA graphics demo applications
ii  nvidia-l4t-gstreamer                 32.5.2-20210709090156   arm64                   NVIDIA GST Application files
ii  nvidia-l4t-init                      32.5.2-20210709090156   arm64                   NVIDIA Init debian package
ii  nvidia-l4t-initrd                    32.5.2-20210709090156   arm64                   NVIDIA initrd debian package
ii  nvidia-l4t-jetson-io                 32.5.2-20210709090156   arm64                   NVIDIA Jetson.IO debian package
ii  nvidia-l4t-jetson-multimedia-api     32.5.2-20210709090156   arm64                   NVIDIA Jetson Multimedia API is a collection of lower-level APIs that support
ii  nvidia-l4t-kernel                    4.9.201-tegra-32.5.2-20 arm64                   NVIDIA Kernel Package
ii  nvidia-l4t-kernel-dtbs               4.9.201-tegra-32.5.2-20 arm64                   NVIDIA Kernel DTB Package
ii  nvidia-l4t-kernel-headers            4.9.201-tegra-32.5.2-20 arm64                   NVIDIA Linux Tegra Kernel Headers Package
ii  nvidia-l4t-libvulkan                 32.5.2-20210709090156   arm64                   NVIDIA Vulkan Loader Package
ii  nvidia-l4t-multimedia                32.5.2-20210709090156   arm64                   NVIDIA Multimedia Package
ii  nvidia-l4t-multimedia-utils          32.5.2-20210709090156   arm64                   NVIDIA Multimedia Package
ii  nvidia-l4t-oem-config                32.5.2-20210709090156   arm64                   NVIDIA OEM-Config Package
ii  nvidia-l4t-tools                     32.5.2-20210709090156   arm64                   NVIDIA Public Test Tools Package
ii  nvidia-l4t-wayland                   32.5.2-20210709090156   arm64                   NVIDIA Wayland Package
ii  nvidia-l4t-weston                    32.5.2-20210709090156   arm64                   NVIDIA Weston Package
ii  nvidia-l4t-x11                       32.5.2-20210709090156   arm64                   NVIDIA X11 Package
ii  nvidia-l4t-xusb-firmware             32.5.2-20210709090156   arm64                   NVIDIA USB Firmware Package
un  nvidia-libopencl1-dev                                                                (no description available)
un  nvidia-prime                                                                         (no description available)
```
`nvidia-container-cli -V`:

```
cli-version: 1.7.0
lib-version: 0.10.0+jetpack
build date: 2021-11-30T19:53+00:00
build revision: f37bb387ad05f6e501069d99e4135a97289faf1f
build compiler: aarch64-linux-gnu-gcc-7 7.5.0
build platform: aarch64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
```
`/var/log/nvidia-container-runtime.log`:

```
2022/01/28 18:33:07 Using bundle directory: /run/containerd/io.containerd.runtime.v2.task/moby/9b8ee90ce1ae157106936445c2c429ab33f6293d4a2f9e1c400b60917d597a97
2022/01/28 18:33:07 Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/9b8ee90ce1ae157106936445c2c429ab33f6293d4a2f9e1c400b60917d597a97/config.json
2022/01/28 18:33:07 Looking for runtime binary 'docker-runc'
2022/01/28 18:33:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2022/01/28 18:33:07 Looking for runtime binary 'runc'
2022/01/28 18:33:07 Found runtime binary '/usr/sbin/runc'
2022/01/28 18:33:07 Running nvidia-container-runtime
2022/01/28 18:33:07 'create' command detected; modification required
2022/01/28 18:33:07 prestart hook path: /usr/bin/nvidia-container-runtime-hook
2022/01/28 18:33:07 Forwarding command to runtime
2022/01/28 18:33:07 Using bundle directory:
2022/01/28 18:33:07 Using OCI specification file path: config.json
2022/01/28 18:33:07 Looking for runtime binary 'docker-runc'
2022/01/28 18:33:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2022/01/28 18:33:07 Looking for runtime binary 'runc'
2022/01/28 18:33:07 Found runtime binary '/usr/sbin/runc'
2022/01/28 18:33:07 Running nvidia-container-runtime
2022/01/28 18:33:07 No modification required
2022/01/28 18:33:07 Forwarding command to runtime
```
`dmesg`:

```
[782180.143499] docker0: port 1(veth7284103) entered blocking state
[782180.143505] docker0: port 1(veth7284103) entered disabled state
[782180.143995] device veth7284103 entered promiscuous mode
[782180.153901] IPv6: ADDRCONF(NETDEV_UP): veth7284103: link is not ready
[782180.579633] eth0: renamed from veth7260057
[782180.602680] IPv6: ADDRCONF(NETDEV_CHANGE): veth7284103: link becomes ready
[782180.603100] docker0: port 1(veth7284103) entered blocking state
[782180.603108] docker0: port 1(veth7284103) entered forwarding state
[782185.791137] docker0: port 1(veth7284103) entered disabled state
[782185.791615] veth7260057: renamed from eth0
[782185.854805] docker0: port 1(veth7284103) entered disabled state
[782185.864577] device veth7284103 left promiscuous mode
[782185.864587] docker0: port 1(veth7284103) entered disabled state
```

Things I found

ben-xD commented 2 years ago

I had come across https://github.com/NVIDIA/nvidia-docker/issues/825#issuecomment-456198590:

> as part of v2 we prevent the container from starting if you have the NVIDIA driver

I am not sure if my issue is related to this. Perhaps @RenaudWasTaken would know? Thanks in advance :)

klueska commented 2 years ago

In general, only the containers packaged for L4T (in this case l4t-tensorrt) are designed to work on Jetson machines, e.g.:

```
docker run --runtime nvidia -it nvcr.io/nvidia/l4t-tensorrt:r8.0.1-runtime
```

This is because these containers rely on the host to inject all of the CUDA and other support files into the container at runtime rather than bundling them inside the container image (which keeps the images themselves relatively small). The error you are seeing occurs because the container stack is trying to inject a file at runtime that is already bundled inside the container image.
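On Jetson, this injection is driven by the CSV manifests installed by the `nvidia-container-csv-*` packages (visible in the `dpkg` listing above). A hedged sketch for inspecting them, assuming the default JetPack location `/etc/nvidia-container-runtime/host-files-for-container.d/`:

```
# List the CSV manifests that tell the runtime which host files to mount
ls /etc/nvidia-container-runtime/host-files-for-container.d/
# e.g. cuda.csv  cudnn.csv  tensorrt.csv  visionworks.csv  (names vary by JetPack)

# Look for an entry that collides with a file already present in the image
grep libcudnn_adv_infer_static_v8 \
  /etc/nvidia-container-runtime/host-files-for-container.d/*.csv
```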

That said, you may be able to leverage a new feature of the container support for Jetson that limits the set of files injected into a container to only the base L4T files. That way you can run any container built for ARM that needs only these base files in order to run.

You can do this by setting the following environment variable when you start the container.

```
NVIDIA_REQUIRE_JETPACK_HOST_MOUNTS=base-only
```

i.e.

```
$ docker run --runtime nvidia -it -e NVIDIA_REQUIRE_JETPACK_HOST_MOUNTS=base-only nvcr.io/nvidia/tensorrt:21.12-py3
```
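If the `base-only` mode takes effect, the cuDNN and TensorRT files inside the container should come from the image rather than from host bind mounts. A rough way to sanity-check this (assuming the image bundles its own copies, as the error above suggests):

```
# Start the container with only the base L4T files injected and verify
# that no host cuDNN bind mount is shadowing the image's own library.
docker run --rm --runtime nvidia \
  -e NVIDIA_REQUIRE_JETPACK_HOST_MOUNTS=base-only \
  nvcr.io/nvidia/tensorrt:21.12-py3 \
  sh -c 'mount | grep -i cudnn || echo "no host cuDNN mounts"'
```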