Container support broken on master

madisongh commented 3 years ago

Describe the bug

The nvidia-container-toolkit program is crashing with a segmentation fault when trying to start a container.

The segfault happens during teardown of the RPC communication it uses, and appears to be caused by the newer libtirpc version (1.3.2) in OE-Core master. Replacing that version with a statically linked copy of libtirpc pulled from OE-Core dunfell eliminates the segfault, but setup still fails with:

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver client creation failed: RPC: Remote system error - Cannot allocate memory: unknown.

To Reproduce

Steps to reproduce the behavior:

1. Use tegra-demo-distro, branch master
2. Build demo-image-full
3. Load onto target (tested with Xavier NX devkit)
4. Try docker run --net=host --runtime nvidia --rm --ipc=host --cap-add SYS_PTRACE -e DISPLAY=$DISPLAY -it nvcr.io/nvidia/l4t-base:r32.5.0

madisongh commented 3 years ago

This is due to libtirpc trying to allocate arrays based on the fd table size, which has gone from thousands to billions of entries. It also appears that libtirpc isn't properly handling memory allocation failures in some of its code paths, leading to the segmentation faults.
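To make the scale of the problem concrete, here is a rough sketch of how an array sized by the fd table grows with the open-file limit, and what capping it does. This is not libtirpc's code: the helper names, the 8-byte slot size (one pointer on a 64-bit target), and the sample limits of 4096 and 2**30 are all assumptions chosen for illustration.

```python
import resource

POINTER_SIZE = 8  # assumed bytes per table slot (one pointer on a 64-bit target)
FD_CAP = 1024     # the "1K" cap discussed in this thread


def table_slots(fd_limit, cap=None):
    """Slots an fd-indexed table would need, optionally capped."""
    return fd_limit if cap is None else min(fd_limit, cap)


def table_bytes(fd_limit, cap=None, slot_size=POINTER_SIZE):
    """Bytes needed for a table with one slot per possible fd."""
    return table_slots(fd_limit, cap) * slot_size


if __name__ == "__main__":
    # Hypothetical limits standing in for "thousands" and "billions".
    examples = {"thousands (hypothetical)": 4096, "billions (hypothetical)": 2**30}

    # Add the limit of the process running this script, if it is finite.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY:
        examples["this process"] = soft

    for label, limit in examples.items():
        print(f"{label}: limit={limit} "
              f"uncapped={table_bytes(limit)} bytes "
              f"capped={table_bytes(limit, cap=FD_CAP)} bytes")
```

Under these assumptions a single table goes from 32 KiB at a limit of 4096 to 8 GiB at a limit of 2**30, which is consistent with the "Cannot allocate memory" error on a small embedded target. With the cap applied, the same table stays at 8 KiB regardless of the limit.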

~~You can work around the problem by explicitly using --ulimit nofile=1024:4096, or some other more reasonable limits, on the docker run command, but~~ #763 patches the version of the RPC library statically linked into the container tools to cap the array sizes at 1K, to work around the problem.

(EDIT: The workaround mentioned above worked for me with the original upstream patch to libtirpc applied. You might be able to make it work without patching libtirpc at all by also lowering your own process's hard open-file limit, e.g. ulimit -Hn 4096, but I haven't actually tested this.)
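For anyone scripting the --ulimit variant, a minimal sketch of assembling the command line follows. The helper name docker_run_with_nofile and its parameters are hypothetical; only the --ulimit nofile=soft:hard, --runtime nvidia and --rm options and the image name come from this thread. As noted above, this workaround was only seen to work with the upstream libtirpc patch applied, so treat it as an experiment rather than a fix.

```python
import shlex


def docker_run_with_nofile(image, soft=1024, hard=4096, extra_args=()):
    """Build a docker run argv that sets explicit open-file limits.

    Hypothetical helper: it only assembles the argument list and does
    not run docker itself.
    """
    if soft > hard:
        raise ValueError("soft limit must not exceed hard limit")
    return [
        "docker", "run",
        "--ulimit", f"nofile={soft}:{hard}",  # soft:hard open-file limits
        "--runtime", "nvidia",
        "--rm",
        *extra_args,
        image,
    ]


if __name__ == "__main__":
    argv = docker_run_with_nofile(
        "nvcr.io/nvidia/l4t-base:r32.5.0",
        extra_args=["--net=host", "-it"],
    )
    # Print a copy-pasteable command line instead of executing it.
    print(shlex.join(argv))
```

Building the command as a list keeps quoting out of the picture if it is later handed to subprocess.run, and shlex.join is used only to display it.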

ichergui commented 3 years ago

Hey @madisongh

I tried docker and I got the same issue as you.

Here are the logs:

root@jetson-tx2-devkit:~# docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-ml:r32.5.0-py3
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver client creation failed: RPC: Remote system error - Cannot allocate memory: unknown.
root@jetson-tx2-devkit:~#

To Reproduce

Steps to reproduce the behavior:

1. Use tegra-demo-distro, branch master
2. Build demo-image-full
3. Load onto target (tested with Jetson TX2 devkit)
4. Try the following commands:

# docker pull nvcr.io/nvidia/l4t-ml:r32.5.0-py3
# docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-ml:r32.5.0-py3

OE4T / meta-tegra

Container support broken on master #760