OE4T / meta-tegra

BSP layer for NVIDIA Jetson platforms, based on L4T
MIT License

Container support broken on master #760

Closed: madisongh closed this issue 3 years ago

madisongh commented 3 years ago

Describe the bug

The nvidia-container-toolkit program is crashing with a segmentation fault when trying to start a container.

The segfault is happening during teardown of the RPC communication it uses, which appears to be due to the newer libtirpc version (1.3.2) in OE-Core master. Replacing the use of that version with a statically-linked copy of the libtirpc pulled from OE-Core dunfell eliminates the segfault, but setup still fails with:

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver client creation failed: RPC: Remote system error - Cannot allocate memory: unknown.

To Reproduce

Steps to reproduce the behavior:

  1. Use tegra-demo-distro, branch master
  2. Build demo-image-full
  3. Load onto target (tested with Xavier NX devkit)
  4. Try docker run --net=host --runtime nvidia --rm --ipc=host --cap-add SYS_PTRACE -e DISPLAY=$DISPLAY -it nvcr.io/nvidia/l4t-base:r32.5.0

madisongh commented 3 years ago

This is due to libtirpc trying to allocate arrays based on the fd table size, which has gone from thousands to billions in size. It also appears that libtirpc isn't properly handling memory allocation failures in some of its code paths, leading to the segmentation faults.
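To give a sense of scale for the explanation above, here is a rough shell sketch that estimates how large a single array becomes when it has one slot per possible file descriptor. The function name, the 8-byte slot size (an assumed pointer size on a 64-bit target), and the one-billion example limit are illustrative assumptions; none of them are taken from the libtirpc source.

```sh
#!/bin/sh
# Illustrative only: estimate the size of one array that has a slot per
# possible file descriptor. slot_bytes defaults to 8, an ASSUMED pointer
# size on a 64-bit target; it is not read from libtirpc.
estimate_fd_array_bytes() {
    nofile=$1
    slot_bytes=${2:-8}
    echo $(( nofile * slot_bytes ))
}

# A limit in the thousands: a few KiB per array.
estimate_fd_array_bytes 4096          # prints 32768

# A limit around one billion (hypothetical value): several GB per array,
# which is the kind of request that can fail with "Cannot allocate memory".
estimate_fd_array_bytes 1000000000    # prints 8000000000
```

To try it against a live limit, something like `estimate_fd_array_bytes "$(ulimit -n)"` should work, provided the shell reports a number rather than `unlimited`.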

You can work around the problem by explicitly passing --ulimit nofile=1024:4096 (or some other more reasonable limits) on the docker run command. As a fix, #763 patches the version of the RPC library that is statically linked into the container tools, capping the array sizes at 1K.

(EDIT: The workaround mentioned above worked for me with the original upstream patch to libtirpc applied. You might be able to make it work without patching libtirpc at all by also lowering your own process's hard limit with ulimit -Hn 4096, but I haven't actually tested this.)
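As a small sketch of how the --ulimit workaround described above might be applied only when needed, the hypothetical helper below prints the flag when a given open-file limit looks unreasonably large. The helper name and the default threshold of 4096 are assumptions for illustration; the flag value 1024:4096 is the one quoted in this thread. How the limit value is obtained is left to the reader.

```sh
#!/bin/sh
# Hypothetical helper: print the --ulimit flag from this thread when the
# supplied open-file limit is above a threshold (default 4096, an assumed
# cutoff) or is reported as "unlimited". Prints nothing otherwise.
suggest_nofile_flag() {
    current=$1
    threshold=${2:-4096}
    if [ "$current" = "unlimited" ] || [ "$current" -gt "$threshold" ]; then
        printf '%s\n' "--ulimit nofile=1024:4096"
    fi
}

# Example calls with fixed values (no docker invocation happens here):
suggest_nofile_flag 1024          # prints nothing
suggest_nofile_flag 1000000000    # prints --ulimit nofile=1024:4096
```

The output could then be spliced into a docker run command line, for example `docker run $(suggest_nofile_flag "$limit") --runtime nvidia ...`, where `limit` holds whatever value you measured. Note that the limit a container sees may differ from the one in your interactive shell.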

ichergui commented 3 years ago

Hey @madisongh

I tried docker and I got the same issue as you.

To Reproduce

Steps to reproduce the behavior:

  1. Use tegra-demo-distro, branch master
  2. Build demo-image-full
  3. Load onto target (tested with Jetson TX2 devkit)
  4. Try the following commands:
    # docker pull nvcr.io/nvidia/l4t-ml:r32.5.0-py3
    # docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-ml:r32.5.0-py3