Closed madisongh closed 3 years ago
This is due to libtirpc trying to allocate arrays based on the fd table size, which has gone from thousands to billions in size. It also appears that libtirpc isn't properly handling memory allocation failures in some of its code paths, leading to the segmentation faults.
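To get a feel for the numbers, here is a rough sketch of how the memory for one fd-indexed array grows with the nofile limit. The helper name estimate_fd_table_bytes and the 8-byte entry size (one pointer on a 64-bit system) are assumptions for illustration only, not taken from the libtirpc source.

```shell
# Hypothetical helper: bytes needed for an array with one entry per possible fd.
# $1: nofile limit; $2: assumed bytes per entry (default 8, an illustrative guess).
estimate_fd_table_bytes() {
    echo $(( $1 * ${2:-8} ))
}

# Report the estimate for the current soft limit of this shell.
current=$(ulimit -n)
if [ "$current" = "unlimited" ]; then
    echo "nofile soft limit is unlimited"
else
    echo "nofile soft limit: $current, about $(estimate_fd_table_bytes "$current") bytes per array"
fi
```

Under these assumptions a limit of 1024 needs about 8 KB per array, while a limit of one billion needs about 8 GB, which would be consistent with the "Cannot allocate memory" failures reported in this thread.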
You can work around the problem by explicitly using --ulimit nofile=1024:4096, or some other more reasonable limits, on the docker run command, but #763 patches the version of the RPC library statically linked into the container tools to cap the array sizes down to 1K, to work around the problem.
(EDIT: The workaround mentioned above worked for me with the original upstream patch to libtirpc applied. You might be able to make it work without patching libtirpc at all by also setting your own process's ulimit -H 4096, but I haven't actually tested this.)
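The --ulimit workaround described above could be wrapped up roughly as follows. The function name docker_run_capped and the DOCKER override variable are hypothetical, added here only so the command line can be inspected without starting a container. The nofile=1024:4096 value is the soft:hard pair quoted in this thread.

```shell
# Hypothetical wrapper: run a container with a capped nofile limit (soft 1024, hard 4096).
# Set DOCKER=echo for a dry run that prints the arguments instead of calling docker.
docker_run_capped() {
    "${DOCKER:-docker}" run --ulimit nofile=1024:4096 "$@"
}

# Example (not executed here):
#   docker_run_capped -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-ml:r32.5.0-py3
```

Running ulimit -n inside the container started this way is a quick check that the lower limit took effect.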
Hey @madisongh, I tried docker and I got the same issue as you.
root@jetson-tx2-devkit:~# docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-ml:r32.5.0-py3
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver client creation failed: RPC: Remote system error - Cannot allocate memory: unknown.
root@jetson-tx2-devkit:~#
To Reproduce
Steps to reproduce the behavior:
# docker pull nvcr.io/nvidia/l4t-ml:r32.5.0-py3
# docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-ml:r32.5.0-py3
Describe the bug
The nvidia-container-toolkit program is crashing with a segmentation fault when trying to start a container. The segfault is happening during teardown of the RPC communication it uses, which appears to be due to the newer libtirpc version (1.3.2) in OE-Core master. Replacing the use of that version with a statically-linked copy of the libtirpc pulled from OE-Core dunfell eliminates the segfault, but setup still fails with:
To Reproduce
Steps to reproduce the behavior:
tegra-demo-distro, branch master
demo-image-full
docker run --net=host --runtime nvidia --rm --ipc=host --cap-add SYS_PTRACE -e DISPLAY=$DISPLAY -it nvcr.io/nvidia/l4t-base:r32.5.0