NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

cuda-license/cuda-nvml-dev pulled indirectly in rockylinux #220

Open abellina opened 1 year ago

abellina commented 1 year ago

I am trying to install an RPM that depends on libnvidia-ml.so in a Rocky Linux image: https://github.com/NVIDIA/spark-rapids/blob/branch-23.04/docs/additional-functionality/shuffle-docker-examples/Dockerfile.rocky_no_rdma#L38

However, during docker build, Rocky Linux pulls in extra packages that it doesn't really need: cuda-license-10-1 and cuda-nvml-dev-10-1. If I skip installing this RPM and just set up a Docker image based on https://github.com/NVIDIA/spark-rapids/blob/branch-23.04/docs/additional-functionality/shuffle-docker-examples/Dockerfile.rocky_no_rdma#L30, libnvidia-ml.so.1 is already there, so I think this library shows up at runtime (via nvidia-docker). I would like to find a way to yum install this RPM without these extra packages (note also that they are for CUDA 10.1, and I am trying to use 11.5+).

Note that the RPM doesn't have an explicit dependency on the CUDA package. Rather, one of the shared libraries in the package links against libnvidia-ml.so.1:

ldd libuct_cuda.so
...
      libcudart.so.11.0 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0 (0x00007f20c076f000)
      libnvidia-ml.so.1 => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 (0x00007f20bf7d6000)
...

I have a workaround using rpm --nodeps, but I wanted to get some feedback on whether this behavior is expected and what the correct approach should be. Thanks in advance.

elezar commented 1 year ago

Note that libnvidia-ml.so is part of the driver installation on the host and should not be included in the image. This file is injected into the container on create by the NVIDIA container stack.

There may be a stub for this library available which you could use if you need to link against it at build time. It might be worth working with the openucx community to remove the direct dependency if one exists.

abellina commented 1 year ago

Thank you @elezar. The output of rpm -qp --requires is below.

My question is that the package specifies libcuda.so in much the same way as libnvidia-ml.so. Is libcuda.so available at image build time? For example, this package doesn't specifically require cuda-11-*, yet it indirectly does via the shared libraries listed here.

$ rpm -qp ucx-cuda-1.14.0-1.el8.x86_64.rpm --requires
libc.so.6()(64bit)
libc.so.6(GLIBC_2.2.5)(64bit)
libc.so.6(GLIBC_2.4)(64bit)
libcuda.so.1()(64bit)
libcudart.so.11.0()(64bit)
libcudart.so.11.0(libcudart.so.11.0)(64bit)
libdl.so.2()(64bit)
libdl.so.2(GLIBC_2.2.5)(64bit)
libm.so.6()(64bit)
libnuma.so.1()(64bit)
libnvidia-ml.so.1()(64bit)
libpthread.so.0()(64bit)
libpthread.so.0(GLIBC_2.2.5)(64bit)
librt.so.1()(64bit)
libucm.so.0()(64bit)
libucs.so.0()(64bit)
libuct.so.0()(64bit)
rpmlib(CompressedFileNames) <= 3.0.4-1
rpmlib(FileDigests) <= 4.6.0-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rpmlib(PayloadIsXz) <= 5.2-1
rtld(GNU_HASH)
ucx(x86-64) = 1.14.0-1.el8

elezar commented 1 year ago

Note that these are runtime dependencies. Furthermore, libcuda.so and libnvidia-ml.so are part of the driver installation and are injected by the NVIDIA container stack when a container is run. If these dependencies are required at build time, stubs should be used instead.
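To see which entries in a package's requirements fall into this driver-provided category, the `rpm -qp --requires` output can be filtered. The following is only a minimal sketch: the function name `driver_provided_requires` is hypothetical, and it assumes the only driver libraries of interest are the two named in this thread (libcuda.so and libnvidia-ml.so).

```shell
#!/bin/sh
# Hypothetical helper (not part of any NVIDIA tooling): reads the output of
# `rpm -qp --requires` on stdin and prints only the requirements on libraries
# that this thread describes as injected by the NVIDIA container stack.
# Exit status is 1 when no such requirement is found (grep's behavior).
driver_provided_requires() {
    # Matches e.g. "libcuda.so.1()(64bit)" and "libnvidia-ml.so.1()(64bit)",
    # but not "libcudart.so.11.0()(64bit)", which ships with the CUDA toolkit.
    grep -E '^(libcuda|libnvidia-ml)\.so\.[0-9]+\('
}

# Example usage (assumes the RPM file from this thread is in the current directory):
#   rpm -qp --requires ucx-cuda-1.14.0-1.el8.x86_64.rpm | driver_provided_requires
```

Any line this prints is a requirement that a package manager will try to satisfy from a repository at build time, even though the library is expected to arrive from the host driver at run time.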

If I recall correctly, there are stubs included in the nvidia/cuda images under /usr/local/cuda, so consider using these instead?
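A quick way to check whether such stubs are present in a given image is to search the toolkit directory for them. This is a sketch under assumptions: the function name `find_driver_stubs` is hypothetical, and it assumes the stubs sit in a directory named `stubs` somewhere under the toolkit root, which this thread does not confirm.

```shell
#!/bin/sh
# Hypothetical helper: lists stub copies of libcuda.so and libnvidia-ml.so
# under a CUDA toolkit root. The root defaults to /usr/local/cuda, the
# location mentioned above; pass a different directory as the first argument.
find_driver_stubs() {
    root="${1:-/usr/local/cuda}"
    # -L follows symlinks, since the toolkit root is often itself a symlink.
    # Only files inside a directory named "stubs" are reported (an assumption
    # about the layout, not something confirmed in this thread).
    find -L "$root" -type f -path '*/stubs/*' \
        \( -name 'libcuda.so' -o -name 'libnvidia-ml.so' \) 2>/dev/null | sort
}

# Example usage inside a container built from an nvidia/cuda image:
#   find_driver_stubs
```

If this prints nothing, the image likely does not ship the stubs, and a different base image or a different approach would be needed.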

It may still be a good idea to reach out to the package maintainers to relax this requirement for use in containerized environments.