NVIDIA / libnvidia-container

NVIDIA container runtime library
Apache License 2.0

nvidia-container-cli does not recognize runtimes for forward compatible upgrade #128

Open akinmr opened 3 years ago

akinmr commented 3 years ago

I'm using a computer with the 440.33.01 (CUDA 10.2) driver and the CUDA 11.0 forward-compatibility runtime (cuda-compat-*). Because of a driver dependency issue, we cannot upgrade the driver any further. I'd like to use an NGC container built against CUDA 11 on this computer, but singularity --nv loads the CUDA 10.2 runtime libraries from /usr/lib64/, ignoring the CUDA 11 runtime in a different path specified with LD_LIBRARY_PATH.

The cause seems to be that nvidia-container-cli does not honor LD_LIBRARY_PATH and instead takes the libraries from ld.so.cache, according to its debug output. As a result, the program inside the container crashes due to a missing symbol in libcuda.so.1.

Is it possible to have nvidia-container-cli pick up CUDA runtimes from a non-standard location specified with LD_LIBRARY_PATH?
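
For reference, the difference I'm describing can be checked on the host like this (a sketch; /path/to/cuda-compat/lib64 and ./my_cuda_app are placeholders for the actual layout here):

  # libcuda.so as recorded in ld.so.cache (what the debug log suggests nvidia-container-cli uses)
  ldconfig -p | grep libcuda
  # what the dynamic loader itself would resolve with the compat location on LD_LIBRARY_PATH
  LD_LIBRARY_PATH=/path/to/cuda-compat/lib64 ldd ./my_cuda_app | grep libcuda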

akinmr commented 3 years ago

command_output.txt debug_log.txt

Uploaded some command-line output and the debug log of nvidia-container-cli list.

klueska commented 3 years ago

Are you saying that you want libnvidia-container to inject the libcuda.so from a specific LD_LIBRARY_PATH instead of the libcuda.so installed by the driver? Or are you saying that libnvidia-container is just not finding the cuda-compat-* libs at all because they are in a non-standard location, and you need a way to tell libnvidia-container where they are so it will inject them?

akinmr commented 3 years ago

I'm expecting it to read LD_LIBRARY_PATH to find the libcuda.so location, as that is the recommended way to enable the compatibility libraries.

klueska commented 3 years ago

I'm still slightly confused as to what you have installed where.

Correct me if I'm wrong:

  1. You have CUDA 10.2 installed on the host at /usr/local/cuda
  2. You have cuda-compat installed on the host at /usr/local/cuda/compat
  3. You are running a CUDA 11.x container
  4. Inside the container you then run LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH <your binary>

This is the expected workflow and it should work.
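
As a minimal sketch of step 4 (<your binary> is just a stand-in for the actual application):

  # inside the CUDA 11.x container
  export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH
  <your binary>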

klueska commented 3 years ago

You are also on a very old version of libnvidia-container (1.0.5).

It's possible you could be running into the issue resolved here: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/44

This fix was included in v1.3.1
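
To confirm what is actually installed on the host, something along these lines should print the version (assuming an RPM-based install; the package name may differ per distribution):

  nvidia-container-cli --version
  rpm -q libnvidia-container-tools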

akinmr commented 3 years ago

Thanks for the info; I will ask our sysadmin to update it. Sorry for the confusion caused by the outdated installation.

akinmr commented 3 years ago

I tried version 1.3.3, but the situation did not change. command_output_133.txt debug_log_133.txt

You have CUDA 10.2 installed on the host at /usr/local/cuda

  • The CUDA 10.2 runtime is installed at /apps/t3/sles12sp2/cuda/10.2.89/
  • libcuda.so from driver 440.33.01 (= CUDA 10.2) is located under /usr/lib64/

You have cuda-compat installed on the host at /usr/local/cuda/compat

  • The contents of cuda-compat are located under /apps/t3/sles12sp2/cuda/11.0/
  • libcuda.so from cuda-compat is also under /apps/t3/sles12sp2/cuda/11.0/lib64/

You are running a CUDA 11.x container

  • Yes, I'm trying https://ngc.nvidia.com/catalog/containers/nvidia:hpc-benchmarks with Singularity 3.6.4

Inside the container you then run LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH <your binary>

  • I tried singularity shell --nv hpc-benchmarks\:20.10-hpl.sif and executed nvidia-smi after a failed execution with a symbol error.

Note that I'm using SLES 12 SP4 with the libnvidia-container-tools RPM for CentOS 7.
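
For completeness, the two libcuda.so copies described above can be listed directly (a sketch; the exact file names on this host may differ):

  ls -l /usr/lib64/libcuda.so*                          # from driver 440.33.01 (CUDA 10.2)
  ls -l /apps/t3/sles12sp2/cuda/11.0/lib64/libcuda.so*  # from cuda-compat (CUDA 11.0)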

elezar commented 3 years ago

Hi @akinmr.

When you say:

I tried singularity shell --nv hpc-benchmarks\:20.10-hpl.sif, and executed nvidia-smi, after failed execution with symbol error.

What is the LD_LIBRARY_PATH that is being used in the singularity shell? Could you run:

singularity shell --nv hpc-benchmarks\:20.10-hpl.sif
env

akinmr commented 3 years ago

env_mount_inside_container.txt

I executed env and mount inside the container (plus echo $LD_LIBRARY_PATH inside and outside). The file Singularity automatically mounts at /.singularity.d/libs/libcuda.so was identical to /usr/lib64/libcuda.so outside of the container, according to md5sum output.
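
Roughly the comparison described above (a sketch; the exact libcuda.so file name on the host may differ):

  # on the host
  md5sum /usr/lib64/libcuda.so.1
  # inside a shell started with singularity shell --nv
  md5sum /.singularity.d/libs/libcuda.so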

akinmr commented 3 years ago

Singularity picks the libraries to mount from the nvidia-container-cli output. https://sylabs.io/guides/3.7/user-guide/gpu.html#library-search-options
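
So the question is what nvidia-container-cli itself reports; a quick check of which libcuda.so it hands to Singularity (a sketch):

  nvidia-container-cli list | grep libcuda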

elezar commented 3 years ago

Are the compatibility libraries visible in the container? Could you explicitly set LD_LIBRARY_PATH in the container / shell to include the CUDA 11.0 compatibility library location and then run nvidia-smi?

The current setting:

LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2020.0.166/linux/mkl/lib/intel64_lin:/usr/local/pmix/lib:/usr/local/ucx/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs

would find neither the compatibility libraries nor the libraries included in the container.
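
Concretely, something along these lines inside the shell (a sketch, assuming the compat directory is visible inside the container at the same host path):

  export LD_LIBRARY_PATH=/apps/t3/sles12sp2/cuda/11.0/lib64:$LD_LIBRARY_PATH
  nvidia-smi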

Just a note, does the singularity version you're using (3.6.4 from the logs) include the following fix: https://github.com/hpcng/singularity/commit/3e26476fa0fe08a899e4f4029ab5546b3e77215f ?

akinmr commented 3 years ago

explicit_bind_libpath.txt

As pasted above, when I explicitly bind the path and add it to $LD_LIBRARY_PATH, the CUDA 11 libs are used as expected. What I expected is that Singularity automatically picks the libs under the host's LD_LIBRARY_PATH and binds them to /.singularity.d/*, with the help of nvidia-container-cli. BTW, the Singularity patch you mentioned seems to have been merged in v3.6.2 according to that GitHub page, so my Singularity already includes the fix.
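
For reference, the working invocation was roughly the following (a sketch reconstructed from the description; the exact commands may differ from what is in explicit_bind_libpath.txt):

  singularity shell --nv -B /apps/t3/sles12sp2/cuda/11.0 hpc-benchmarks:20.10-hpl.sif
  # then, inside the shell
  export LD_LIBRARY_PATH=/apps/t3/sles12sp2/cuda/11.0/lib64:$LD_LIBRARY_PATH
  nvidia-smi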