Open · akinmr opened this issue 3 years ago
command_output.txt debug_log.txt
I have uploaded some command-line output and the debug log of nvidia-container-cli list.
Are you saying that you want libnvidia-container to inject the libcuda.so from a specific LD_LIBRARY_PATH instead of the libcuda.so installed by the driver? Or are you saying that libnvidia-container is just not finding the cuda-compat-* libs at all because they are in a non-standard location, and you need a way to tell libnvidia-container where they are so it will inject them?
I'm expecting it to read LD_LIBRARY_PATH to find the location of libcuda.so, as that is the recommended way to enable the compatibility libraries.
I'm still slightly confused as to what you have installed where.
Correct me if I'm wrong:
LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH <your binary>
This is the expected workflow and it should work.
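As a rough illustration of why prepending the compat directory works, here is a sketch of the dynamic loader's first-match search over LD_LIBRARY_PATH. The /tmp directories are stand-ins for /usr/local/cuda/compat and /usr/lib64, not the real host paths:

```shell
# Sketch only: emulate ld.so's first-match walk over LD_LIBRARY_PATH.
# /tmp/compat and /tmp/driver stand in for the compat and driver
# library directories (assumption for illustration).
mkdir -p /tmp/compat /tmp/driver
touch /tmp/compat/libcuda.so.1 /tmp/driver/libcuda.so.1

LD_LIBRARY_PATH=/tmp/compat:/tmp/driver
for dir in $(echo "$LD_LIBRARY_PATH" | tr ':' '\n'); do
    if [ -e "$dir/libcuda.so.1" ]; then
        echo "would load: $dir/libcuda.so.1"   # first match wins
        break
    fi
done
# -> would load: /tmp/compat/libcuda.so.1
```

Because the compat directory comes first, its libcuda.so.1 shadows the driver's copy for every process started with that environment.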
You are also on a very old version of libnvidia-container (1.0.5). It's possible you could be running into the issue resolved here: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/44
This fix was included in v1.3.1.
Thanks for the info; I will ask our sysadmin to update it. Sorry for the confusion caused by the outdated installation.
I tried version 1.3.3, but the situation did not change. command_output_133.txt debug_log_133.txt
> You have CUDA 10.2 installed on the host at /usr/local/cuda

- The CUDA 10.2 runtime is installed at /apps/t3/sles12sp2/cuda/10.2.89/
- libcuda.so from driver 440.33.01 (= CUDA 10.2) is located under /usr/lib64/

> You have cuda-compat installed on the host at /usr/local/cuda/compat

- The contents of cuda-compat are located under /apps/t3/sles12sp2/cuda/11.0/
- libcuda.so from cuda-compat is also under /apps/t3/sles12sp2/cuda/11.0/lib64/

> You are running a CUDA 11.x container

- Yes, I'm trying https://ngc.nvidia.com/catalog/containers/nvidia:hpc-benchmarks with Singularity 3.6.4

> Inside the container you then run LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH

- I tried singularity shell --nv hpc-benchmarks\:20.10-hpl.sif and then executed nvidia-smi, which failed with a symbol error.
Note that I'm running SLES 12 SP4 but using the libnvidia-container-tools RPM for CentOS 7.
Hi @akinmr.
When you say:

> I tried singularity shell --nv hpc-benchmarks\:20.10-hpl.sif, and executed nvidia-smi, after failed execution with symbol error.

What is the LD_LIBRARY_PATH that is being used in the singularity shell? Could you run:

singularity shell --nv hpc-benchmarks\:20.10-hpl.sif
env
env_mount_inside_container.txt
I executed env and mount inside the container (plus echo $LD_LIBRARY_PATH both inside and outside). The file Singularity automatically mounts at /.singularity.d/libs/libcuda.so was identical to /usr/lib64/libcuda.so outside of the container, according to md5sum.
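The md5sum identity check works with any pair of files; here is a trivially self-contained version using stub files in /tmp (stand-ins for the host and bind-mounted libcuda.so, not the real libraries):

```shell
# Self-contained sketch of the identity check: a bind-mounted copy of a
# file must produce the same md5sum as the host original.
printf 'stub libcuda contents\n' > /tmp/host_libcuda.so
cp /tmp/host_libcuda.so /tmp/bound_libcuda.so

# Count the distinct checksums across both files; 1 means byte-identical.
md5sum /tmp/host_libcuda.so /tmp/bound_libcuda.so | awk '{print $1}' | sort -u | wc -l
# -> 1
```

A count of 1 here is exactly what the report above observed: the container sees the host's CUDA 10.2 libcuda.so, not the compat one.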
Singularity picks the libraries to bind-mount from the output of nvidia-container-cli. https://sylabs.io/guides/3.7/user-guide/gpu.html#library-search-options
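To make that concrete, here is a sketch of the kind of per-file list a runtime consumes. The sample lines below are made up for illustration; real output comes from nvidia-container-cli list:

```shell
# Hypothetical sample of `nvidia-container-cli list`-style output
# (paths invented for illustration):
cat > /tmp/ncc_list.txt <<'EOF'
/usr/bin/nvidia-smi
/usr/lib64/libcuda.so.440.33.01
/usr/lib64/libnvidia-ml.so.440.33.01
EOF

# A runtime like Singularity would bind-mount each listed library
# into /.singularity.d/libs/:
grep '\.so' /tmp/ncc_list.txt
```

The key point for this issue: if the compat libcuda.so never appears in that list, Singularity has no way to mount it, regardless of the host's LD_LIBRARY_PATH.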
Are the compatibility libraries visible in the container? Could you explicitly set LD_LIBRARY_PATH in the container/shell to include the CUDA 11.0 compatibility library location and then run nvidia-smi?
The current setting:
LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2020.0.166/linux/mkl/lib/intel64_lin:/usr/local/pmix/lib:/usr/local/ucx/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
would not find either the compatibility libraries or the libraries included in the container.
Just a note, does the singularity version you're using (3.6.4 from the logs) include the following fix: https://github.com/hpcng/singularity/commit/3e26476fa0fe08a899e4f4029ab5546b3e77215f ?
explicit_bind_libpath.txt
As pasted above, when I explicitly bind the path and add it to $LD_LIBRARY_PATH, the CUDA 11 libs are used as expected. What I expected is that Singularity automatically picks the libs under the host's LD_LIBRARY_PATH and binds them to /.singularity.d/*, with the help of nvidia-container-cli. BTW, the singularity patch you mentioned appears to have been merged in v3.6.2 according to that GitHub page, so my singularity already includes the fix.
I'm using a computer with the 440.33.01 (CUDA 10.2) driver and the CUDA 11.0 compatibility runtime (cuda-compat-*). Because of a driver dependency issue, we cannot upgrade the CUDA driver any further. I'd like to use an NGC container, which is built against CUDA 11, on this computer, but singularity --nv loads the runtime lib for CUDA 10.2 under /usr/lib64/, ignoring the CUDA 11 runtime under a different path specified with LD_LIBRARY_PATH. The cause of this issue seems to be that nvidia-container-cli doesn't honor LD_LIBRARY_PATH but gets the libs from ld.so.cache, according to its debug output. As a result, the program inside the container crashes due to a missing symbol in libcuda.so.1. Is it possible to let nvidia-container-cli pick CUDA runtimes in a non-standard location specified with LD_LIBRARY_PATH?
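To make the distinction concrete: a library reachable only through LD_LIBRARY_PATH never appears in ld.so.cache, which (per the debug output) is what nvidia-container-cli consults. A small sketch, with /tmp/compat-demo as a stand-in for the real compat directory:

```shell
# Sketch: ld.so.cache vs. LD_LIBRARY_PATH visibility.
mkdir -p /tmp/compat-demo
touch /tmp/compat-demo/libcuda.so.1
export LD_LIBRARY_PATH=/tmp/compat-demo

# The cache is built by ldconfig from standard dirs and ld.so.conf; it
# knows nothing about directories that only exist in LD_LIBRARY_PATH:
ldconfig -p 2>/dev/null | grep -q compat-demo || echo "not in ld.so.cache"

# A loader-style search over LD_LIBRARY_PATH does find the library:
[ -e "$LD_LIBRARY_PATH/libcuda.so.1" ] && echo "found via LD_LIBRARY_PATH"
```

That gap is exactly the behavior reported here: anything resolvable only via LD_LIBRARY_PATH is invisible to a tool that inspects the cache alone.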