Closed dmagdavector closed 3 months ago
Hi @dmagdavector, managing links would still need to be done, as this is where the CUDA libraries will be found by our software stack.
CUDA libraries need to be found for CUDA executable to be found, and OS libraries must not be found to avoid incompatibilities (things like different glibc versions), so CUDA and OS libraries must be findable in disjoint directories.
More information available here in issues https://github.com/ComputeCanada/software-stack/issues/58 and https://github.com/ComputeCanada/software-stack/issues/79
CUDA libraries need to be found for CUDA executable to be found, and OS libraries must not be found to avoid incompatibilities (things like different glibc versions) […]
Isn't this what rpath is for?
From the example section:
$ cc -shared -Wl,-soname,termcap.so.4,-rpath,/lib/termcap.so.4 -o termcap.so.4
$ objdump -x termcap.so.4
  NEEDED      libc.so.6
  SONAME      termcap.so.4
  RPATH       /lib/termcap.so.4
In this example, GNU or Sun ld (ld.so) will REFUSE to load termcap for a program needing it unless the file termcap.so is in /lib/ and named termcap.so.4. LD_LIBRARY_PATH is ignored. […]
Would it be possible to say "for libcuda.so.1 use Path X and for libwhatever.so.2 use Path Y"?
Yes, but we can't have /usr/lib64 in the RPATH, otherwise it will also find all other system libraries. RPATH is not specific to each dependent library.
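To make the point concrete: every directory in a binary's RPATH or RUNPATH is searched for every NEEDED library, so one way to check a binary is to list those directories and see whether a system directory has slipped in. A minimal sketch, assuming the output format of `readelf -d` from GNU binutils; the helper names `rpath_dirs` and `rpath_contains` are hypothetical and not part of any tool mentioned in this thread.

```shell
# Read `readelf -d` output on stdin and print each RPATH/RUNPATH
# directory on its own line. (Hypothetical helper.)
rpath_dirs() {
    sed -nE 's/.*\((RPATH|RUNPATH)\).*\[(.*)\].*/\2/p' | tr ':' '\n'
}

# Succeed if directory $1 appears in the RPATH/RUNPATH read from stdin.
rpath_contains() {
    rpath_dirs | grep -Fxq -- "$1"
}

# Usage (the binary name is a placeholder):
#   readelf -d ./my_cuda_app | rpath_contains /usr/lib64 \
#       && echo "system directory present in RPATH/RUNPATH"
```

If `/usr/lib64` shows up in that list, the loader will consider everything in `/usr/lib64` for that binary, not just the one library it was added for.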
There are many things going on here, it's complicated!
[…] the /proc method will not work for that.
[…] dlopen libcuda.so, so libcuda.so needs to be found somewhere in the standard search path; in our case this is RPATH, LD_LIBRARY_PATH, RUNPATH, libraries listed in ldconfig -p, $EPREFIX/usr/lib64, /usr/lib64/nvidia, in that order. We are not using RPATH=/usr/lib64/nvidia, since it's already there (configured via the Gentoo Prefix glibc's compiled-in search paths).
You do not have to use /usr/lib64/nvidia, but you then need to put the directory you are using in LD_LIBRARY_PATH, and bypass some of our lmod logic by setting RSNT_CUDA_DRIVER_VERSION, e.g. you could do something like (untested):
export RSNT_CUDA_DRIVER_VERSION=$(cat /proc/driver/nvidia/version)
export LD_LIBRARY_PATH=<path where you have libcuda.so>
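One caveat on the untested export above: /proc/driver/nvidia/version usually holds a multi-line banner (an "NVRM version:" line and a "GCC version:" line) rather than a bare number, so `cat` would put the whole banner into the variable. A minimal sketch of extracting just the version, assuming the NVRM line carries it as a dotted number and assuming RSNT_CUDA_DRIVER_VERSION wants only that number (neither is confirmed in this thread); the function name `nvidia_driver_version` is hypothetical.

```shell
# Print the first dotted number (e.g. 535.104.05) found on the
# "NVRM version:" line of FILE. FILE defaults to the proc entry.
# (Hypothetical helper; the banner layout is an assumption.)
nvidia_driver_version() {
    awk '/^NVRM version:/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+(\.[0-9]+)+$/) { print $i; exit }
    }' "${1:-/proc/driver/nvidia/version}"
}

# Usage (untested against the lmod logic):
#   export RSNT_CUDA_DRIVER_VERSION=$(nvidia_driver_version)
```

Scanning for the first dotted token, rather than a fixed field position, is meant to tolerate differences in the banner wording between driver builds.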
/usr/lib64/nvidia isn't solely used for libcuda.so but also for EGL and OpenGL (GLX), when e.g. running VirtualGL or ParaView on GPU nodes; EGL and glvnd mechanisms look for e.g. libEGL_nvidia.so.0, which in turn is linked to a number of other libraries there.
I will add that, while you can set export LD_LIBRARY_PATH=<path where you have libcuda.so>, this absolutely can NOT be /usr/lib64 (i.e. you can't set LD_LIBRARY_PATH=/usr/lib64); that will break everything.
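The warning about /usr/lib64 can be turned into a small guard around the export. A minimal sketch: the function name `prepend_driver_dir` is hypothetical, only /usr/lib64 is named in this thread, and the other refused directories are an added assumption.

```shell
# Prepend a driver directory to LD_LIBRARY_PATH, refusing system
# library directories. (Hypothetical helper.)
prepend_driver_dir() {
    dir=${1%/}    # drop one trailing slash, if any
    case "$dir" in
        # Only /usr/lib64 is named in the thread; the rest are an assumption.
        /usr/lib64|/usr/lib|/lib64|/lib)
            echo "refusing to add system directory: $dir" >&2
            return 1
            ;;
    esac
    # Require that the directory actually holds the driver library.
    if [ ! -e "$dir/libcuda.so.1" ] && [ ! -e "$dir/libcuda.so" ]; then
        echo "no libcuda.so found in $dir" >&2
        return 1
    fi
    export LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Usage (the path is a placeholder):
#   prepend_driver_dir /path/where/you/have/libcuda
```

The guard only compares the literal path, so a symlink that resolves to /usr/lib64 would still get through.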
In the SitePackage.lua file there is a check for libcuda.so:
However, as mentioned in the documentation:
this path is not used in more recent Nvidia drivers, and so after installing the driver packages, links need to be created so that the DRAC/CC can do its auto-detection magic.
It seems that Nvidia has an official way to check what driver is loaded into a kernel, and that is through /proc/driver/nvidia/version. It is present in most recent drivers:
and dates back to (at least) the 71.86.01 driver released in 2007 (search for "app-o"):
Could this mechanism be used instead so that managing links would not have to be done?