CUDA driver check mechanism

dmagdavector commented 3 months ago

In the SitePackage.lua file there is a check for libcuda.so:

https://github.com/ComputeCanada/software-stack-config/blob/main/lmod/SitePackage.lua#L203

However, as mentioned in the documentation:

https://docs.alliancecan.ca/wiki/Accessing_CVMFS#CUDA_location

this path is not used by more recent Nvidia drivers, so after installing the driver packages, links need to be created so that the DRAC/CC software stack can do its auto-detection magic.
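For illustration, the link-creation step might look roughly like the sketch below. The function name `link_nvidia_libs`, the `libcuda*`/`libnvidia*` patterns and the choice of source directory are assumptions made for this example, not taken from the linked documentation; `/usr/lib64/nvidia` is the directory discussed in this thread.

```shell
# Sketch: expose only the NVIDIA driver libraries in a dedicated directory.
# Usage: link_nvidia_libs SRC_DIR DEST_DIR
#   SRC_DIR  - absolute path where the driver packages put the libraries (assumed)
#   DEST_DIR - directory searched by the software stack, e.g. /usr/lib64/nvidia
link_nvidia_libs() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for lib in "$src"/libcuda* "$src"/libnvidia*; do
        # An unmatched glob is left as a literal string; skip it.
        [ -e "$lib" ] || continue
        # Create (or refresh) a symlink named after the library inside DEST_DIR.
        ln -sf "$lib" "$dest/"
    done
}
```

Something like `link_nvidia_libs /usr/lib64 /usr/lib64/nvidia` (run as root) would then leave the driver libraries, and nothing else from the OS, visible in the target directory.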

It seems that Nvidia has an official way to check which driver is loaded into the kernel, and that is through /proc/driver/nvidia/version. It is present in recent drivers:

https://download.nvidia.com/XFree86/Linux-x86_64/555.58.02/README/procinterface.html

and dates back to (at least) the 71.86.01 driver released in 2007 (search for "app-o"):

https://download.nvidia.com/XFree86/Linux-x86_64/71.86.01/README/readme.txt

Could this mechanism be used instead so that managing links would not have to be done?
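As a rough sketch of what such a check could look like: the function below reads the version file and prints the first dotted version number on the `NVRM version:` line. The function name `nvidia_driver_version` is made up for this example, and the assumed line layout (a banner with the version number as a separate whitespace-delimited field) is based on typical driver output, not on anything in SitePackage.lua.

```shell
# Sketch: print the loaded NVIDIA kernel driver version, or fail if unknown.
# Usage: nvidia_driver_version [FILE]   (FILE defaults to the /proc entry)
nvidia_driver_version() {
    file=${1:-/proc/driver/nvidia/version}
    # No readable file means no NVIDIA kernel module is loaded on this host.
    [ -r "$file" ] || return 1
    # Assumed banner layout, e.g.:
    #   NVRM version: NVIDIA UNIX x86_64 Kernel Module  555.58.02  <build date>
    # Take the first field that looks like N.N or N.N.N.
    version=$(awk '/^NVRM version:/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) { print $i; exit }
    }' "$file")
    [ -n "$version" ] || return 1
    printf '%s\n' "$version"
}
```

The file argument is there only so the parsing can be tried against a saved copy of the banner on a machine without the driver.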

mboisson commented 3 months ago

Hi @dmagdavector, managing links would still need to be done, as this is where the CUDA libraries will be found by our software stack.

CUDA libraries need to be found for CUDA executable to be found, and OS libraries must not be found to avoid incompatibilities (things like different glibc versions), so CUDA and OS libraries must be findable in disjoint directories.

More information is available in issues https://github.com/ComputeCanada/software-stack/issues/58 and https://github.com/ComputeCanada/software-stack/issues/79

dmagdavector commented 3 months ago

> CUDA libraries need to be found for CUDA executable to be found, and OS libraries must not be found to avoid incompatibilities (things like different glibc versions) […]

Isn't this what rpath is for?

https://en.wikipedia.org/wiki/Rpath

From the example section:

>     $ cc -shared -Wl,-soname,termcap.so.4,-rpath,/lib/termcap.so.4 -o termcap.so.4
>
>     $ objdump -x termcap.so.4
>     NEEDED libc.so.6
>     SONAME termcap.so.4
>     RPATH /lib/termcap.so.4
>
> In this example, GNU or Sun ld (ld.so) will REFUSE to load termcap for a program needing it unless the file termcap.so is in /lib/ and named termcap.so.4. LD_LIBRARY_PATH is ignored. […]

Would it be possible to say "for libcuda.so.1 use Path X and for libwhatever.so.2 use Path Y"?

mboisson commented 3 months ago

Yes, but we can't have /usr/lib64 in the RPATH, otherwise it will also find all other system libraries. RPATH is not specific to each dependent library.

bartoldeman commented 3 months ago

There are many things going on here, it's complicated!

- On our clusters the login nodes do not have GPUs, but we still want the logic to apply there, to see whether a specific CUDA toolkit module version is usable or not. The /proc method will not work for that.
- Many programs and libraries dlopen libcuda.so, so libcuda.so needs to be found somewhere in the standard search path. In our case this is RPATH, LD_LIBRARY_PATH, RUNPATH, libraries listed in ldconfig -p, $EPREFIX/usr/lib64, /usr/lib64/nvidia, in that order. We are not using RPATH=/usr/lib64/nvidia, since it's already there (configured via the Gentoo Prefix glibc's compiled-in search paths).
- So if you mount our cvmfs yourself, you can get away with not using /usr/lib64/nvidia, but you then need to put the directory you are using in LD_LIBRARY_PATH and bypass some of our lmod logic by setting RSNT_CUDA_DRIVER_VERSION, e.g. you could do something like (untested):
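```
# Untested. Note: this file holds a multi-line banner, so the value may need trimming to the bare version.
export RSNT_CUDA_DRIVER_VERSION="$(cat /proc/driver/nvidia/version)"
# Placeholder: replace with the directory that holds libcuda.so.
export LD_LIBRARY_PATH=<path where you have libcuda.so>
```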
Note that /usr/lib64/nvidia isn't solely used for libcuda.so but also for EGL and OpenGL (GLX), when e.g. running VirtualGL or ParaView on GPU nodes; EGL and glvnd mechanisms look for e.g. libEGL_nvidia.so.0, which in turn is linked to a number of other libraries there.
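To see which copy of a library an ordered directory search like the one described above would pick up, a small helper along these lines can be handy. It is only an approximation: `first_lib_hit` is a hypothetical name, and the sketch walks a plain list of directories without modelling RPATH/RUNPATH entries embedded in binaries or the ldconfig cache.

```shell
# Sketch: print the first DIR/LIBNAME that exists, searching DIRs in order.
# Usage: first_lib_hit LIBNAME DIR...
first_lib_hit() {
    name=$1
    shift
    for dir in "$@"; do
        if [ -e "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    # Not found in any of the given directories.
    return 1
}
```

For example, `first_lib_hit libcuda.so.1 /some/custom/dir /usr/lib64/nvidia` (the first directory being a stand-in for whatever is in LD_LIBRARY_PATH) prints the path that wins, or returns non-zero if neither directory has the library.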

mboisson commented 3 months ago

I will add that, while you can set export LD_LIBRARY_PATH=<path where you have libcuda.so>, this absolutely can NOT be /usr/lib64 (i.e. you can't set LD_LIBRARY_PATH=/usr/lib64; that will break everything).
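A small guard can make that mistake harder to commit. The following is a minimal sketch, assuming a hypothetical helper named `check_cuda_lib_dir`; it only rejects the literal path /usr/lib64 (with or without a trailing slash) and does not detect symlinks or bind mounts that resolve to it.

```shell
# Sketch: validate a directory before putting it in LD_LIBRARY_PATH.
# Usage: check_cuda_lib_dir DIR   (prints DIR on success)
check_cuda_lib_dir() {
    dir=${1%/}    # drop a single trailing slash, if any
    case $dir in
        /usr/lib64)
            echo "refusing to use /usr/lib64: it would expose all OS libraries" >&2
            return 1
            ;;
    esac
    # Require at least one libcuda.so* entry in the directory.
    set -- "$dir"/libcuda.so*
    if [ ! -e "$1" ]; then
        echo "no libcuda.so found in $dir" >&2
        return 1
    fi
    printf '%s\n' "$dir"
}
```

It could be used as `dir=$(check_cuda_lib_dir /path/to/driver/libs) && export LD_LIBRARY_PATH=$dir`, where the path is a placeholder for wherever libcuda.so actually lives.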

ComputeCanada / software-stack-config

CUDA driver check mechanism #92