For comparison, this works fine:
$ PATH=/usr/sbin:$PATH singularity exec --nv pytorch.simg bash
Singularity> exit
So, having a different command called ldconfig somewhere in the PATH produces the error.
Can you let us know what type of environment is being used that provides its own ldconfig, to better understand what workflow this is arising in?
Singularity did begin sanitizing PATH in 3.1. This was further adjusted in 3.2 via https://github.com/sylabs/singularity/commit/5646ae764dddce5a1493de584884f70653f71bf2, as blanket sanitizing of the PATH caused issues for the plugin support. The GPU code has also been modified since then.
Because we bind in CUDA- and ROCm-related binaries that are searched for on PATH, we can't use a sanitized PATH for all of the GPU bind search. The CUDA/ROCm binaries may be in non-standard locations or provided by modules in some setups, so we need to find them in the user PATH, not a minimal PATH.
We would have to sanitize PATH for only the ldconfig call if we want to address this issue.
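Conceptually, something like the following minimal sketch - the directory list is an assumption about what a sanitized path would contain, not the actual implementation:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// findSanitizedLdconfig searches only a fixed set of system directories,
// ignoring the user's PATH entirely. The directory list is illustrative,
// not what Singularity actually uses.
func findSanitizedLdconfig() (string, error) {
	for _, dir := range []string{"/sbin", "/usr/sbin", "/bin", "/usr/bin"} {
		candidate := filepath.Join(dir, "ldconfig")
		if info, err := os.Stat(candidate); err == nil && !info.IsDir() {
			return candidate, nil
		}
	}
	return "", fmt.Errorf("ldconfig not found in sanitized path")
}

func main() {
	ldconfig, err := findSanitizedLdconfig()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(ldconfig) // e.g. /usr/sbin/ldconfig on CentOS 7
}

The rest of the GPU bind search would still use the user PATH; only the ldconfig lookup would be restricted this way.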
Hi. We provide an OS-independent Linux layer based on Nix: https://github.com/ComputeCanada/nixpkgs
This effectively means that the user space of our systems runs NixOS, in which ldconfig does not quite work; it is not usually required to work, because NixOS sets the RPATH for all binaries that are compiled within it.
$ which ldconfig
/cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/ldconfig
$ ldconfig -p
ldconfig: Can't open cache file /cvmfs/soft.computecanada.ca/nix/store/63pk88rnmkzjblpxydvrmskkc8ci7cx6-glibc-2.24/etc/ld.so.cache: No such file or directory
The underlying OS (i.e. the privileged space) is CentOS 7 and has a working ldconfig, but Singularity finds the one in the PATH of the user's session instead of the one provided by the operating system.
Another thing to consider here: there are distributions like Nix where an ldconfig may not be in standard paths.
ldconfig -p on NixOS is known not to work: https://github.com/NixOS/nixpkgs/issues/35387#issuecomment-367983818
@mboisson - got it. I was just thinking about Nix and Guix - as the above comment suggests.
The issue is that we have two unusual situations which are mutually incompatible:
1) If Nix/Guix is used as a package manager on top of a base distro, then you need to look in a sanitized PATH so you get the base distro's ldconfig and not the one in the Nix or Guix store.
2) If Nix/Guix is used as the base distro, then looking in a sanitized PATH for ldconfig would break things. There is no /bin/ldconfig; we have to use the one in PATH, as it lives somewhere in the Nix/Guix store.
These are both rather uncommon situations - and I'm not sure which to favor tbh.
I have never used Nix as a base OS; I would assume that /bin/ldconfig -p still does not work on that platform.
How about having a configuration option for source RPM builds that will or will not sanitize, or that will hard-code the location of the ldconfig to use?
Sorry - clicked wrong button there.
> I have never used Nix as a base OS, I would assume that ldconfig -p still does not work.
That looks likely. Not sure about Guix, and we need to consider that for any possibly conflicting change, as it's known to be used in HPC environments where Singularity is also used.
> How about having a configuration option for source RPM builds that will or will not sanitize, or that will hard-code the location of the ldconfig to use?
I would rather follow the pattern we have of an optional entry in singularity.conf for an explicit path to the executable we need to call (e.g. cryptsetup has an existing option), and not hard-code it into the source via mconfig.
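For illustration, the existing cryptsetup entry and a hypothetical ldconfig entry following the same pattern would look something like this in singularity.conf (the ldconfig line is an assumption - no such option exists today):

# existing option: explicit path to the cryptsetup executable
cryptsetup path = /usr/sbin/cryptsetup

# hypothetical option following the same pattern (does not exist today)
ldconfig path = /sbin/ldconfig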
I think we'd gladly accept a patch for this. It's not something we'd necessarily prioritize for 3.5.3 given the timeline, and that this is quite a niche situation.
> I think we'd gladly accept a patch for this. It's not something we'd necessarily prioritize for 3.5.3 given the timeline, and that this is quite a niche situation.
That is unfortunate. This affects all of Compute Canada (i.e. all clusters in Canada). However, we don't have enough knowledge of the Singularity code base to implement this ourselves.
@mboisson - understood, and I'll ensure it is discussed.
An alternative may be for you to install nvidia-container-cli - if that is found, it'll be used to identify the binds instead of an ldconfig -p search.
Can you work around this by provisioning an alias, or a wrapper script for singularity that modifies $PATH and sits further up in $PATH, so it will run instead of the 'real' singularity binary?
Thanks. I will also have a look at nvidia-container-cli, but getting things installed on all our clusters is somewhat of a hurdle. That's why we have the Nix compatibility layer, to be independent from the exact state of the OS. We may also consider defining:
alias singularity='PATH=/usr/sbin:$PATH /path/to/singularity'
Mmm, so I tried nvidia-container-cli, and it still does not work:
$ which nvidia-container-cli
~/nvidia-container-cli
$ nvidia-container-cli info
NVRM version: 440.33.01
CUDA version: 10.2
Device Index: 0
Device Minor: 2
Model: Tesla V100-SXM2-16GB
Brand: Tesla
GPU UUID: GPU-b3a1a3ae-50ff-7bf1-4059-b271fe8fa941
Bus Location: 00000000:1d:00.0
Architecture: 7.0
$ nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia2
/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
/usr/lib64/libnvidia-ml.so.440.33.01
/usr/lib64/libnvidia-cfg.so.440.33.01
/usr/lib64/libcuda.so.440.33.01
/usr/lib64/libnvidia-opencl.so.440.33.01
/usr/lib64/libnvidia-ptxjitcompiler.so.440.33.01
/usr/lib64/libnvidia-fatbinaryloader.so.440.33.01
/usr/lib64/libnvidia-compiler.so.440.33.01
/usr/lib64/libnvidia-encode.so.440.33.01
/usr/lib64/libnvidia-opticalflow.so.440.33.01
/usr/lib64/libnvcuvid.so.440.33.01
/usr/lib64/libnvidia-eglcore.so.440.33.01
/usr/lib64/libnvidia-glcore.so.440.33.01
/usr/lib64/libnvidia-tls.so.440.33.01
/usr/lib64/libnvidia-glsi.so.440.33.01
/usr/lib64/libnvidia-fbc.so.440.33.01
/usr/lib64/libnvidia-ifr.so.440.33.01
/usr/lib64/libnvidia-rtcore.so.440.33.01
/usr/lib64/libnvoptix.so.440.33.01
/usr/lib64/libGLX_nvidia.so.440.33.01
/usr/lib64/libEGL_nvidia.so.440.33.01
/usr/lib64/libGLESv2_nvidia.so.440.33.01
/usr/lib64/libGLESv1_CM_nvidia.so.440.33.01
/usr/lib64/libnvidia-glvkspirv.so.440.33.01
/usr/lib64/libnvidia-cbl.so.440.33.01
$ singularity exec --nv pytorch.simg bash
WARNING: Unable to capture nv bind points: could not execute ldconfig: exit status 1
Singularity> exit
exit
Checking the source, ldconfig -p is used to resolve the libraries listed either in the configuration file nvliblist.conf or in the output of nvidia-container-cli, so it is always executed. What happened is that commit b59fa15e29 changed the logic in paths.go so it no longer tries the sanitized path first but always looks in the user path (for both nvidia-container-cli and nvliblist.conf).
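In other words, as I read it, the lookup went from "sanitized locations first, then the user PATH" to "user PATH only". A simplified sketch, not the actual paths.go code (the directory list is illustrative):

package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// lookupBefore sketches the old behaviour: fixed system directories
// first, then the user PATH as a fallback.
func lookupBefore(name string) (string, error) {
	for _, dir := range []string{"/sbin", "/usr/sbin"} {
		p := filepath.Join(dir, name)
		if _, err := os.Stat(p); err == nil {
			return p, nil
		}
	}
	return exec.LookPath(name)
}

// lookupAfter sketches the new behaviour: the user PATH only, which is
// why the Nix ldconfig now wins over /usr/sbin/ldconfig.
func lookupAfter(name string) (string, error) {
	return exec.LookPath(name)
}

func main() {
	if p, err := lookupBefore("ldconfig"); err == nil {
		fmt.Println("before:", p)
	}
	if p, err := lookupAfter("ldconfig"); err == nil {
		fmt.Println("after: ", p)
	}
}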
> so it no longer tries the sanitized path first but always looks in the user path
I think this was intentional. There was at least one edge case I can remember where the system ldconfig wouldn't give full output, but a different ldconfig not in the system path would.
The error you seem to be getting is that ldconfig is returning a 1:
could not execute ldconfig: exit status 1
I can purposefully cause ldconfig to fail if I restrict the permissions on /etc/ld.so.cache... I'm not sure what else could cause ldconfig to exit with a 1.
What do you get if, from a shell, you run:
ldconfig -p | grep -i 'nvidia'
Does it look about right?
Yes, that is because ldconfig is broken by design with Nix; it cannot be relied upon. That's why the path needs to be sanitized.
$ ldconfig -p | grep -i 'nvidia'
ldconfig: Can't open cache file /cvmfs/soft.computecanada.ca/nix/store/63pk88rnmkzjblpxydvrmskkc8ci7cx6-glibc-2.24/etc/ld.so.cache: No such file or directory
$ /usr/sbin/ldconfig -p | grep -i 'nvidia'
libnvidia-tls.so.440.33.01 (libc6,x86-64, OS ABI: Linux 2.3.99) => /lib64/libnvidia-tls.so.440.33.01
libnvidia-rtcore.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-rtcore.so.440.33.01
libnvidia-ptxjitcompiler.so.1 (libc6,x86-64) => /lib64/libnvidia-ptxjitcompiler.so.1
libnvidia-ptxjitcompiler.so (libc6,x86-64) => /lib64/libnvidia-ptxjitcompiler.so
libnvidia-opticalflow.so.1 (libc6,x86-64) => /lib64/libnvidia-opticalflow.so.1
libnvidia-opencl.so.1 (libc6,x86-64) => /lib64/libnvidia-opencl.so.1
libnvidia-ml.so.1 (libc6,x86-64) => /lib64/libnvidia-ml.so.1
libnvidia-ml.so (libc6,x86-64) => /lib64/libnvidia-ml.so
libnvidia-ifr.so.1 (libc6,x86-64) => /lib64/libnvidia-ifr.so.1
libnvidia-ifr.so (libc6,x86-64) => /lib64/libnvidia-ifr.so
libnvidia-glvkspirv.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-glvkspirv.so.440.33.01
libnvidia-glsi.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-glsi.so.440.33.01
libnvidia-glcore.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-glcore.so.440.33.01
libnvidia-fbc.so.1 (libc6,x86-64) => /lib64/libnvidia-fbc.so.1
libnvidia-fbc.so (libc6,x86-64) => /lib64/libnvidia-fbc.so
libnvidia-fatbinaryloader.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-fatbinaryloader.so.440.33.01
libnvidia-encode.so.1 (libc6,x86-64) => /lib64/libnvidia-encode.so.1
libnvidia-encode.so (libc6,x86-64) => /lib64/libnvidia-encode.so
libnvidia-eglcore.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-eglcore.so.440.33.01
libnvidia-compiler.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-compiler.so.440.33.01
libnvidia-cfg.so.1 (libc6,x86-64) => /lib64/libnvidia-cfg.so.1
libnvidia-cfg.so (libc6,x86-64) => /lib64/libnvidia-cfg.so
libnvidia-cbl.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-cbl.so.440.33.01
libnvidia-allocator.so.1 (libc6,x86-64) => /lib64/libnvidia-allocator.so.1
libGLX_nvidia.so.0 (libc6,x86-64) => /lib64/libGLX_nvidia.so.0
libGLESv2_nvidia.so.2 (libc6,x86-64) => /lib64/libGLESv2_nvidia.so.2
libGLESv1_CM_nvidia.so.1 (libc6,x86-64) => /lib64/libGLESv1_CM_nvidia.so.1
libEGL_nvidia.so.0 (libc6,x86-64) => /lib64/libEGL_nvidia.so.0
@dctrud @cclerget So... in cmd/internal/cli/actions_linux.go, around line 256, have something like:
userPath := ""
if !sanitize {
	// keep the user's original PATH only when not sanitizing
	userPath = os.Getenv("USER_PATH")
}
Then have a --sanitize CLI option, or something, that will keep the sanitized PATHs. I'm sure this could cause a headache where you need something from the sanitized PATH before the user PATH, but also need something from the user PATH, which this would ignore. Without changing the order in which the PATHs get set on the system... I don't really see a way around this specific issue. Do you see something I'm not seeing?
@mboisson Does the following work to force the Nix ldconfig to use the cache from the host?
ldconfig -r / -p | grep -i 'nvidia'
No, it still fails:
$ ldconfig -r / -p | grep -i 'nvidia'
ldconfig: Can't open cache file /cvmfs/soft.computecanada.ca/nix/store/63pk88rnmkzjblpxydvrmskkc8ci7cx6-glibc-2.24/etc/ld.so.cache: No such file or directory
And with ldconfig -C /etc/ld.so.cache -p?
Yes, this works:
$ ldconfig -C /etc/ld.so.cache -p | grep nvidia
libnvidia-tls.so.440.33.01 (libc6,x86-64, OS ABI: Linux 2.3.99) => /lib64/libnvidia-tls.so.440.33.01
libnvidia-rtcore.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-rtcore.so.440.33.01
libnvidia-ptxjitcompiler.so.1 (libc6,x86-64) => /lib64/libnvidia-ptxjitcompiler.so.1
libnvidia-ptxjitcompiler.so (libc6,x86-64) => /lib64/libnvidia-ptxjitcompiler.so
...
Hey all - I'm happy to take this on if needed. I just replicated something similar on a machine I have which has not Nix but Guix as a package manager on top of CentOS. I'm happy to follow this through; it just won't be a top priority, since there is a workaround (aliasing or wrapping Singularity in a way that unsets the Nix PATH) and that gets things working:
https://github.com/sylabs/singularity/issues/5002#issuecomment-580900435
Maybe a solution that could work in all situations would be to use ldconfig from the PATH, and fall back to a sanitized ldconfig if the first one returns an error?
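Something like this rough sketch of the idea, perhaps (not a patch against the actual paths.go; the /sbin/ldconfig fallback location is an assumption):

package main

import (
	"fmt"
	"os/exec"
)

// ldconfigOutput tries "ldconfig -p" resolved from the user's PATH first,
// and only if that fails (as the Nix ldconfig does) retries a sanitized,
// well-known location. A real patch would probe the usual candidates.
func ldconfigOutput() ([]byte, error) {
	// First try ldconfig as resolved from the user's PATH.
	if out, err := exec.Command("ldconfig", "-p").Output(); err == nil {
		return out, nil
	}
	// It failed; retry an assumed sanitized location.
	return exec.Command("/sbin/ldconfig", "-p").Output()
}

func main() {
	out, err := ldconfigOutput()
	if err != nil {
		fmt.Println("could not execute ldconfig:", err)
		return
	}
	fmt.Printf("got %d bytes of ldconfig -p output\n", len(out))
}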
> Maybe a solution that could work in all situations would be to use ldconfig from the PATH, and fall back to a sanitized ldconfig if the first one returns an error?
Leave it with us... we'll get a solution but we need to be confident about it not causing a regression elsewhere - which involves considering and checking the weird and wonderful ways ROCm may be installed.
Given you can unset the Nix PATH, I'd encourage doing that in an alias or wrapper script for now.
Hi @mboisson - can I ask what causes the ldconfig binary to be in bin within your Nix profile (/cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/ldconfig)?
I've put nix on my Fedora machine to try and follow this through exactly, rather than using my existing Guix environment (although the principle is the same there).
If I set up a simple nix environment with gcc, which brings in glibc, then the gcc binary is on my PATH but ldconfig is not:
dave@piran:~
12:08 PM $ which gcc
/home/dave/.nix-profile/bin/gcc
dave@piran:~
12:08 PM $ which ldconfig
/usr/sbin/ldconfig
dave@piran:~
12:08 PM $ ls /nix/store/l3sgc39zdcbhx3lvlz9rmz0mv8fbmwn4-glibc-2.27-bin/bin
catchsegv getconf iconv ldconfig locale makedb pcprofiledump rpcgen sotruss tzselect zdump
gencat getent iconvconfig ldd localedef nscd pldd sln sprof xtrace zic
... so I have no issue with --nv or --rocm.
hmmm - it shows up if glibc is installed specifically, not coming in as a dependency of gcc. Okay.
12:17 PM $ nix-env -i glibc
warning: there are multiple derivations named 'glibc-2.27'; using the first one
installing 'glibc-2.27'
these paths will be fetched (42.59 MiB download, 131.39 MiB unpacked):
/nix/store/cb6pvqyl4rw7p93dbmr1csw67qrabz4v-glibc-2.27-debug
/nix/store/l5nwjh6clz2xnp0cxii93fq8p0c1cqn9-glibc-2.27-static
/nix/store/lan2w3ab1mvpxj3ppiw2sizh8i7rpz7s-busybox
/nix/store/n9acaakxahkv1q3av11l93p7rgd4xqsf-bootstrap-tools
copying path '/nix/store/lan2w3ab1mvpxj3ppiw2sizh8i7rpz7s-busybox' from 'https://cache.nixos.org'...
copying path '/nix/store/l5nwjh6clz2xnp0cxii93fq8p0c1cqn9-glibc-2.27-static' from 'https://cache.nixos.org'...
copying path '/nix/store/n9acaakxahkv1q3av11l93p7rgd4xqsf-bootstrap-tools' from 'https://cache.nixos.org'...
copying path '/nix/store/cb6pvqyl4rw7p93dbmr1csw67qrabz4v-glibc-2.27-debug' from 'https://cache.nixos.org'...
building '/nix/store/1w9hm9lkd26g0kmkxiw47vv1fncahx9l-user-environment.drv'...
created 120 symlinks in user environment
dave@piran:~
12:17 PM $ which ldconfig
/home/dave/.nix-profile/bin/ldconfig
@dctrud good that you found it, because I was not exactly sure. We do have glibc installed as itself indeed.
I'm afraid this will slip from the 3.6.0 milestone to 3.7.0, and may be considered for a future 3.6.x patch release, as it is an unusual configuration, the issue hasn't been reported by others, and a workaround is available by un-setting the Nix path.
The main issue is that if we introduce a change to fix this, we want to be able to test it automatically, and also to check the more 'standard' ldconfig handling for any regression. That is somewhat difficult/heavy to do, as it needs NVIDIA drivers in the environment, etc.
@mboisson - if you could try out #5671 that'd be appreciated. It should be a fix which is generally safe.
Version of Singularity:
Expected behavior
No error with ldconfig.
Actual behavior
An error with ldconfig, because the wrong ldconfig is picked up.
Explanation
I believe this is a regression from https://github.com/sylabs/singularity/issues/2445, introduced when util/nvidia/paths.go became https://github.com/sylabs/singularity/blob/master/pkg/util/gpu/paths.go.