apptainer / singularity

Singularity has been renamed to Apptainer as part of us moving the project to the Linux Foundation. This repo has been persisted as a snapshot right before the changes.
https://github.com/apptainer/apptainer

--nv --rocm won't find correct ldconfig when run from Nix userspace #5002

Closed: mboisson closed this issue 4 years ago

mboisson commented 4 years ago

Version of Singularity:

$ singularity --version
singularity version 3.5.2-1.el7
$ singularity exec --nv pytorch.simg bash
WARNING: Unable to capture nv bind points: could not execute ldconfig: exit status 1

Expected behavior

Not an error with ldconfig.

Actual behavior

Error with ldconfig, because it takes the wrong ldconfig.

Explanation

I believe this is a regression from https://github.com/sylabs/singularity/issues/2445

when the util/nvidia/paths.go became https://github.com/sylabs/singularity/blob/master/pkg/util/gpu/paths.go

mboisson commented 4 years ago

For comparison, this works fine:

$ PATH=/usr/sbin:$PATH singularity exec --nv pytorch.simg bash
Singularity> exit

So, having a different command called ldconfig, somewhere in the PATH, produces the error.

dtrudg commented 4 years ago

Can you let us know what type of environment is being used that provides its own ldconfig - to better understand what workflow this is arising in?

Singularity began sanitizing PATH in 3.1; this was further adjusted in 3.2 via https://github.com/sylabs/singularity/commit/5646ae764dddce5a1493de584884f70653f71bf2, as blanket sanitizing of the PATH caused issues for the plugin support. The GPU code has also been modified since then.

Because we bind in CUDA and ROCm related binaries that are searched on PATH we can't use a sanitized PATH for all of the GPU bind search. The CUDA/ROCm binaries may be in non-standard locations or provided by modules in some setups, so we need to find them in the user PATH, not a minimal PATH.

We would have to sanitize PATH for only the ldconfig call if we want to address this issue.
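A sanitized lookup could skip $PATH entirely and probe a fixed list of system directories. Here is a minimal shell sketch of that idea (the function name and directory list are illustrative only; Singularity's real lookup lives in Go, in pkg/util/gpu/paths.go):

```shell
#!/bin/sh
# find_in_dirs NAME DIR...: print the first DIR/NAME that is executable,
# return non-zero if none is found. Illustrative sketch, not Singularity code.
find_in_dirs() {
    name=$1; shift
    for dir in "$@"; do
        if [ -x "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    return 1
}

# A sanitized search would pass only system directories, ignoring $PATH:
# find_in_dirs ldconfig /sbin /usr/sbin /bin /usr/bin
```

With this approach the Nix store ldconfig would never be found, regardless of what the user's session PATH contains.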

mboisson commented 4 years ago

Hi. We provide an OS-independent Linux layer built with Nix: https://github.com/ComputeCanada/nixpkgs

This effectively means that the user space of our systems runs Nix, in which ldconfig does not quite work, and is not usually required to work, because Nix sets RPATH on all binaries that are compiled within it.

$ which ldconfig
/cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/ldconfig

$ ldconfig -p
ldconfig: Can't open cache file /cvmfs/soft.computecanada.ca/nix/store/63pk88rnmkzjblpxydvrmskkc8ci7cx6-glibc-2.24/etc/ld.so.cache: No such file or directory

The underlying OS (i.e. the privileged space) is CentOS 7 and has a working ldconfig, but Singularity finds the one in the user's session PATH instead of the one provided by the operating system.

dtrudg commented 4 years ago

Another thing to consider here... there are distributions like NixOS where ldconfig may not be in the standard paths.

mboisson commented 4 years ago

ldconfig -p on NixOS is known to not work https://github.com/NixOS/nixpkgs/issues/35387#issuecomment-367983818

dtrudg commented 4 years ago

@mboisson - got it. I was just thinking about Nix and Guix - as the above comment suggests.

The issue is that we have 2 unusual situations which are mutually incompatible...

1) If Nix/Guix is used as a package manager on top of a base distro, then you need to look in a sanitized PATH, so you get the base distro ldconfig and not the one in the Nix or Guix store.

2) If Nix/Guix is used as the base distro, then looking in a sanitized PATH for ldconfig would break things. There is no /bin/ldconfig; we have to use the one in PATH, as it lives in the Nix/Guix store.

These are both rather uncommon situations - and I'm not sure which to favor tbh.

mboisson commented 4 years ago

I have never used Nix as a base OS, I would assume that /bin/ldconfig -p still does not work on that platform.

How about having a configuration option for source RPM builds that will or will not sanitize, or that will hard-code the location of the ldconfig to use ?

dtrudg commented 4 years ago

Sorry - clicked wrong button there.

I have never used Nix as a base OS, I would assume that ldconfig -p still does not work.

That looks likely. I'm not sure about Guix, and we need to consider it for any potentially conflicting change, as we know Guix is used in HPC environments where Singularity is also used.

How about having a configuration option for source RPM builds that will or will not sanitize, or that will hard-code the location of the ldconfig to use ?

I would rather follow the pattern we already have of an optional entry in singularity.conf giving an explicit path to the executable we need to call (e.g. cryptsetup has an existing option), rather than hard-coding it into the source via mconfig.
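For illustration, such an entry might look like the following in singularity.conf (the directive name here is an assumption based on the pattern described, not necessarily the shipped one; consult the singularity.conf for your version):

```
# Path to the ldconfig executable used when resolving GPU libraries.
# Illustrative directive name; check your singularity.conf for the real one.
ldconfig path = /usr/sbin/ldconfig
```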

I think we'd gladly accept a patch for this. It's not something we'd necessarily prioritize for 3.5.3 given the timeline, and that this is quite a niche situation.

mboisson commented 4 years ago

I think we'd gladly accept a patch for this. It's not something we'd necessarily prioritize for 3.5.3 given the timeline, and that this is quite a niche situation.

That is unfortunate. All of Compute Canada (i.e. all clusters in Canada) is affected by this. However, we don't have the knowledge of the Singularity code base needed to implement this ourselves.

dtrudg commented 4 years ago

@mboisson - understood, and I'll ensure it is discussed.

An alternative may be for you to install nvidia-container-cli - if that is found, it'll be used to identify the binds instead of an ldconfig -p search.

Can you work around it by provisioning an alias, or a wrapper script for singularity, that modifies $PATH and sits earlier in $PATH so it will run instead of the 'real' singularity binary?

mboisson commented 4 years ago

Thanks. I will also have a look at nvidia-container-cli, but getting things installed on all our clusters is somewhat of a hurdle. That's why we have the Nix compatibility layer: to be independent of the exact state of the OS. We may also consider defining:

alias singularity='PATH=/usr/sbin:$PATH /path/to/singularity'

mboisson commented 4 years ago

Mmm, so I tried nvidia-container-cli, and it still does not work:

$ which nvidia-container-cli
~/nvidia-container-cli
$ nvidia-container-cli info
NVRM version:   440.33.01
CUDA version:   10.2

Device Index:   0
Device Minor:   2
Model:          Tesla V100-SXM2-16GB
Brand:          Tesla
GPU UUID:       GPU-b3a1a3ae-50ff-7bf1-4059-b271fe8fa941
Bus Location:   00000000:1d:00.0
Architecture:   7.0

$ nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia2
/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
/usr/lib64/libnvidia-ml.so.440.33.01
/usr/lib64/libnvidia-cfg.so.440.33.01
/usr/lib64/libcuda.so.440.33.01
/usr/lib64/libnvidia-opencl.so.440.33.01
/usr/lib64/libnvidia-ptxjitcompiler.so.440.33.01
/usr/lib64/libnvidia-fatbinaryloader.so.440.33.01
/usr/lib64/libnvidia-compiler.so.440.33.01
/usr/lib64/libnvidia-encode.so.440.33.01
/usr/lib64/libnvidia-opticalflow.so.440.33.01
/usr/lib64/libnvcuvid.so.440.33.01
/usr/lib64/libnvidia-eglcore.so.440.33.01
/usr/lib64/libnvidia-glcore.so.440.33.01
/usr/lib64/libnvidia-tls.so.440.33.01
/usr/lib64/libnvidia-glsi.so.440.33.01
/usr/lib64/libnvidia-fbc.so.440.33.01
/usr/lib64/libnvidia-ifr.so.440.33.01
/usr/lib64/libnvidia-rtcore.so.440.33.01
/usr/lib64/libnvoptix.so.440.33.01
/usr/lib64/libGLX_nvidia.so.440.33.01
/usr/lib64/libEGL_nvidia.so.440.33.01
/usr/lib64/libGLESv2_nvidia.so.440.33.01
/usr/lib64/libGLESv1_CM_nvidia.so.440.33.01
/usr/lib64/libnvidia-glvkspirv.so.440.33.01
/usr/lib64/libnvidia-cbl.so.440.33.01
$ singularity exec --nv pytorch.simg bash
WARNING: Unable to capture nv bind points: could not execute ldconfig: exit status 1
Singularity> exit
exit
bartoldeman commented 4 years ago

Checking the source: ldconfig -p is always executed, and its output is matched against either the configuration file nvliblist.conf or the output of nvidia-container-cli.

What happened is that commit b59fa15e29 changed the logic in paths.go so that it no longer tries the sanitized path first, but always looks in the user path (for both nvidia-container-cli and nvliblist.conf).

jmstover commented 4 years ago

so it no longer tries the sanitized path first but always looks in the user path

I think this was intentional. There was at least 1 edge case I can remember where the system ldconfig wouldn't give full output, but a different ldconfig not in the system path would.

The error you seem to be getting is that ldconfig is returning a 1 ...

could not execute ldconfig: exit status 1

I can purposefully cause ldconfig to fail if I restrict the permissions on /etc/ld.so.cache ... I'm not sure what else could cause ldconfig to exit with a 1.

What do you get if from a shell you run:

ldconfig -p | grep -i 'nvidia'

Does it look about right?

mboisson commented 4 years ago

Yes, that is because ldconfig is broken by design with Nix; it cannot be relied upon. That's why the PATH needs to be sanitized.

$ ldconfig -p | grep -i 'nvidia'
ldconfig: Can't open cache file /cvmfs/soft.computecanada.ca/nix/store/63pk88rnmkzjblpxydvrmskkc8ci7cx6-glibc-2.24/etc/ld.so.cache: No such file or directory
$ /usr/sbin/ldconfig -p | grep -i 'nvidia'
    libnvidia-tls.so.440.33.01 (libc6,x86-64, OS ABI: Linux 2.3.99) => /lib64/libnvidia-tls.so.440.33.01
    libnvidia-rtcore.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-rtcore.so.440.33.01
    libnvidia-ptxjitcompiler.so.1 (libc6,x86-64) => /lib64/libnvidia-ptxjitcompiler.so.1
    libnvidia-ptxjitcompiler.so (libc6,x86-64) => /lib64/libnvidia-ptxjitcompiler.so
    libnvidia-opticalflow.so.1 (libc6,x86-64) => /lib64/libnvidia-opticalflow.so.1
    libnvidia-opencl.so.1 (libc6,x86-64) => /lib64/libnvidia-opencl.so.1
    libnvidia-ml.so.1 (libc6,x86-64) => /lib64/libnvidia-ml.so.1
    libnvidia-ml.so (libc6,x86-64) => /lib64/libnvidia-ml.so
    libnvidia-ifr.so.1 (libc6,x86-64) => /lib64/libnvidia-ifr.so.1
    libnvidia-ifr.so (libc6,x86-64) => /lib64/libnvidia-ifr.so
    libnvidia-glvkspirv.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-glvkspirv.so.440.33.01
    libnvidia-glsi.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-glsi.so.440.33.01
    libnvidia-glcore.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-glcore.so.440.33.01
    libnvidia-fbc.so.1 (libc6,x86-64) => /lib64/libnvidia-fbc.so.1
    libnvidia-fbc.so (libc6,x86-64) => /lib64/libnvidia-fbc.so
    libnvidia-fatbinaryloader.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-fatbinaryloader.so.440.33.01
    libnvidia-encode.so.1 (libc6,x86-64) => /lib64/libnvidia-encode.so.1
    libnvidia-encode.so (libc6,x86-64) => /lib64/libnvidia-encode.so
    libnvidia-eglcore.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-eglcore.so.440.33.01
    libnvidia-compiler.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-compiler.so.440.33.01
    libnvidia-cfg.so.1 (libc6,x86-64) => /lib64/libnvidia-cfg.so.1
    libnvidia-cfg.so (libc6,x86-64) => /lib64/libnvidia-cfg.so
    libnvidia-cbl.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-cbl.so.440.33.01
    libnvidia-allocator.so.1 (libc6,x86-64) => /lib64/libnvidia-allocator.so.1
    libGLX_nvidia.so.0 (libc6,x86-64) => /lib64/libGLX_nvidia.so.0
    libGLESv2_nvidia.so.2 (libc6,x86-64) => /lib64/libGLESv2_nvidia.so.2
    libGLESv1_CM_nvidia.so.1 (libc6,x86-64) => /lib64/libGLESv1_CM_nvidia.so.1
    libEGL_nvidia.so.0 (libc6,x86-64) => /lib64/libEGL_nvidia.so.0

jmstover commented 4 years ago

@dctrud @cclerget So ... in cmd/internal/cli/actions_linux.go ... line 256'ish have something like:

userPath := ""
if !sanitize {
    userPath = os.Getenv("USER_PATH")
}

Then have a --sanitize CLI option, or something, that will keep the sanitized PATHs. I'm sure this could cause a headache where you need something from the sanitized PATH before the user PATH, but you also need something from the user PATH, which this will ignore.

Without changing the order in which PATH entries get set on the system ... I don't really see a way around this specific issue. Do you see something I'm not seeing?

cclerget commented 4 years ago

@mboisson Does the following work to force the Nix ldconfig to use the cache from the host?

ldconfig -r / -p | grep -i 'nvidia'

mboisson commented 4 years ago

No, it still fails:

$ ldconfig -r / -p | grep -i 'nvidia'
ldconfig: Can't open cache file /cvmfs/soft.computecanada.ca/nix/store/63pk88rnmkzjblpxydvrmskkc8ci7cx6-glibc-2.24/etc/ld.so.cache: No such file or directory

cclerget commented 4 years ago

And with ldconfig -C /etc/ld.so.cache -p ?

mboisson commented 4 years ago

Yes, this works:

$ ldconfig -C /etc/ld.so.cache -p | grep nvidia
    libnvidia-tls.so.440.33.01 (libc6,x86-64, OS ABI: Linux 2.3.99) => /lib64/libnvidia-tls.so.440.33.01
    libnvidia-rtcore.so.440.33.01 (libc6,x86-64) => /lib64/libnvidia-rtcore.so.440.33.01
    libnvidia-ptxjitcompiler.so.1 (libc6,x86-64) => /lib64/libnvidia-ptxjitcompiler.so.1
    libnvidia-ptxjitcompiler.so (libc6,x86-64) => /lib64/libnvidia-ptxjitcompiler.so
...

dtrudg commented 4 years ago

Hey all - I'm happy to take this on if needed. I just replicated something similar on a machine of mine which has Guix, rather than Nix, as a package manager on top of CentOS. I'm happy to follow this through; it just won't be a top priority, since there is a workaround of aliasing or wrapping Singularity so as to unset the Nix PATH, and having it work...

https://github.com/sylabs/singularity/issues/5002#issuecomment-580900435

mboisson commented 4 years ago

Maybe a solution that could work in all situations would be to use the ldconfig from the PATH, and fall back to a sanitized ldconfig if the first one returns an error?
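That fallback could be sketched in shell like this (an illustration of the suggested behaviour, not the actual Singularity implementation; the function name is made up):

```shell
#!/bin/sh
# try_with_fallback PRIMARY FALLBACK: run PRIMARY; if it exits non-zero,
# run FALLBACK instead. Models "use the PATH ldconfig, and fall back to a
# sanitized one on error". Illustrative sketch only.
try_with_fallback() {
    sh -c "$1" 2>/dev/null || sh -c "$2"
}

# Example: the primary fails (like the Nix ldconfig), so the fallback runs.
try_with_fallback 'exit 1' 'echo fallback-ran'
```

Running the example prints `fallback-ran`, since the primary command exits non-zero and the fallback is executed instead.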

dtrudg commented 4 years ago

Maybe a solution that could work in all situations would be to use the ldconfig from the PATH, and fall back to a sanitized ldconfig if the first one returns an error?

Leave it with us... we'll get a solution, but we need to be confident that it won't cause a regression elsewhere - which involves considering and checking the weird and wonderful ways ROCm may be installed.

Given you can unset the Nix PATH I'd encourage doing that in an alias or wrapper script for now.

dtrudg commented 4 years ago

Hi @mboisson - can I ask what causes the ldconfig binary to be in bin within your nix profile (/cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/ldconfig)?

I've put nix on my Fedora machine to try and follow this through exactly, rather than using my existing guix environment (although the principle is the same there).

If I setup a simple nix environment with gcc which brings in glibc then the gcc binary is on my PATH but ldconfig is not:

dave@piran:~
12:08 PM $ which gcc
/home/dave/.nix-profile/bin/gcc
dave@piran:~
12:08 PM $ which ldconfig
/usr/sbin/ldconfig
dave@piran:~
12:08 PM $ ls /nix/store/l3sgc39zdcbhx3lvlz9rmz0mv8fbmwn4-glibc-2.27-bin/bin
catchsegv  getconf  iconv        ldconfig  locale     makedb  pcprofiledump  rpcgen  sotruss  tzselect  zdump
gencat     getent   iconvconfig  ldd       localedef  nscd    pldd           sln     sprof    xtrace    zic

... so I have no issue with --nv or --rocm

dtrudg commented 4 years ago

Hmmm - it shows up if glibc is installed explicitly, not pulled in as a dependency of gcc. Okay.

12:17 PM $ nix-env -i glibc
warning: there are multiple derivations named 'glibc-2.27'; using the first one
installing 'glibc-2.27'
these paths will be fetched (42.59 MiB download, 131.39 MiB unpacked):
  /nix/store/cb6pvqyl4rw7p93dbmr1csw67qrabz4v-glibc-2.27-debug
  /nix/store/l5nwjh6clz2xnp0cxii93fq8p0c1cqn9-glibc-2.27-static
  /nix/store/lan2w3ab1mvpxj3ppiw2sizh8i7rpz7s-busybox
  /nix/store/n9acaakxahkv1q3av11l93p7rgd4xqsf-bootstrap-tools
copying path '/nix/store/lan2w3ab1mvpxj3ppiw2sizh8i7rpz7s-busybox' from 'https://cache.nixos.org'...
copying path '/nix/store/l5nwjh6clz2xnp0cxii93fq8p0c1cqn9-glibc-2.27-static' from 'https://cache.nixos.org'...
copying path '/nix/store/n9acaakxahkv1q3av11l93p7rgd4xqsf-bootstrap-tools' from 'https://cache.nixos.org'...
copying path '/nix/store/cb6pvqyl4rw7p93dbmr1csw67qrabz4v-glibc-2.27-debug' from 'https://cache.nixos.org'...
building '/nix/store/1w9hm9lkd26g0kmkxiw47vv1fncahx9l-user-environment.drv'...
created 120 symlinks in user environment
dave@piran:~
12:17 PM $ which ldconfig
/home/dave/.nix-profile/bin/ldconfig

mboisson commented 4 years ago

@dctrud good that you found it, because I was not exactly sure. We do indeed have glibc installed as a package in its own right.

dtrudg commented 4 years ago

I'm afraid that this will slip from the 3.6.0 to the 3.7.0 milestone, and may be considered for a future 3.6.x patch release, as it is an unusual configuration, the issue hasn't been reported by others, and a workaround is available by un-setting the Nix PATH.

The main issue is that if we introduce a change to fix this, we want it to be testable automatically... and we also want the more 'standard' ldconfig behaviour checked for any regression. That is somewhat difficult / heavy to do, as it needs NVIDIA drivers in the environment etc.

dtrudg commented 4 years ago

@mboisson - if you could try out #5671, that would be appreciated. It should be a fix that is generally safe.