Also anatomist, and even AimsFileInfo, exhibit the same behaviour on my workstation:
(venv) ➜ ~ % casa_distro run verbose=1 AimsFileInfo
----------------------------------------
Running singularity with the following command:
'singularity' 'run' '--cleanenv' '--pwd' '/casa/host/home' '--bind' '/mnt:/mnt' '--bind' '/volatile2:/volatile2' '--bind' '/media:/media' '--bind' '/volatile:/volatile' '--bind' '/volatile/bv/casa-distro-3-repo/brainvisa-dev-ubuntu-18.04/host/home:/casa/home' '--bind' '/volatile/bv/casa-distro-3-repo/brainvisa-dev-ubuntu-18.04/host:/casa/host' '--bind' '/neurospin:/neurospin' '--bind' '/srv:/srv' '--bind' '/i2bm:/i2bm' '--bind' '/home/yl243478' '--home' '/casa/host/home' '--nv' '--env' 'PS1=\[\033[33m\]\u@\h \$\[\033[0m\] ' '/volatile/bv/casa-distro-3-repo/casa-dev-ubuntu-18.04.sif' 'AimsFileInfo'
Using the following environment:
[...]
SINGULARITYENV_CASA_BRANCH=bug_fix
SINGULARITYENV_CASA_DISTRO=brainvisa-dev-ubuntu-18.04
SINGULARITYENV_CASA_HOST_DIR=/volatile/bv/casa-distro-3-repo/brainvisa-dev-ubuntu-18.04
SINGULARITYENV_CASA_SYSTEM=ubuntu-18.04
SINGULARITYENV_DISPLAY=:0
SINGULARITYENV_XAUTHORITY=/casa/host/home/.Xauthority
[...]
----------------------------------------
AimsFileInfo: value missing for option "-i"
(venv) (1)➜ ~ % casa_distro run verbose=1 AimsFileInfo
----------------------------------------
Running singularity with the following command:
'singularity' 'run' '--cleanenv' '--pwd' '/casa/host/home' '--bind' '/i2bm:/i2bm' '--bind' '/volatile/bv/casa-distro-3-repo/brainvisa-dev-ubuntu-18.04/host/home:/casa/home' '--bind' '/mnt:/mnt' '--bind' '/media:/media' '--bind' '/srv:/srv' '--bind' '/volatile/bv/casa-distro-3-repo/brainvisa-dev-ubuntu-18.04/host:/casa/host' '--bind' '/neurospin:/neurospin' '--bind' '/volatile2:/volatile2' '--bind' '/volatile:/volatile' '--bind' '/home/yl243478' '--home' '/casa/host/home' '--nv' '--env' 'PS1=\[\033[33m\]\u@\h \$\[\033[0m\] ' '/volatile/bv/casa-distro-3-repo/casa-dev-ubuntu-18.04.sif' 'AimsFileInfo'
Using the following environment:
[...]
SINGULARITYENV_CASA_BRANCH=bug_fix
SINGULARITYENV_CASA_DISTRO=brainvisa-dev-ubuntu-18.04
SINGULARITYENV_CASA_HOST_DIR=/volatile/bv/casa-distro-3-repo/brainvisa-dev-ubuntu-18.04
SINGULARITYENV_CASA_SYSTEM=ubuntu-18.04
SINGULARITYENV_DISPLAY=:0
SINGULARITYENV_XAUTHORITY=/casa/host/home/.Xauthority
[...]
----------------------------------------
Segmentation fault
I don't get this behaviour (but I have not updated my build yet). Could you please run:
casa_distro run verbose=1 gdb AimsFileInfo
then run the programs in gdb to get a traceback?
casa_distro run verbose=1 gdb AimsFileInfo
(gdb) bt
#0 0x00007ffff337e76e in pthread_mutex_init (mutex=0x5555558361f0, mutexattr=0x0) at forward.c:188
#1 0x00007fffe40176d2 in QWaitCondition::QWaitCondition() () at /usr/lib/x86_64-linux-gnu/libQt5Core.so.5
#2 0x00007fffe3082820 in () at /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5
#3 0x00007ffff7de5783 in call_init (env=0x7fffffffe768, argv=0x7fffffffe758, argc=1, l=<optimized out>) at dl-init.c:72
#4 0x00007ffff7de5783 in _dl_init (main_map=main_map@entry=0x55555582e4c0, argc=1, argv=0x7fffffffe758, env=0x7fffffffe768)
at dl-init.c:119
#5 0x00007ffff7dea24f in dl_open_worker (a=a@entry=0x7fffffffdc90) at dl-open.c:522
#6 0x00007ffff33b551f in __GI__dl_catch_exception (exception=0x7fffffffdc70, operate=0x7ffff7de9e10 <dl_open_worker>, args=0x7fffffffdc90) at dl-error-skeleton.c:196
#7 0x00007ffff7de981a in _dl_open (file=0x5555557c5fa0 "libaimsqsqlgraphformat.so.4.6.2", mode=-2147483390, caller_dlopen=0x7ffff5079750 <carto::PluginLoader::loadPluginFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+2896>, nsid=<optimized out>, argc=1, argv=<optimized out>, env=0x7fffffffe768) at dl-open.c:605
#8 0x00007ffff0843f96 in dlopen_doit (a=a@entry=0x7fffffffdec0) at dlopen.c:66
#9 0x00007ffff33b551f in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffde60, operate=0x7ffff0843f40 <dlopen_doit>, args=0x7fffffffdec0) at dl-error-skeleton.c:196
#10 0x00007ffff33b55af in __GI__dl_catch_error (objname=0x5555557c6040, errstring=0x5555557c6048, mallocedp=0x5555557c6038, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:215
#11 0x00007ffff0844745 in _dlerror_run (operate=operate@entry=0x7ffff0843f40 <dlopen_doit>, args=args@entry=0x7fffffffdec0)
at dlerror.c:162
#12 0x00007ffff0844051 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#13 0x00007ffff5079750 in carto::PluginLoader::loadPluginFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) ()
at /casa/host/build/lib/libsoma-io.so.4.6.2
#14 0x00007ffff5079f67 in carto::PluginLoader::load(int, bool) () at /casa/host/build/lib/libsoma-io.so.4.6.2
#15 0x00007ffff504bf5b in carto::CartoApplication::initialize() () at /casa/host/build/lib/libsoma-io.so.4.6.2
#16 0x00007ffff6c209af in aims::AimsApplication::initialize() () at /casa/host/build/lib/libaimsdata.so.4.6.2
#17 0x0000555555560179 in main ()
So it crashes in Qt/pthread while loading the IO module libaimsqsqlgraphformat. This module is actually linked against Qt, itself linked against OpenGL - I'm not at all sure this has anything to do with the problem here, but OpenGL is a "usual suspect" when things are not working well ;)
But here the problem rather occurs in threads, so I don't know what to think about that...
I can confirm that the issue occurs only with opengl=nv (or auto). It does not occur with opengl=container or opengl=software.
I have version 384.130 of the nvidia driver, on Ubuntu 16.04.7.
Also, I am using Python 3.5.2 to run casa-distro, which means that dictionary iteration order is not deterministic between runs. This might be the cause of the randomness.
Alas, I get the same error with Python 2. Indeed, the same environment, run with the exact same singularity command line, randomly succeeds or fails.
However, I noticed some interesting behaviour: in a casa-distro shell where the segfault occurs, it will occur consistently every time.
yl243478@is234203 $ AimsFileInfo
Segmentation fault
yl243478@is234203 $ AimsFileInfo
Segmentation fault
yl243478@is234203 $ AimsFileInfo
Segmentation fault
yl243478@is234203 $ AimsFileInfo
Segmentation fault
yl243478@is234203 $ AimsFileInfo
Segmentation fault
In a shell where it does not occur, it will never occur.
yl243478@is234203 $ AimsFileInfo
AimsFileInfo: value missing for option "-i"
yl243478@is234203 $ AimsFileInfo
AimsFileInfo: value missing for option "-i"
yl243478@is234203 $ AimsFileInfo
AimsFileInfo: value missing for option "-i"
yl243478@is234203 $ AimsFileInfo
AimsFileInfo: value missing for option "-i"
yl243478@is234203 $ AimsFileInfo
AimsFileInfo: value missing for option "-i"
I have a similar situation at home for OpenGL commands. Inside a given casa-distro shell, things are consistent (all work or all segfault), but from one run of casa_distro to another, it's completely random. Is it a problem in Singularity? In our setup or images? In the nvidia driver? How can we tell? This erratic behaviour is a serious problem: if we release a new brainvisa version in this situation, many users will complain (or drop the software) and get a very bad opinion of it.
Can you remind us which versions of Ubuntu and of the nvidia driver you have on your machine?
It's Ubuntu 18.04 with driver 390.138, I think (not totally sure which one is actually loaded now, I am using a remote connection to it).
I found a way to fix the issue on my machine: by installing nvidia-container-cli (the libnvidia-container-tools package) as described on https://nvidia.github.io/libnvidia-container/.
It seems that Singularity is able to use that tool from NVidia (since Singularity 2.6, see the release notes) in order to better configure the GPU in the container. I have no idea what it actually does.
So, I will change the behaviour of opengl=auto to only activate --nv if nvidia-container-cli is present.
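For reference, here is a minimal sketch (in Python, not the actual casa-distro code; the function and parameter names are made up) of what such an opengl=auto decision could look like:

```python
# Hypothetical sketch of the opengl=auto decision; not the real casa-distro code.
import shutil


def want_nv_option(opengl_mode):
    """Return True if '--nv' should be passed to singularity."""
    if opengl_mode == 'nv':
        return True   # explicit request: always honour it
    if opengl_mode in ('container', 'software'):
        return False  # rendering is handled inside the container
    # opengl_mode == 'auto': enable --nv only if nvidia-container-cli is found,
    # since singularity relies on it to pick the right host libraries.
    return shutil.which('nvidia-container-cli') is not None
```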
How did you do that? I'm following the instructions on https://nvidia.github.io/libnvidia-container/, I have added the repository, but it seems empty - no nvidia-container-tools package. When I go to the URL of the repository, https://nvidia.github.io/libnvidia-container/stable/ubuntu16.04/amd64/, with a web browser, I just see the message:
# Unsupported distribution! # Check https://nvidia.github.io/libnvidia-container
I'm on an Ubuntu 16.04 laptop, which is actually listed as supported...
No more luck on an Ubuntu 18.04 machine...
@denisri My bad, the package is actually named libnvidia-container-tools. Sorry.
Oh, thanks. Search tools for apt are so poor that I couldn't find it on my own... I had to install an older version on my laptop, since the latest ones are not compatible with older drivers (https://github.com/NVIDIA/nvidia-docker/issues/1280). Do we just have to install the package, and that's all? Or do we need to use it manually to configure anything? In other words, have you understood a little bit what this tool does and how it is used (automatically?) by singularity? Could singularity, by any chance, use it from inside the container if we install it in the container (that would actually be great)?
I have just installed the package and it magically works. I have not yet looked at how the tool works at all. I just noticed that it fixes another failure scenario: when you use --nv on an X server that cannot access NVidia hardware (Xvnc or x2go), it stops some of the NVidia libs from being loaded, which allows the software to work (with mesa) instead of crashing consistently.
Well, after installing libnvidia-container-tools version 1.0.7 (which seems to work), singularity couldn't run glxgears at all using --nv (X errors). After searching a bit (https://github.com/hpcng/singularity/pull/1681/commits/fa7162ca91df6cfbebeed54aa8bde9958169b765) it seems that nvidia-container-cli is called with the options list -cguv, which do not exist (probably did not in this version), and this makes the whole thing go wrong (in the container, in /.singularity.d/libs I see some nvidia and cuda libs, but no libGL).
So this solution only works for recent versions of libnvidia-container-tools, which themselves work only with recent nvidia drivers (at least until they release a newer version of the tool fixing the driver version problem). Thus we may have to test the tool itself before using --nv for the option opengl=auto in casa_distro.
The options -cguv don't exist in recent nvidia-container-cli either, so the code above must be outdated. Anyway, with older versions of the tool, singularity doesn't work correctly. Installing a newer version (which does not work on my system) doesn't seem to do any harm in singularity (but is not helpful either).
It seems that current Singularity only calls nvidia-container-cli in two ways, in order to retrieve the list of files that it will mount into the container:
nvidia-container-cli list --binaries --libraries
nvidia-container-cli list --ipcs
https://github.com/hpcng/singularity/blob/v3.6.3/pkg/util/gpu/paths.go#L91-L93 https://github.com/hpcng/singularity/blob/v3.6.3/pkg/util/gpu/paths.go#L230
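For debugging, the same two queries can be run by hand and their output compared with the content of /.singularity.d/libs. A small sketch (assuming only that nvidia-container-cli is in the PATH):

```python
# Sketch: reproduce the two nvidia-container-cli queries used by Singularity,
# so the reported files can be compared with what ends up in /.singularity.d/libs.
import subprocess


def nvidia_container_files():
    files = []
    for args in (['list', '--binaries', '--libraries'], ['list', '--ipcs']):
        out = subprocess.check_output(['nvidia-container-cli'] + args,
                                      universal_newlines=True)
        files.extend(line for line in out.splitlines() if line.strip())
    return files


if __name__ == '__main__':
    print('\n'.join(nvidia_container_files()))
```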
If there is an incompatibility between certain versions of Singularity and nvidia-container-cli, this is clearly a bug in Singularity. Maybe we should document it... or auto-detect it, if there is an easy way.
OK, the bug is in nvidia-container-cli (1.0.7) then: on an Ubuntu 16.04 host, it lists some nvidia libraries, but not libGL, which in this case doesn't get mounted in the container. This may be OK on recent distributions / drivers (a single libGL on the system, using libgldispatch to switch to an implementation), but it was not implemented this way on Ubuntu 16 + old drivers (not sure the driver version matters, since the system libGL doesn't use libgldispatch yet).
So my bet is that nvidia-container-cli just did not work on Ubuntu 16.04 (do recent versions work with newer drivers, if there are newer drivers for Ubuntu 16?)
Testing it is a bit tricky: we should check whether nvidia-container-cli list --libraries contains libGL or not, and if not, check whether the system libGL links against libgldispatch. If neither is the case, it will presumably not work. But then we end up redoing part of nvidia-container-cli list's work...
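A rough sketch of that test (assumptions: ldd is available, the example libGL path is a guess, and this code does not exist in casa-distro):

```python
# Sketch of the test described above: --nv is presumably usable if either
# nvidia-container-cli reports a GL library, or the system libGL goes through
# libGLdispatch, in which case libGLX_nvidia from the list should be enough.
import subprocess


def cli_lists_gl():
    out = subprocess.check_output(
        ['nvidia-container-cli', 'list', '--libraries'],
        universal_newlines=True)
    return any('libGL.so' in line or 'libGLX' in line
               for line in out.splitlines())


def system_libgl_uses_gldispatch(libgl='/usr/lib/x86_64-linux-gnu/libGL.so.1'):
    # example path only; a real test would locate libGL via ldconfig
    ldd = subprocess.check_output(['ldd', libgl], universal_newlines=True)
    return 'libGLdispatch' in ldd


def nv_likely_usable():
    return cli_lists_gl() or system_libgl_uses_gldispatch()
```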
> So my bet is that nvidia-container-cli just did not work on Ubuntu 16.04 (do recent versions work with newer drivers, if there are newer drivers for Ubuntu 16?)
nvidia-container-cli version 1.3.0 works perfectly on my Ubuntu 16.04 workstation, which is using version 384.130 of the NVidia driver.
Can't you upgrade your NVidia driver, in order to use a recent nvidia-container-cli? I think that if the NVidia software/drivers are broken, we just cannot support them. If people need an old driver to support old hardware, then software rendering is the only option.
> it lists some nvidia libraries, but not libGL, which in this case doesn't get mounted in the container.
Oh, I just noticed that it does not include libGL either on my workstation:
% nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-modeset
/dev/nvidia0
/usr/lib/nvidia-384/bin/nvidia-smi
/usr/lib/nvidia-384/bin/nvidia-debugdump
/usr/lib/nvidia-384/bin/nvidia-persistenced
/usr/lib/nvidia-384/bin/nvidia-cuda-mps-control
/usr/lib/nvidia-384/bin/nvidia-cuda-mps-server
/usr/lib/nvidia-384/libnvidia-ml.so.384.130
/usr/lib/nvidia-384/libnvidia-cfg.so.384.130
/usr/lib/x86_64-linux-gnu/libcuda.so.384.130
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.384.130
/usr/lib/nvidia-384/libnvidia-ptxjitcompiler.so.384.130
/usr/lib/nvidia-384/libnvidia-fatbinaryloader.so.384.130
/usr/lib/nvidia-384/libnvidia-compiler.so.384.130
/usr/lib/nvidia-384/vdpau/libvdpau_nvidia.so.384.130
/usr/lib/nvidia-384/libnvidia-encode.so.384.130
/usr/lib/nvidia-384/libnvcuvid.so.384.130
/usr/lib/nvidia-384/libnvidia-eglcore.so.384.130
/usr/lib/nvidia-384/libnvidia-glcore.so.384.130
/usr/lib/nvidia-384/tls/libnvidia-tls.so.384.130
/usr/lib/nvidia-384/libnvidia-glsi.so.384.130
/usr/lib/nvidia-384/libnvidia-fbc.so.384.130
/usr/lib/nvidia-384/libnvidia-ifr.so.384.130
/usr/lib/nvidia-384/libGLX_nvidia.so.384.130
/usr/lib/nvidia-384/libEGL_nvidia.so.384.130
/usr/lib/nvidia-384/libGLESv2_nvidia.so.384.130
/usr/lib/nvidia-384/libGLESv1_CM_nvidia.so.384.130
/usr/lib32/nvidia-384/libnvidia-ml.so.384.130
/usr/lib32/nvidia-384/libnvidia-cfg.so.384.130
/usr/lib/i386-linux-gnu/libcuda.so.384.130
/usr/lib/i386-linux-gnu/libnvidia-opencl.so.384.130
/usr/lib32/nvidia-384/libnvidia-ptxjitcompiler.so.384.130
/usr/lib32/nvidia-384/libnvidia-fatbinaryloader.so.384.130
/usr/lib32/nvidia-384/libnvidia-compiler.so.384.130
/usr/lib32/nvidia-384/vdpau/libvdpau_nvidia.so.384.130
/usr/lib32/nvidia-384/libnvidia-encode.so.384.130
/usr/lib32/nvidia-384/libnvcuvid.so.384.130
/usr/lib32/nvidia-384/libnvidia-eglcore.so.384.130
/usr/lib32/nvidia-384/libnvidia-glcore.so.384.130
/usr/lib32/nvidia-384/tls/libnvidia-tls.so.384.130
/usr/lib32/nvidia-384/libnvidia-glsi.so.384.130
/usr/lib32/nvidia-384/libnvidia-fbc.so.384.130
/usr/lib32/nvidia-384/libnvidia-ifr.so.384.130
/usr/lib32/nvidia-384/libGLX_nvidia.so.384.130
/usr/lib32/nvidia-384/libEGL_nvidia.so.384.130
/usr/lib32/nvidia-384/libGLESv2_nvidia.so.384.130
/usr/lib32/nvidia-384/libGLESv1_CM_nvidia.so.384.130
/run/nvidia-persistenced/socket
Not sure why the driver hasn't been updated. I think some newer driver versions drop support for some (older) hardware and thus cannot be installed, but I don't know if that is my situation here. I'll check.
So you don't have a libGL in /.singularity.d/libs? And it works? How does it manage that?
Anyway, you have many more libs here than I have on my system...
After upgrading the driver to 384.130, nvidia-container-cli starts working (this is the good news).
The bad news is that, now, programs sometimes crash (whereas they never did with the older driver 340):
denis@averell $ ls /.singularity.d/libs
libEGL_nvidia.so.0 libnvidia-encode.so.1
libGLESv1_CM_nvidia.so.1 libnvidia-fatbinaryloader.so.384.130
libGLESv2_nvidia.so.2 libnvidia-fbc.so
libGLX_nvidia.so.0 libnvidia-fbc.so.1
libcuda.so libnvidia-glcore.so.384.130
libcuda.so.1 libnvidia-glsi.so.384.130
libnvcuvid.so libnvidia-ifr.so
libnvcuvid.so.1 libnvidia-ifr.so.1
libnvidia-cfg.so libnvidia-ml.so
libnvidia-cfg.so.1 libnvidia-ml.so.1
libnvidia-compiler.so libnvidia-opencl.so.1
libnvidia-compiler.so.384.130 libnvidia-ptxjitcompiler.so.1
libnvidia-eglcore.so.384.130 libnvidia-tls.so.384.130
libnvidia-encode.so libvdpau_nvidia.so
denis@averell $ glxgears
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
What method are you using to access your desktop remotely? I previously found that I got different behaviour when using x11vnc with a physical X server, turbovnc, or x2go.
However, since I installed nvidia-container-cli, OpenGL works consistently with --nv under these 3 setups, because it is able to fall back to software rendering (using the APT-installed mesa in the container) when I am not using a physical X server.
If we have different behaviour, we should find out what differs between our setups. One obvious difference is that you are on Ubuntu 18.04 and I am on Ubuntu 16.04. We may not have exactly the same version of the casa-dev image; I am running a pull_image right now and I will test again.
Edit: @denisri I just ran pull_image and now I have the same behaviour as you (random failures). Too bad I did not keep the old image to run a diff... I will investigate what changed in the images recently.
Here I'm speaking of my laptop, locally, which is running Ubuntu 16.04 (I have another machine running Ubuntu 18.04, which I currently access remotely, so you're right, I was not clear).
So now nvidia-container-cli seems to be working on Ubuntu 16.04 + driver 384, and I now get random crashes. Before I upgraded the driver (I was using 340) I didn't experience any instability on this laptop (without needing nvidia-container-cli). So for me here, nvidia-container-cli doesn't really seem to help...
At home on the Ubuntu 18.04 machine I did experience crashes, but I had not installed nvidia-container-cli yet. I can't really test that remotely on this machine (well, I could perhaps use x11vnc or another remote desktop system, but I haven't for now).
Now, could there be a link with the image?
When I backtrace a crashing program (glxgears), I get:
denis@averell $ gdb glxgears
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from glxgears...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/bin/glxgears
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
*** stack smashing detected ***: <unknown> terminated
Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff70c28b1 in __GI_abort () at abort.c:79
#2 0x00007ffff710b907 in __libc_message (action=action@entry=do_abort,
fmt=fmt@entry=0x7ffff7238be8 "*** %s ***: %s terminated\n")
at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff71b6e81 in __GI___fortify_fail_abort (
need_backtrace=need_backtrace@entry=false,
msg=msg@entry=0x7ffff7238bc6 "stack smashing detected")
at fortify_fail.c:33
#4 0x00007ffff71b6e42 in __stack_chk_fail () at stack_chk_fail.c:29
#5 0x00007ffff71e957a in __GI__dl_catch_exception (exception=0x7fffffffdf60,
operate=0x7ffff7de9e10 <dl_open_worker>, args=0x7fffffffdf80)
at dl-error-skeleton.c:207
#6 0x00007ffff7de981a in _dl_open (file=0x55555576b0a0 "libGLX_nvidia.so.0",
mode=-2147483647, caller_dlopen=0x7ffff6e59606, nsid=<optimized out>,
argc=1, argv=<optimized out>, env=0x7fffffffe788) at dl-open.c:605
#7 0x00007ffff6550f96 in dlopen_doit (a=a@entry=0x7fffffffe1b0) at dlopen.c:66
#8 0x00007ffff71e951f in __GI__dl_catch_exception (
exception=exception@entry=0x7fffffffe150,
operate=0x7ffff6550f40 <dlopen_doit>, args=0x7fffffffe1b0)
at dl-error-skeleton.c:196
#9 0x00007ffff71e95af in __GI__dl_catch_error (objname=0x55555575a270,
errstring=0x55555575a278, mallocedp=0x55555575a268,
operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:215
#10 0x00007ffff6551745 in _dlerror_run (
operate=operate@entry=0x7ffff6550f40 <dlopen_doit>,
args=args@entry=0x7fffffffe1b0) at dlerror.c:162
#11 0x00007ffff6551051 in __dlopen (file=<optimized out>, mode=<optimized out>)
at dlopen.c:87
#12 0x00007ffff6e59606 in ?? () from /usr/lib/x86_64-linux-gnu/libGLX.so.0
#13 0x00007ffff6e5a958 in ?? () from /usr/lib/x86_64-linux-gnu/libGLX.so.0
#14 0x00007ffff6e54231 in glXChooseVisual ()
from /usr/lib/x86_64-linux-gnu/libGLX.so.0
#15 0x000055555555758b in ?? ()
#16 0x0000555555555a87 in ?? ()
#17 0x00007ffff70a3b97 in __libc_start_main (main=0x555555555930, argc=1,
argv=0x7fffffffe778, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fffffffe768)
at ../csu/libc-start.c:310
#18 0x000055555555641a in ?? ()
Thus it crashes inside /usr/lib/x86_64-linux-gnu/libGLX.so.0, the system libGLX, whereas on the host machine there is an nvidia-specific libGLX in /usr/lib/nvidia-384/libGLX.so.0, which is not mounted into the container's /.singularity.d/libs/. Maybe this is the missing item.
Bingo! When I mount the host filesystem into the container at /host, in a situation where glxgears crashes:
denis@averell $ glxgears
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
Then I do:
denis@averell $ LD_LIBRARY_PATH=/host/usr/lib/nvidia-384:/casa/host/build/lib:/casa/host/lib:/.singularity.d/libs:/usr/local/lib glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
and it works (see, I have prepended /host/usr/lib/nvidia-384, the host driver libs directory, to LD_LIBRARY_PATH).
So obviously, nvidia-container-cli is not completely doing its job.
If I remove nvidia-container-cli from my system (still on Ubuntu 16.04), I still get random behaviour, but a different one:
denis@averell $ glxgears
Segmentation fault (core dumped)
(no "stack smashing" something). with the following mounted libs:
denis@averell $ ls /.singularity.d/libs
libEGL.so libnvidia-cfg.so.1
libEGL.so.1 libnvidia-compiler.so
libEGL_nvidia.so.0 libnvidia-compiler.so.384.130
libGL.so libnvidia-egl-wayland.so.1.0.1
libGL.so.1 libnvidia-eglcore.so.384.130
libGLESv1_CM.so libnvidia-encode.so
libGLESv1_CM.so.1 libnvidia-encode.so.1
libGLESv1_CM_nvidia.so.1 libnvidia-fatbinaryloader.so.384.130
libGLESv2.so libnvidia-fbc.so
libGLESv2.so.2 libnvidia-fbc.so.1
libGLESv2_nvidia.so.2 libnvidia-glcore.so.384.130
libGLX.so libnvidia-glsi.so.384.130
libGLX.so.0 libnvidia-gtk2.so.361.42
libGLX_nvidia.so.0 libnvidia-gtk3.so.361.42
libGLdispatch.so.0 libnvidia-ifr.so
libOpenCL.so.1 libnvidia-ifr.so.1
libOpenGL.so libnvidia-ml.so
libOpenGL.so.0 libnvidia-ml.so.1
libcuda.so libnvidia-opencl.so.1
libcuda.so.1 libnvidia-ptxjitcompiler.so.1
libnvcuvid.so libnvidia-tls.so.384.130
libnvcuvid.so.1 libnvidia-wfb.so.1
libnvidia-cfg.so libvdpau_nvidia.so
and gdb says:
denis@averell $ gdb glxgears
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from glxgears...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/bin/glxgears
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7059975 in __GI__IO_link_in (fp=fp@entry=0x55555575bdc0)
at genops.c:92
92 genops.c: No such file or directory.
(gdb) bt
#0 0x00007ffff7059975 in __GI__IO_link_in (fp=fp@entry=0x55555575bdc0)
at genops.c:92
#1 0x00007ffff70581a2 in _IO_new_file_init_internal (
fp=fp@entry=0x55555575bdc0) at fileops.c:114
#2 0x00007ffff704af07 in __fopen_internal (is32=1, mode=0x7ffff46ccacb "rb",
filename=0x7fffffffee50 "/casa/host/home/.Xauthority") at iofopen.c:74
#3 _IO_new_fopen (filename=0x7fffffffee50 "/casa/host/home/.Xauthority",
mode=0x7ffff46ccacb "rb") at iofopen.c:89
#4 0x00007ffff46cc1e9 in XauGetBestAuthByAddr ()
from /usr/lib/x86_64-linux-gnu/libXau.so.6
#5 0x00007ffff48de76f in ?? () from /usr/lib/x86_64-linux-gnu/libxcb.so.1
#6 0x00007ffff48de909 in ?? () from /usr/lib/x86_64-linux-gnu/libxcb.so.1
#7 0x00007ffff48de453 in xcb_connect_to_display_with_auth_info ()
from /usr/lib/x86_64-linux-gnu/libxcb.so.1
#8 0x00007ffff73fa522 in _XConnectXCB ()
from /usr/lib/x86_64-linux-gnu/libX11.so.6
#9 0x00007ffff73eaeb2 in XOpenDisplay ()
from /usr/lib/x86_64-linux-gnu/libX11.so.6
#10 0x000055555555639d in ?? ()
#11 0x00007ffff6fedb97 in __libc_start_main (main=0x555555555930, argc=1,
argv=0x7fffffffe778, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fffffffe768)
at ../csu/libc-start.c:310
#12 0x000055555555641a in ?? ()
So this time it seems to crash in Xlib/xcb while reading /casa/host/home/.Xauthority.
Strangely, if I use another X program (like gedit), it works without a problem.
And if I use the host libs as above:
denis@averell $ LD_LIBRARY_PATH=/host/usr/lib/nvidia-384:/casa/host/build/lib:/casa/host/lib:/.singularity.d/libs:/usr/local/lib glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
then it works! So in all cases, singularity doesn't mount all the needed libraries.
This is crazy... it must have something to do with the image, because my older casa-dev image used to work perfectly, and it stopped working as soon as I used pull_image. Too bad, I cannot get back to the old image because it was overwritten. Also, I do not see anything suspicious in the recent history of the image-building scripts.
> This is crazy...
Indeed! I can isolate the problem further (still in the container causing segfaults):
mkdir /tmp/libs
ln -s /host/usr/lib/nvidia-384/tls /tmp/libs
LD_LIBRARY_PATH=/tmp/libs:/casa/host/build/lib:/casa/host/lib:/.singularity.d/libs:/usr/local/lib glxgears
works. Thus it is the tls directory being in LD_LIBRARY_PATH that leads to the different behaviour. This directory contains a libnvidia-tls.so.384.130 which is different from the one in its parent directory (with the exact same filename):
ls -als /host/usr/lib/nvidia-384/libnvidia-tls.so.384.130 /host/usr/lib/nvidia-384/tls/libnvidia-tls.so.384.130
16 -rw-r--r-- 1 root root 13080 Mar 21 2018 /host/usr/lib/nvidia-384/libnvidia-tls.so.384.130
16 -rw-r--r-- 1 root root 14480 Mar 21 2018 /host/usr/lib/nvidia-384/tls/libnvidia-tls.so.384.130
The first one is identical to the one mounted in /.singularity.d/libs.
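(For the record, one way to confirm which of the two same-named files actually got mounted is to compare hashes rather than sizes; a throw-away sketch, using the paths from this machine:)

```python
# Throw-away check: hash the two host copies of libnvidia-tls and the one
# mounted in the container, to see which one singularity picked.
import hashlib


def md5(path):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()


for p in ('/host/usr/lib/nvidia-384/libnvidia-tls.so.384.130',
          '/host/usr/lib/nvidia-384/tls/libnvidia-tls.so.384.130',
          '/.singularity.d/libs/libnvidia-tls.so.384.130'):
    print(md5(p), p)
```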
> Too bad, I cannot get back to the old image
If you like, I have one dating from Oct 1 on my machine at Neurospin (is234199), in /volatile/riviere/casa_distro/casa-dev-ubuntu-18.04.sif
> If you like, I have one dating from Oct 1 on my machine at Neurospin (is234199), in /volatile/riviere/casa_distro/casa-dev-ubuntu-18.04.sif
Thanks, but that one also has the same issue:
$ casa_distro run opengl=nv image=casa-dev-OCT1-ubuntu-18.04.sif glxgears
*** stack smashing detected ***: <unknown> terminated
Aborted
I don't have an older one. But then, does it really depend on the image?
> But then, does it really depend on the image?
It has to: nothing has changed in my setup besides running pull_image, and the *** stack smashing detected *** errors started occurring right at that moment. I even checked in the dpkg logs that no NVidia driver update happened in the background.
I hope that this is not a problem in the reproducibility of image builds...
At home (Ubuntu 18.04 + driver 390.138), same story:
riviere@gargamel $ ls /.singularity.d/libs/
libEGL_nvidia.so.0 libnvidia-fatbinaryloader.so.390.138
libGLESv1_CM_nvidia.so.1 libnvidia-fbc.so
libGLESv2_nvidia.so.2 libnvidia-fbc.so.1
libGLX_nvidia.so.0 libnvidia-glcore.so.390.138
libcuda.so libnvidia-glsi.so.390.138
libcuda.so.1 libnvidia-ifr.so
libnvcuvid.so libnvidia-ifr.so.1
libnvcuvid.so.1 libnvidia-ml.so
libnvidia-cfg.so libnvidia-ml.so.1
libnvidia-cfg.so.1 libnvidia-opencl.so.1
libnvidia-compiler.so.390.138 libnvidia-ptxjitcompiler.so
libnvidia-eglcore.so.390.138 libnvidia-ptxjitcompiler.so.1
libnvidia-encode.so libnvidia-tls.so.390.138
libnvidia-encode.so.1
riviere@gargamel $ glxgears
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
riviere@gargamel $ LD_LIBRARY_PATH=/host/usr/lib/x86_64-linux-gnu/tls:"$LD_LIBRARY_PATH" glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
302 frames in 5.0 seconds = 60.305 FPS
So it really looks like nvidia-container-cli misses a library, and it's a matter of making the tls directory or its content (libnvidia-tls.so.390.138) available in the container's LD_LIBRARY_PATH.
It's not nvidia-container-cli, it's singularity ;)
gargamel:riviere% nvidia-container-cli list --libraries
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.390.138
/usr/lib/x86_64-linux-gnu/libcuda.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.390.138
/usr/lib/x86_64-linux-gnu/libnvcuvid.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.390.138
/usr/lib/x86_64-linux-gnu/tls/libnvidia-tls.so.390.138 # <--- in tls/
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.390.138
/usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.390.138
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.390.138
/usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.390.138
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.390.138
gargamel:riviere% ll /usr/lib/x86_64-linux-gnu/tls/libnvidia-tls.so.390.138
16 -rw-r--r-- 1 root root 14480 mai 14 13:01 /usr/lib/x86_64-linux-gnu/tls/libnvidia-tls.so.390.138
in the container:
riviere@gargamel $ ll /.singularity.d/libs/libnvidia-tls.so.390.138
16 -rw-r--r-- 1 root root 13080 May 14 13:01 /.singularity.d/libs/libnvidia-tls.so.390.138
(not the same file: it is the one from the parent dir)
This last commit is a workaround for the problem: it mounts the tls lib directory and inserts it first in LD_LIBRARY_PATH. It seems to fix the problem on my machines. Tell me if you still have crashes.
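A sketch of the kind of logic such a workaround might implement (not the actual casa-distro commit; the function name and mount point are made up; it assumes the SINGULARITYENV_* mechanism visible in the logs above can be used to set LD_LIBRARY_PATH in the container):

```python
# Sketch only, not the actual casa-distro commit: locate the host 'tls'
# variant of libnvidia-tls, bind it into the container and prepend its
# directory to the container LD_LIBRARY_PATH.
import glob
import os


def add_nvidia_tls_workaround(singularity_cmd, env):
    candidates = (glob.glob('/usr/lib/*/tls/libnvidia-tls.so.*')
                  + glob.glob('/usr/lib/nvidia-*/tls/libnvidia-tls.so.*'))
    if not candidates:
        return  # no proprietary nvidia tls lib on the host: nothing to do
    tls_dir = os.path.dirname(candidates[0])
    mount_point = '/usr/local/nvidia-tls'  # arbitrary, made-up mount point
    singularity_cmd += ['--bind', '%s:%s' % (tls_dir, mount_point)]
    old = env.get('SINGULARITYENV_LD_LIBRARY_PATH', '')
    env['SINGULARITYENV_LD_LIBRARY_PATH'] = \
        mount_point + (':' + old if old else '')
```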
Your workaround works for me, so I will close the issue. Thanks @denisri!
The reason why it stopped working for me right when I updated my image will remain a mystery...