Also anatomist, and even AimsFileInfo, exhibit the same behaviour on my workstation:
(venv) ➜ ~ % casa_distro run verbose=1 AimsFileInfo
----------------------------------------
Running singularity with the following command:
'singularity' 'run' '--cleanenv' '--pwd' '/casa/host/home' '--bind' '/mnt:/mnt' '--bind' '/volatile2:/volatile2' '--bind' '/media:/media' '--bind' '/volatile:/volatile' '--bind' '/volatile/bv/casa-distro-3-repo/brainvisa-dev-ubuntu-18.04/host/home:/casa/home' '--bind' '/volatile/bv/casa-distro-3-repo/brainvisa-dev-ubuntu-18.04/host:/casa/host' '--bind' '/neurospin:/neurospin' '--bind' '/srv:/srv' '--bind' '/i2bm:/i2bm' '--bind' '/home/yl243478' '--home' '/casa/host/home' '--nv' '--env' 'PS1=\[\033[33m\]\u@\h \$\[\033[0m\] ' '/volatile/bv/casa-distro-3-repo/casa-dev-ubuntu-18.04.sif' 'AimsFileInfo'
Using the following environment:
[...]
SINGULARITYENV_CASA_BRANCH=bug_fix
SINGULARITYENV_CASA_DISTRO=brainvisa-dev-ubuntu-18.04
SINGULARITYENV_CASA_HOST_DIR=/volatile/bv/casa-distro-3-repo/brainvisa-dev-ubuntu-18.04
SINGULARITYENV_CASA_SYSTEM=ubuntu-18.04
SINGULARITYENV_DISPLAY=:0
SINGULARITYENV_XAUTHORITY=/casa/host/home/.Xauthority
[...]
----------------------------------------
AimsFileInfo: value missing for option "-i"
(venv) (1)➜ ~ % casa_distro run verbose=1 AimsFileInfo
----------------------------------------
Running singularity with the following command:
'singularity' 'run' '--cleanenv' '--pwd' '/casa/host/home' '--bind' '/i2bm:/i2bm' '--bind' '/volatile/bv/casa-distro-3-repo/brainvisa-dev-ubuntu-18.04/host/home:/casa/home' '--bind' '/mnt:/mnt' '--bind' '/media:/media' '--bind' '/srv:/srv' '--bind' '/volatile/bv/casa-distro-3-repo/brainvisa-dev-ubuntu-18.04/host:/casa/host' '--bind' '/neurospin:/neurospin' '--bind' '/volatile2:/volatile2' '--bind' '/volatile:/volatile' '--bind' '/home/yl243478' '--home' '/casa/host/home' '--nv' '--env' 'PS1=\[\033[33m\]\u@\h \$\[\033[0m\] ' '/volatile/bv/casa-distro-3-repo/casa-dev-ubuntu-18.04.sif' 'AimsFileInfo'
Using the following environment:
[...]
SINGULARITYENV_CASA_BRANCH=bug_fix
SINGULARITYENV_CASA_DISTRO=brainvisa-dev-ubuntu-18.04
SINGULARITYENV_CASA_HOST_DIR=/volatile/bv/casa-distro-3-repo/brainvisa-dev-ubuntu-18.04
SINGULARITYENV_CASA_SYSTEM=ubuntu-18.04
SINGULARITYENV_DISPLAY=:0
SINGULARITYENV_XAUTHORITY=/casa/host/home/.Xauthority
[...]
----------------------------------------
Segmentation fault
I don't get this behaviour (but I have not updated my build yet). Could you please run:
casa_distro run verbose=1 gdb AimsFileInfo
then run the programs in gdb to get a traceback?
casa_distro run verbose=1 gdb AimsFileInfo
(gdb) bt
#0 0x00007ffff337e76e in pthread_mutex_init (mutex=0x5555558361f0, mutexattr=0x0) at forward.c:188
#1 0x00007fffe40176d2 in QWaitCondition::QWaitCondition() () at /usr/lib/x86_64-linux-gnu/libQt5Core.so.5
#2 0x00007fffe3082820 in () at /usr/lib/x86_64-linux-gnu/libQt5Gui.so.5
#3 0x00007ffff7de5783 in call_init (env=0x7fffffffe768, argv=0x7fffffffe758, argc=1, l=<optimized out>) at dl-init.c:72
#4 0x00007ffff7de5783 in _dl_init (main_map=main_map@entry=0x55555582e4c0, argc=1, argv=0x7fffffffe758, env=0x7fffffffe768)
at dl-init.c:119
#5 0x00007ffff7dea24f in dl_open_worker (a=a@entry=0x7fffffffdc90) at dl-open.c:522
#6 0x00007ffff33b551f in __GI__dl_catch_exception (exception=0x7fffffffdc70, operate=0x7ffff7de9e10 <dl_open_worker>, args=0x7fffffffdc90) at dl-error-skeleton.c:196
#7 0x00007ffff7de981a in _dl_open (file=0x5555557c5fa0 "libaimsqsqlgraphformat.so.4.6.2", mode=-2147483390, caller_dlopen=0x7ffff5079750 <carto::PluginLoader::loadPluginFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+2896>, nsid=<optimized out>, argc=1, argv=<optimized out>, env=0x7fffffffe768) at dl-open.c:605
#8 0x00007ffff0843f96 in dlopen_doit (a=a@entry=0x7fffffffdec0) at dlopen.c:66
#9 0x00007ffff33b551f in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffde60, operate=0x7ffff0843f40 <dlopen_doit>, args=0x7fffffffdec0) at dl-error-skeleton.c:196
#10 0x00007ffff33b55af in __GI__dl_catch_error (objname=0x5555557c6040, errstring=0x5555557c6048, mallocedp=0x5555557c6038, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:215
#11 0x00007ffff0844745 in _dlerror_run (operate=operate@entry=0x7ffff0843f40 <dlopen_doit>, args=args@entry=0x7fffffffdec0)
at dlerror.c:162
#12 0x00007ffff0844051 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#13 0x00007ffff5079750 in carto::PluginLoader::loadPluginFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) ()
at /casa/host/build/lib/libsoma-io.so.4.6.2
#14 0x00007ffff5079f67 in carto::PluginLoader::load(int, bool) () at /casa/host/build/lib/libsoma-io.so.4.6.2
#15 0x00007ffff504bf5b in carto::CartoApplication::initialize() () at /casa/host/build/lib/libsoma-io.so.4.6.2
#16 0x00007ffff6c209af in aims::AimsApplication::initialize() () at /casa/host/build/lib/libaimsdata.so.4.6.2
#17 0x0000555555560179 in main ()
So it crashes in Qt/pthread while loading the IO module libaimsqsqlgraphformat. This module is actually linked against Qt, itself linked against OpenGL - I'm not at all sure this has anything to do with the problem here, but OpenGL is a "usual suspect" when things are not working well ;)
But here the problem rather occurs in threads, so I don't know what to think about that...
I can confirm that the issue occurs only with opengl=nv (or auto). It does not occur with opengl=container or opengl=software.
I have version 384.130 of the nvidia driver, on Ubuntu 16.04.7.
Also, I am using Python 3.5.2 to run casa-distro, which means that dictionary iteration order is not deterministic between runs. This might be the cause of the randomness.
Alas, I get the same error with Python 2. Indeed, the same environment, run with the exact same singularity command line, randomly succeeds or fails.
However, I noticed some interesting behaviour: in a casa-distro shell where the segfault occurs, it will occur consistently every time.
yl243478@is234203 $ AimsFileInfo
Segmentation fault
yl243478@is234203 $ AimsFileInfo
Segmentation fault
yl243478@is234203 $ AimsFileInfo
Segmentation fault
yl243478@is234203 $ AimsFileInfo
Segmentation fault
yl243478@is234203 $ AimsFileInfo
Segmentation fault
In a shell where it does not occur, it will never occur.
yl243478@is234203 $ AimsFileInfo
AimsFileInfo: value missing for option "-i"
yl243478@is234203 $ AimsFileInfo
AimsFileInfo: value missing for option "-i"
yl243478@is234203 $ AimsFileInfo
AimsFileInfo: value missing for option "-i"
yl243478@is234203 $ AimsFileInfo
AimsFileInfo: value missing for option "-i"
yl243478@is234203 $ AimsFileInfo
AimsFileInfo: value missing for option "-i"
I have a similar situation at home for OpenGL commands. Inside a given casa-distro shell, things are consistent (all work or all segfault), but from one run of casa_distro to another, it's completely random. Is it a problem in Singularity? In our setup or images? In the nvidia driver? How can we tell? This erratic behaviour is a serious problem: if we release a new brainvisa version in this situation, many users will complain (or drop the software) and get a very bad opinion of it.
Can you remind us which versions of Ubuntu and of the nvidia driver you have on your machine?
It's Ubuntu 18.04 with driver 390.138, I think (not totally sure which one is actually loaded now, I am using a remote connection to it).
I found a way to fix the issue on my machine: by installing nvidia-container-cli (the libnvidia-container-tools package) as described on https://nvidia.github.io/libnvidia-container/.
It seems that Singularity is able to use that tool from NVidia (since Singularity 2.6, see the release notes) in order to better configure the GPU in the container. I have no idea what it actually does.
So, I will change the behaviour of opengl=auto to only activate --nv if nvidia-container-cli is present.
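For reference, here is a minimal sketch (in Python, not the actual casa-distro code; the function and parameter names are made up) of what such an opengl=auto decision could look like:

```python
# Hypothetical sketch of the opengl=auto decision; not the real casa-distro code.
import shutil


def want_nv_option(opengl_mode):
    """Return True if '--nv' should be passed to singularity."""
    if opengl_mode == 'nv':
        return True   # explicit request: always honour it
    if opengl_mode in ('container', 'software'):
        return False  # rendering is handled inside the container
    # opengl_mode == 'auto': enable --nv only if nvidia-container-cli is found,
    # since singularity relies on it to pick the right host libraries.
    return shutil.which('nvidia-container-cli') is not None
```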
How did you do that? I'm following the instructions on https://nvidia.github.io/libnvidia-container/, I have added the repository, but it seems empty - no nvidia-container-tools package. When I go to the URL of the repository, https://nvidia.github.io/libnvidia-container/stable/ubuntu16.04/amd64/, with a web browser, I just see the message:
# Unsupported distribution! # Check https://nvidia.github.io/libnvidia-container
I'm on an Ubuntu 16.04 laptop, which is actually listed as supported...
No more luck on an Ubuntu 18.04 machine...
@denisri My bad, the package is actually named libnvidia-container-tools. Sorry.
Oh, thanks. Search tools for apt are so poor that I couldn't find it on my own... I had to install an older version on my laptop, since the latest ones are not compatible with older drivers (https://github.com/NVIDIA/nvidia-docker/issues/1280). Do we just have to install the package, and that's all? Or do we need to use it manually to configure anything? In other words, have you understood a little bit what this tool does and how it is used (automatically?) by singularity? Could singularity, by any chance, use it from inside the container if we install it in the container (that would actually be great)?
I have just installed the package and it magically works. I have not yet looked at how the tool works at all. I just noticed that it fixes another failure scenario: when you use --nv on an X server that cannot access NVidia hardware (Xvnc or x2go), it stops some of the NVidia libs from being loaded, which allows the software to work (with mesa) instead of crashing consistently.
Well, after installing libnvidia-container-tools version 1.0.7 (which seems to work), singularity couldn't run glxgears at all using --nv (X errors). After searching a bit (https://github.com/hpcng/singularity/pull/1681/commits/fa7162ca91df6cfbebeed54aa8bde9958169b765) it seems that nvidia-container-cli is called with the options list -cguv, which do not exist (probably did not in this version), and this makes the whole thing go wrong (in the container, in /.singularity.d/libs I see some nvidia and cuda libs, but no libGL).
So this solution only works for recent versions of libnvidia-container-tools, which themselves work only with recent nvidia drivers (at least until they release a newer version of the tool fixing the driver version problem). Thus we may have to test the tool itself before using --nv for the option opengl=auto in casa_distro.
The options -cguv don't exist in recent nvidia-container-cli either, so the code above must be outdated. Anyway, with older versions of the tool, singularity doesn't work correctly. Installing a newer version (which does not work on my system) doesn't seem to do any harm in singularity (but is not helpful either).
It seems that current Singularity only calls nvidia-container-cli in two ways, in order to retrieve the list of files that it will mount into the container:
nvidia-container-cli list --binaries --libraries
nvidia-container-cli list --ipcs
https://github.com/hpcng/singularity/blob/v3.6.3/pkg/util/gpu/paths.go#L91-L93 https://github.com/hpcng/singularity/blob/v3.6.3/pkg/util/gpu/paths.go#L230
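For debugging, the same two queries can be run by hand and their output compared with the content of /.singularity.d/libs. A small sketch (assuming only that nvidia-container-cli is in the PATH):

```python
# Sketch: reproduce the two nvidia-container-cli queries used by Singularity,
# so the reported files can be compared with what ends up in /.singularity.d/libs.
import subprocess


def nvidia_container_files():
    files = []
    for args in (['list', '--binaries', '--libraries'], ['list', '--ipcs']):
        out = subprocess.check_output(['nvidia-container-cli'] + args,
                                      universal_newlines=True)
        files.extend(line for line in out.splitlines() if line.strip())
    return files


if __name__ == '__main__':
    print('\n'.join(nvidia_container_files()))
```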
If there is an incompatibility between certain versions of Singularity and nvidia-container-cli, this is clearly a bug in Singularity. Maybe we should document it... or auto-detect it, if there is an easy way.
OK, the bug is in nvidia-container-cli (1.0.7) then: on an Ubuntu 16.04 host, it lists some nvidia libraries, but not libGL, which in this case doesn't get mounted in the container. This may be OK on recent distributions / drivers (a single libGL on the system, using libgldispatch to switch to an implementation), but it was not implemented this way on Ubuntu 16 + old drivers (not sure the driver version matters, since the system libGL doesn't use libgldispatch yet).
So my bet is that nvidia-container-cli just did not work on Ubuntu 16.04 (do recent versions work with newer drivers, if there are newer drivers for Ubuntu 16?)
Testing it is a bit tricky: we should check whether nvidia-container-cli list --libraries contains libGL or not, and if not, check whether the system libGL links against libgldispatch. If neither is the case, it will presumably not work. But then we end up redoing part of nvidia-container-cli list's work...
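A rough sketch of that test (assumptions: ldd is available, the example libGL path is a guess, and this code does not exist in casa-distro):

```python
# Sketch of the test described above: --nv is presumably usable if either
# nvidia-container-cli reports a GL library, or the system libGL goes through
# libGLdispatch, in which case libGLX_nvidia from the list should be enough.
import subprocess


def cli_lists_gl():
    out = subprocess.check_output(
        ['nvidia-container-cli', 'list', '--libraries'],
        universal_newlines=True)
    return any('libGL.so' in line or 'libGLX' in line
               for line in out.splitlines())


def system_libgl_uses_gldispatch(libgl='/usr/lib/x86_64-linux-gnu/libGL.so.1'):
    # example path only; a real test would locate libGL via ldconfig
    ldd = subprocess.check_output(['ldd', libgl], universal_newlines=True)
    return 'libGLdispatch' in ldd


def nv_likely_usable():
    return cli_lists_gl() or system_libgl_uses_gldispatch()
```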
> So my bet is that nvidia-container-cli just did not work on Ubuntu 16.04 (do recent versions work with newer drivers, if there are newer drivers for Ubuntu 16?)
nvidia-container-cli version 1.3.0 works perfectly on my Ubuntu 16.04 workstation, which is using version 384.130 of the NVidia driver.
Can't you upgrade your NVidia driver, in order to use a recent nvidia-container-cli? I think that if the NVidia software/drivers are broken, we just cannot support them. If people need an old driver to support old hardware, then software rendering is the only option.
> it lists some nvidia libraries, but not libGL, which in this case doesn't get mounted in the container.
Oh, I just noticed that it does not include libGL either on my workstation:
% nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-modeset
/dev/nvidia0
/usr/lib/nvidia-384/bin/nvidia-smi
/usr/lib/nvidia-384/bin/nvidia-debugdump
/usr/lib/nvidia-384/bin/nvidia-persistenced
/usr/lib/nvidia-384/bin/nvidia-cuda-mps-control
/usr/lib/nvidia-384/bin/nvidia-cuda-mps-server
/usr/lib/nvidia-384/libnvidia-ml.so.384.130
/usr/lib/nvidia-384/libnvidia-cfg.so.384.130
/usr/lib/x86_64-linux-gnu/libcuda.so.384.130
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.384.130
/usr/lib/nvidia-384/libnvidia-ptxjitcompiler.so.384.130
/usr/lib/nvidia-384/libnvidia-fatbinaryloader.so.384.130
/usr/lib/nvidia-384/libnvidia-compiler.so.384.130
/usr/lib/nvidia-384/vdpau/libvdpau_nvidia.so.384.130
/usr/lib/nvidia-384/libnvidia-encode.so.384.130
/usr/lib/nvidia-384/libnvcuvid.so.384.130
/usr/lib/nvidia-384/libnvidia-eglcore.so.384.130
/usr/lib/nvidia-384/libnvidia-glcore.so.384.130
/usr/lib/nvidia-384/tls/libnvidia-tls.so.384.130
/usr/lib/nvidia-384/libnvidia-glsi.so.384.130
/usr/lib/nvidia-384/libnvidia-fbc.so.384.130
/usr/lib/nvidia-384/libnvidia-ifr.so.384.130
/usr/lib/nvidia-384/libGLX_nvidia.so.384.130
/usr/lib/nvidia-384/libEGL_nvidia.so.384.130
/usr/lib/nvidia-384/libGLESv2_nvidia.so.384.130
/usr/lib/nvidia-384/libGLESv1_CM_nvidia.so.384.130
/usr/lib32/nvidia-384/libnvidia-ml.so.384.130
/usr/lib32/nvidia-384/libnvidia-cfg.so.384.130
/usr/lib/i386-linux-gnu/libcuda.so.384.130
/usr/lib/i386-linux-gnu/libnvidia-opencl.so.384.130
/usr/lib32/nvidia-384/libnvidia-ptxjitcompiler.so.384.130
/usr/lib32/nvidia-384/libnvidia-fatbinaryloader.so.384.130
/usr/lib32/nvidia-384/libnvidia-compiler.so.384.130
/usr/lib32/nvidia-384/vdpau/libvdpau_nvidia.so.384.130
/usr/lib32/nvidia-384/libnvidia-encode.so.384.130
/usr/lib32/nvidia-384/libnvcuvid.so.384.130
/usr/lib32/nvidia-384/libnvidia-eglcore.so.384.130
/usr/lib32/nvidia-384/libnvidia-glcore.so.384.130
/usr/lib32/nvidia-384/tls/libnvidia-tls.so.384.130
/usr/lib32/nvidia-384/libnvidia-glsi.so.384.130
/usr/lib32/nvidia-384/libnvidia-fbc.so.384.130
/usr/lib32/nvidia-384/libnvidia-ifr.so.384.130
/usr/lib32/nvidia-384/libGLX_nvidia.so.384.130
/usr/lib32/nvidia-384/libEGL_nvidia.so.384.130
/usr/lib32/nvidia-384/libGLESv2_nvidia.so.384.130
/usr/lib32/nvidia-384/libGLESv1_CM_nvidia.so.384.130
/run/nvidia-persistenced/socket
Not sure why the driver hasn't been updated. I think some newer driver versions drop support for some (older) hardware and thus cannot be installed, but I don't know if that is my situation here. I'll check.
So you don't have a libGL in /.singularity.d/libs? And it works? How does it manage that?
Anyway, you have many more libs here than I have on my system...
After upgrading the driver to 384.130, nvidia-container-cli starts working (this is the good news).
The bad news is that, now, programs sometimes crash (whereas they never did with the older driver 340):
denis@averell $ ls /.singularity.d/libs
libEGL_nvidia.so.0 libnvidia-encode.so.1
libGLESv1_CM_nvidia.so.1 libnvidia-fatbinaryloader.so.384.130
libGLESv2_nvidia.so.2 libnvidia-fbc.so
libGLX_nvidia.so.0 libnvidia-fbc.so.1
libcuda.so libnvidia-glcore.so.384.130
libcuda.so.1 libnvidia-glsi.so.384.130
libnvcuvid.so libnvidia-ifr.so
libnvcuvid.so.1 libnvidia-ifr.so.1
libnvidia-cfg.so libnvidia-ml.so
libnvidia-cfg.so.1 libnvidia-ml.so.1
libnvidia-compiler.so libnvidia-opencl.so.1
libnvidia-compiler.so.384.130 libnvidia-ptxjitcompiler.so.1
libnvidia-eglcore.so.384.130 libnvidia-tls.so.384.130
libnvidia-encode.so libvdpau_nvidia.so
denis@averell $ glxgears
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
What method are you using to access your desktop remotely? I previously found that I got different behaviour when using x11vnc with a physical X server, turbovnc, or x2go.
However, since I installed nvidia-container-cli, OpenGL works consistently with --nv under these 3 setups, because it is able to fall back to software rendering (using the APT-installed mesa in the container) when I am not using a physical X server.
If we have different behaviour, we should find out what differs between our setups. One obvious difference is that you are on Ubuntu 18.04 and I am on Ubuntu 16.04. We may not have exactly the same version of the casa-dev image; I am running a pull_image right now and I will test again.
Edit: @denisri I just ran pull_image and now I have the same behaviour as you (random failures). Too bad I did not keep the old image to run a diff... I will investigate what changed in the images recently.
Here I'm speaking of my laptop, locally, which is running Ubuntu 16.04 (I have another machine running Ubuntu 18.04, which I currently access remotely, so you're right, I was not clear).
So now nvidia-container-cli seems to be working on Ubuntu 16.04 + driver 384, and I now get random crashes. Before I upgraded the driver (I was using 340) I didn't experience any instability on this laptop (without needing nvidia-container-cli). So for me here, nvidia-container-cli doesn't really seem to help...
At home on the Ubuntu 18.04 machine I did experience crashes, but I had not installed nvidia-container-cli yet. I can't really test that remotely on this machine (well, I could perhaps use x11vnc or another remote desktop system, but I haven't for now).
Now, could there be a link with the image?
When I backtrace a crashing program (glxgears), I get:
denis@averell $ gdb glxgears
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from glxgears...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/bin/glxgears
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
*** stack smashing detected ***: <unknown> terminated
Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff70c28b1 in __GI_abort () at abort.c:79
#2 0x00007ffff710b907 in __libc_message (action=action@entry=do_abort,
fmt=fmt@entry=0x7ffff7238be8 "*** %s ***: %s terminated\n")
at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff71b6e81 in __GI___fortify_fail_abort (
need_backtrace=need_backtrace@entry=false,
msg=msg@entry=0x7ffff7238bc6 "stack smashing detected")
at fortify_fail.c:33
#4 0x00007ffff71b6e42 in __stack_chk_fail () at stack_chk_fail.c:29
#5 0x00007ffff71e957a in __GI__dl_catch_exception (exception=0x7fffffffdf60,
operate=0x7ffff7de9e10 <dl_open_worker>, args=0x7fffffffdf80)
at dl-error-skeleton.c:207
#6 0x00007ffff7de981a in _dl_open (file=0x55555576b0a0 "libGLX_nvidia.so.0",
mode=-2147483647, caller_dlopen=0x7ffff6e59606, nsid=<optimized out>,
argc=1, argv=<optimized out>, env=0x7fffffffe788) at dl-open.c:605
#7 0x00007ffff6550f96 in dlopen_doit (a=a@entry=0x7fffffffe1b0) at dlopen.c:66
#8 0x00007ffff71e951f in __GI__dl_catch_exception (
exception=exception@entry=0x7fffffffe150,
operate=0x7ffff6550f40 <dlopen_doit>, args=0x7fffffffe1b0)
at dl-error-skeleton.c:196
#9 0x00007ffff71e95af in __GI__dl_catch_error (objname=0x55555575a270,
errstring=0x55555575a278, mallocedp=0x55555575a268,
operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:215
#10 0x00007ffff6551745 in _dlerror_run (
operate=operate@entry=0x7ffff6550f40 <dlopen_doit>,
args=args@entry=0x7fffffffe1b0) at dlerror.c:162
#11 0x00007ffff6551051 in __dlopen (file=<optimized out>, mode=<optimized out>)
at dlopen.c:87
#12 0x00007ffff6e59606 in ?? () from /usr/lib/x86_64-linux-gnu/libGLX.so.0
#13 0x00007ffff6e5a958 in ?? () from /usr/lib/x86_64-linux-gnu/libGLX.so.0
#14 0x00007ffff6e54231 in glXChooseVisual ()
from /usr/lib/x86_64-linux-gnu/libGLX.so.0
#15 0x000055555555758b in ?? ()
#16 0x0000555555555a87 in ?? ()
#17 0x00007ffff70a3b97 in __libc_start_main (main=0x555555555930, argc=1,
argv=0x7fffffffe778, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fffffffe768)
at ../csu/libc-start.c:310
#18 0x000055555555641a in ?? ()
Thus it crashes inside /usr/lib/x86_64-linux-gnu/libGLX.so.0, the system libGLX, whereas on the host machine there is an nvidia-specific libGLX in /usr/lib/nvidia-384/libGLX.so.0, which is not mounted into the container's /.singularity.d/libs/. Maybe this is the missing item.
Bingo! When I mount the host filesystem into the container at /host, in a situation where glxgears crashes:
denis@averell $ glxgears
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
Then I do:
denis@averell $ LD_LIBRARY_PATH=/host/usr/lib/nvidia-384:/casa/host/build/lib:/casa/host/lib:/.singularity.d/libs:/usr/local/lib glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
and it works (see, I have prepended /host/usr/lib/nvidia-384, the host driver libs directory, to LD_LIBRARY_PATH).
So obviously, nvidia-container-cli is not completely doing its job.
If I remove nvidia-container-cli from my system (still on Ubuntu 16.04), I still get random behaviour, but a different one:
denis@averell $ glxgears
Segmentation fault (core dumped)
(no "stack smashing" something). with the following mounted libs:
denis@averell $ ls /.singularity.d/libs
libEGL.so libnvidia-cfg.so.1
libEGL.so.1 libnvidia-compiler.so
libEGL_nvidia.so.0 libnvidia-compiler.so.384.130
libGL.so libnvidia-egl-wayland.so.1.0.1
libGL.so.1 libnvidia-eglcore.so.384.130
libGLESv1_CM.so libnvidia-encode.so
libGLESv1_CM.so.1 libnvidia-encode.so.1
libGLESv1_CM_nvidia.so.1 libnvidia-fatbinaryloader.so.384.130
libGLESv2.so libnvidia-fbc.so
libGLESv2.so.2 libnvidia-fbc.so.1
libGLESv2_nvidia.so.2 libnvidia-glcore.so.384.130
libGLX.so libnvidia-glsi.so.384.130
libGLX.so.0 libnvidia-gtk2.so.361.42
libGLX_nvidia.so.0 libnvidia-gtk3.so.361.42
libGLdispatch.so.0 libnvidia-ifr.so
libOpenCL.so.1 libnvidia-ifr.so.1
libOpenGL.so libnvidia-ml.so
libOpenGL.so.0 libnvidia-ml.so.1
libcuda.so libnvidia-opencl.so.1
libcuda.so.1 libnvidia-ptxjitcompiler.so.1
libnvcuvid.so libnvidia-tls.so.384.130
libnvcuvid.so.1 libnvidia-wfb.so.1
libnvidia-cfg.so libvdpau_nvidia.so
and gdb says:
denis@averell $ gdb glxgears
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from glxgears...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/bin/glxgears
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7059975 in __GI__IO_link_in (fp=fp@entry=0x55555575bdc0)
at genops.c:92
92 genops.c: No such file or directory.
(gdb) bt
#0 0x00007ffff7059975 in __GI__IO_link_in (fp=fp@entry=0x55555575bdc0)
at genops.c:92
#1 0x00007ffff70581a2 in _IO_new_file_init_internal (
fp=fp@entry=0x55555575bdc0) at fileops.c:114
#2 0x00007ffff704af07 in __fopen_internal (is32=1, mode=0x7ffff46ccacb "rb",
filename=0x7fffffffee50 "/casa/host/home/.Xauthority") at iofopen.c:74
#3 _IO_new_fopen (filename=0x7fffffffee50 "/casa/host/home/.Xauthority",
mode=0x7ffff46ccacb "rb") at iofopen.c:89
#4 0x00007ffff46cc1e9 in XauGetBestAuthByAddr ()
from /usr/lib/x86_64-linux-gnu/libXau.so.6
#5 0x00007ffff48de76f in ?? () from /usr/lib/x86_64-linux-gnu/libxcb.so.1
#6 0x00007ffff48de909 in ?? () from /usr/lib/x86_64-linux-gnu/libxcb.so.1
#7 0x00007ffff48de453 in xcb_connect_to_display_with_auth_info ()
from /usr/lib/x86_64-linux-gnu/libxcb.so.1
#8 0x00007ffff73fa522 in _XConnectXCB ()
from /usr/lib/x86_64-linux-gnu/libX11.so.6
#9 0x00007ffff73eaeb2 in XOpenDisplay ()
from /usr/lib/x86_64-linux-gnu/libX11.so.6
#10 0x000055555555639d in ?? ()
#11 0x00007ffff6fedb97 in __libc_start_main (main=0x555555555930, argc=1,
argv=0x7fffffffe778, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fffffffe768)
at ../csu/libc-start.c:310
#12 0x000055555555641a in ?? ()
So this time it seems to crash in Xlib/xcb while reading /casa/host/home/.Xauthority.
Strangely, if I use another X program (like gedit), it works without a problem.
And if I use the host libs as above:
denis@averell $ LD_LIBRARY_PATH=/host/usr/lib/nvidia-384:/casa/host/build/lib:/casa/host/lib:/.singularity.d/libs:/usr/local/lib glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
then it works! So in all cases, singularity doesn't mount all the needed libraries.
This is crazy... it must have something to do with the image, because my older casa-dev image used to work perfectly, and it stopped working as soon as I used pull_image. Too bad, I cannot get back to the old image because it was overwritten. Also, I do not see anything suspicious in the recent history of the image-building scripts.
> This is crazy...
Indeed! I can isolate the problem further (still in the container causing segfaults):
mkdir /tmp/libs
ln -s /host/usr/lib/nvidia-384/tls /tmp/libs
LD_LIBRARY_PATH=/tmp/libs:/casa/host/build/lib:/casa/host/lib:/.singularity.d/libs:/usr/local/lib glxgears
works. Thus it is the tls directory being in LD_LIBRARY_PATH that leads to the different behaviour. This directory contains a libnvidia-tls.so.384.130 which is different from the one in its parent directory (with the exact same filename):
ls -als /host/usr/lib/nvidia-384/libnvidia-tls.so.384.130 /host/usr/lib/nvidia-384/tls/libnvidia-tls.so.384.130
16 -rw-r--r-- 1 root root 13080 Mar 21 2018 /host/usr/lib/nvidia-384/libnvidia-tls.so.384.130
16 -rw-r--r-- 1 root root 14480 Mar 21 2018 /host/usr/lib/nvidia-384/tls/libnvidia-tls.so.384.130
The first one is identical to the one mounted in /.singularity.d/libs.
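(For the record, one way to confirm which of the two same-named files actually got mounted is to compare hashes rather than sizes; a throw-away sketch, using the paths from this machine:)

```python
# Throw-away check: hash the two host copies of libnvidia-tls and the one
# mounted in the container, to see which one singularity picked.
import hashlib


def md5(path):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()


for p in ('/host/usr/lib/nvidia-384/libnvidia-tls.so.384.130',
          '/host/usr/lib/nvidia-384/tls/libnvidia-tls.so.384.130',
          '/.singularity.d/libs/libnvidia-tls.so.384.130'):
    print(md5(p), p)
```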
> Too bad, I cannot get back to the old image
If you like, I have one dating from Oct 1 on my machine at Neurospin (is234199), in /volatile/riviere/casa_distro/casa-dev-ubuntu-18.04.sif
> If you like, I have one dating from Oct 1 on my machine at Neurospin (is234199), in /volatile/riviere/casa_distro/casa-dev-ubuntu-18.04.sif
Thanks, but that one also has the same issue:
$ casa_distro run opengl=nv image=casa-dev-OCT1-ubuntu-18.04.sif glxgears
*** stack smashing detected ***: <unknown> terminated
Aborted
I don't have an older one. But then, does it really depend on the image?
> But then, does it really depend on the image?
It has to: nothing has changed in my setup besides running pull_image, and the *** stack smashing detected *** errors started occurring right at that moment. I even checked in the dpkg logs that no NVidia driver update happened in the background.
I hope that this is not a problem in the reproducibility of image builds...
At home (Ubuntu 18.04 + driver 390.138), same story:
riviere@gargamel $ ls /.singularity.d/libs/
libEGL_nvidia.so.0 libnvidia-fatbinaryloader.so.390.138
libGLESv1_CM_nvidia.so.1 libnvidia-fbc.so
libGLESv2_nvidia.so.2 libnvidia-fbc.so.1
libGLX_nvidia.so.0 libnvidia-glcore.so.390.138
libcuda.so libnvidia-glsi.so.390.138
libcuda.so.1 libnvidia-ifr.so
libnvcuvid.so libnvidia-ifr.so.1
libnvcuvid.so.1 libnvidia-ml.so
libnvidia-cfg.so libnvidia-ml.so.1
libnvidia-cfg.so.1 libnvidia-opencl.so.1
libnvidia-compiler.so.390.138 libnvidia-ptxjitcompiler.so
libnvidia-eglcore.so.390.138 libnvidia-ptxjitcompiler.so.1
libnvidia-encode.so libnvidia-tls.so.390.138
libnvidia-encode.so.1
riviere@gargamel $ glxgears
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
riviere@gargamel $ LD_LIBRARY_PATH=/host/usr/lib/x86_64-linux-gnu/tls:"$LD_LIBRARY_PATH" glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
302 frames in 5.0 seconds = 60.305 FPS
So it really looks like nvidia-container-cli misses a library, and it's a matter of making the tls directory or its content (libnvidia-tls.so.390.138) available in the container's LD_LIBRARY_PATH.
It's not nvidia-container-cli, it's singularity ;)
gargamel:riviere% nvidia-container-cli list --libraries
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.390.138
/usr/lib/x86_64-linux-gnu/libcuda.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.390.138
/usr/lib/x86_64-linux-gnu/libnvcuvid.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.390.138
/usr/lib/x86_64-linux-gnu/tls/libnvidia-tls.so.390.138 # <--- in tls/
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.390.138
/usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.390.138
/usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.390.138
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.390.138
/usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.390.138
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.390.138
gargamel:riviere% ll /usr/lib/x86_64-linux-gnu/tls/libnvidia-tls.so.390.138
16 -rw-r--r-- 1 root root 14480 mai 14 13:01 /usr/lib/x86_64-linux-gnu/tls/libnvidia-tls.so.390.138
in the container:
riviere@gargamel $ ll /.singularity.d/libs/libnvidia-tls.so.390.138
16 -rw-r--r-- 1 root root 13080 May 14 13:01 /.singularity.d/libs/libnvidia-tls.so.390.138
(not the same file: it is the one from the parent dir)
This last commit is a workaround for the problem: it mounts the tls lib directory and inserts it first in LD_LIBRARY_PATH. It seems to fix the problem on my machines. Tell me if you still have crashes.
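A sketch of the kind of logic such a workaround might implement (not the actual casa-distro commit; the function name and mount point are made up; it assumes the SINGULARITYENV_* mechanism visible in the logs above can be used to set LD_LIBRARY_PATH in the container):

```python
# Sketch only, not the actual casa-distro commit: locate the host 'tls'
# variant of libnvidia-tls, bind it into the container and prepend its
# directory to the container LD_LIBRARY_PATH.
import glob
import os


def add_nvidia_tls_workaround(singularity_cmd, env):
    candidates = (glob.glob('/usr/lib/*/tls/libnvidia-tls.so.*')
                  + glob.glob('/usr/lib/nvidia-*/tls/libnvidia-tls.so.*'))
    if not candidates:
        return  # no proprietary nvidia tls lib on the host: nothing to do
    tls_dir = os.path.dirname(candidates[0])
    mount_point = '/usr/local/nvidia-tls'  # arbitrary, made-up mount point
    singularity_cmd += ['--bind', '%s:%s' % (tls_dir, mount_point)]
    old = env.get('SINGULARITYENV_LD_LIBRARY_PATH', '')
    env['SINGULARITYENV_LD_LIBRARY_PATH'] = \
        mount_point + (':' + old if old else '')
```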
Your workaround works for me, so I will close the issue. Thanks @denisri!
The reason why it stopped working for me right when I updated my image will remain a mystery...