
Singularity has been renamed to Apptainer as part of us moving the project to the Linux Foundation. This repo has been persisted as a snapshot right before the changes.
https://github.com/apptainer/apptainer

Headless OpenGL rendering using NVIDIA GPU and EGL: unable to detect display (EGLDisplay) #1635

Closed: henzler closed this issue 6 years ago

henzler commented 6 years ago

Version of Singularity: 2.5.1-master.gd6e81547 (also tested: 2.4)

Problem

Hello, I am trying to create an OpenGL context with the following code:

#include <assert.h>
#include <stdio.h>
#include <iostream>

#define EGL_EGLEXT_PROTOTYPES
#include <EGL/egl.h>
#include <EGL/eglext.h>

static const EGLint configAttribs[] = {
    EGL_SURFACE_TYPE, EGL_PBUFFER_BIT,
    EGL_BLUE_SIZE, 8,
    EGL_GREEN_SIZE, 8,
    EGL_RED_SIZE, 8,
    EGL_DEPTH_SIZE, 8,
    EGL_RENDERABLE_TYPE, EGL_OPENGL_BIT,
    EGL_NONE
};

static const int pbufferWidth = 9;
static const int pbufferHeight = 9;

static const EGLint pbufferAttribs[] = {
    EGL_WIDTH, pbufferWidth,
    EGL_HEIGHT, pbufferHeight,
    EGL_NONE,
};

int main(int argc, char *argv[])
{

    static const int MAX_DEVICES = 4;
    EGLDeviceEXT eglDevs[MAX_DEVICES];
    EGLint numDevices;

    PFNEGLQUERYDEVICESEXTPROC eglQueryDevicesEXT =(PFNEGLQUERYDEVICESEXTPROC)
    eglGetProcAddress("eglQueryDevicesEXT");

    eglQueryDevicesEXT(MAX_DEVICES, eglDevs, &numDevices);

    printf("Detected %d devices\n", numDevices);

    PFNEGLGETPLATFORMDISPLAYEXTPROC eglGetPlatformDisplayEXT =  (PFNEGLGETPLATFORMDISPLAYEXTPROC)
    eglGetProcAddress("eglGetPlatformDisplayEXT");

    EGLDisplay eglDpy = eglGetPlatformDisplayEXT(EGL_PLATFORM_DEVICE_EXT, eglDevs[0], 0);

    //EGLDisplay eglDpy = eglGetDisplay(EGL_DEFAULT_DISPLAY);

    // 1. Initialize EGL
    std::cout << "EGL eglDpy: " << eglDpy << std::endl;

    EGLint major, minor;

    eglInitialize(eglDpy, &major, &minor);

    std::cout << "minor, major: " << minor << ", " << major << std::endl;

    // 2. Select an appropriate configuration
    EGLint numConfigs;
    EGLConfig eglCfg;

    eglChooseConfig(eglDpy, configAttribs, &eglCfg, 1, &numConfigs);
    std::cout << "EGL numConfigs: " << numConfigs << std::endl;
    std::cout << "EGL eglCfg: " << eglCfg << std::endl;

    // 3. Create a surface
    //EGLSurface eglSurf = eglCreatePbufferSurface(eglDpy, eglCfg, pbufferAttribs);
    //std::cout << "EGL surf: " << eglSurf << std::endl;

    // 4. Bind the API
    eglBindAPI(EGL_OPENGL_API);

    // 5. Create a context and make it current
    EGLContext eglCtx = eglCreateContext(eglDpy, eglCfg, EGL_NO_CONTEXT,NULL);

    assert (eglDpy != NULL);
    assert (eglCfg != NULL);
    //assert (eglSurf != NULL);

    eglMakeCurrent(eglDpy, EGL_NO_SURFACE, EGL_NO_SURFACE, eglCtx);

    std::cout << "EGL Ctx: " << eglCtx << std::endl;
    assert (eglCtx != NULL);

    // 6. Terminate EGL when finished
    eglTerminate(eglDpy);
    return 0;
}
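
(For reference, a minimal build-and-run sketch, assuming the source above is saved as hello.cpp and built into a binary named test to match the assertion message further down, and that the EGL development files are installed.)

    # hedged sketch: compile the example and link against the EGL client library
    g++ hello.cpp -o test -lEGL
    ./test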

### Expected behavior
On my host machine I get the following result:

    Detected 4 devices
    EGL eglDpy: 0x5607ef979890
    minor, major: 4, 1
    EGL numConfigs: 1
    EGL eglCfg: 0xcaf353
    EGL Ctx: 0x5607ef9893f1


### Actual behavior
As soon as I go into a singularity container like this:

`singularity exec --nv -B /link:/link2,/link3/link4 pytorch.simg ./train.sh`
Note: train.sh executes the code above.
Also note: I use the `--nv` option!

I receive the following output:

    Detected 0 devices
    EGL eglDpy: 0
    minor, major: 32, 21907
    EGL numConfigs: 0
    EGL eglCfg: 0x10000ffff
    test: hello.cpp:80: int main(int, char**): Assertion `eglDpy != NULL' failed.

### Steps to reproduce behavior

My Singularity file is very simple and looks like this:

    Bootstrap: docker
    From: nvidia/cuda:9.0-runtime-ubuntu16.04
    #From: ubuntu --> I have tried both

    %post
        apt-get update

I then build and run it with:

    sudo singularity build pytorch.simg Singularity
    singularity exec --nv -B /link:/link2,/link3/link4 pytorch.simg ./train.sh

So basically I am not doing anything to the "guest" system with the Singularity file; I only create an Ubuntu image, so everything else should be the same?

EDIT

As soon as I go into a singularity container like this:

    singularity exec --nv pytorch.simg bash

I get no results for:

    find /usr -type f -name "libGL*"

On my host system, however, I get:

 /usr/local/cuda-8.0/samples/common/lib/linux/aarch64/libGLEW.a
    /usr/local/cuda-8.0/samples/common/lib/linux/x86_64/libGLEW.a
    /usr/local/cuda-8.0/samples/common/lib/linux/armv7l/libGLEW.a
    /usr/lib32/nvidia-384/libGLESv1_CM.so.1
    /usr/lib32/nvidia-384/libGLESv1_CM_nvidia.so.384.111
    /usr/lib32/nvidia-384/libGLdispatch.so.0
    /usr/lib32/nvidia-384/libGL.so.384.111
    /usr/lib32/nvidia-384/libGLX_nvidia.so.384.111
    /usr/lib32/nvidia-384/libGLESv2_nvidia.so.384.111
    /usr/lib32/nvidia-384/libGLESv2.so.2
    /usr/lib32/nvidia-384/libGL.la
    /usr/lib32/nvidia-384/libGLX.so.0
    /usr/lib/x86_64-linux-gnu/libGLEW.so.1.13.0
    /usr/lib/x86_64-linux-gnu/libGLU.so.1.3.1
    /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1.2.0
    /usr/lib/x86_64-linux-gnu/libGLU.a
    /usr/lib/x86_64-linux-gnu/mesa-egl/libGLESv2.so.2.0.0
    /usr/lib/nvidia-384/libGLESv1_CM.so.1
    /usr/lib/nvidia-384/libGLESv1_CM_nvidia.so.384.111
    /usr/lib/nvidia-384/libGLdispatch.so.0
    /usr/lib/nvidia-384/libGL.so.384.111
    /usr/lib/nvidia-384/libGLX_nvidia.so.384.111
    /usr/lib/nvidia-384/libGLESv2_nvidia.so.384.111
    /usr/lib/nvidia-384/libGLESv2.so.2
    /usr/lib/nvidia-384/libGLX.so.0

find -type f | wc -l

HOST: 4026
GUEST: 1602
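
(A hedged way to see what --nv actually bound into the container; /.singularity.d/libs is the bind target that the ldd output later in this thread points at.)

    singularity exec --nv pytorch.simg ls /.singularity.d/libs/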

Xingyu-Lin commented 6 years ago

Same issue here. Any followup?

jmstover commented 6 years ago

For GL ... what GL libraries is the program linking to? `ldd /path/to/gl/binary`

You may need to add some libraries into $sysconfdir/singularity/nvliblist.conf

As for the display ... you may need to add something like the following:

    export SINGULARITYENV_DISPLAY=${DISPLAY}
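
(A hedged sketch of the first two checks; /usr/local/etc is only an example for $sysconfdir, and pytorch.simg / ./test are the names used in this thread.)

    # which GL/EGL libraries does the binary resolve inside the container?
    singularity exec --nv pytorch.simg ldd ./test
    # libraries that the --nv option binds in are listed one per line here
    cat /usr/local/etc/singularity/nvliblist.conf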

Xingyu-Lin commented 6 years ago

Here is the result for ldd /usr/bin/glxgears

    linux-vdso.so.1 =>  (0x00007ffd1ebe7000)
    libGL.so.1 => /.singularity.d/libs/libGL.so.1 (0x00007fb8d8d48000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb8d8a32000)
    libX11.so.6 => /usr/lib/x86_64-linux-gnu/libX11.so.6 (0x00007fb8d86f8000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb8d832f000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb8d812a000)
    libGLX.so.0 => /.singularity.d/libs/libGLX.so.0 (0x00007fb8d7efa000)
    libGLdispatch.so.0 => /.singularity.d/libs/libGLdispatch.so.0 (0x00007fb8d7c2c000)
    /lib64/ld-linux-x86-64.so.2 (0x0000561e12fa6000)
    libxcb.so.1 => /usr/lib/x86_64-linux-gnu/libxcb.so.1 (0x00007fb8d7a09000)
    libXext.so.6 => /usr/lib/x86_64-linux-gnu/libXext.so.6 (0x00007fb8d77f7000)
    libXau.so.6 => /usr/lib/x86_64-linux-gnu/libXau.so.6 (0x00007fb8d75f3000)
    libXdmcp.so.6 => /usr/lib/x86_64-linux-gnu/libXdmcp.so.6 (0x00007fb8d73ec000)

I am not sure what I should add to nvliblist.conf. Currently it has the following libraries:

libcuda.so
libEGL_installertest.so
libEGL_nvidia.so
libEGL.so
libGLdispatch.so
libGLESv1_CM_nvidia.so
libGLESv1_CM.so
libGLESv2_nvidia.so
libGLESv2.so
libGL.so
libGLX_installertest.so
libGLX_nvidia.so
libglx.so
libGLX.so
libnvcuvid.so
libnvidia-cfg.so
libnvidia-compiler.so
libnvidia-eglcore.so
libnvidia-egl-wayland.so
libnvidia-encode.so
libnvidia-fatbinaryloader.so
libnvidia-fbc.so
libnvidia-glcore.so
libnvidia-glsi.so
libnvidia-gtk2.so
libnvidia-gtk3.so
libnvidia-ifr.so
libnvidia-ml.so
libnvidia-opencl.so
libnvidia-ptxjitcompiler.so
libnvidia-tls.so
libnvidia-wfb.so
libOpenCL.so
libOpenGL.so
libvdpau_nvidia.so
nvidia_drv.so
tls_test_.so

jmstover commented 6 years ago
    libGL.so.1 => /.singularity.d/libs/libGL.so.1 (0x00007fb8d8d48000)
    libGLX.so.0 => /.singularity.d/libs/libGLX.so.0 (0x00007fb8d7efa000)
    libGLdispatch.so.0 => /.singularity.d/libs/libGLdispatch.so.0 (0x00007fb8d7c2c000)

It looks like the main GL libs are being pulled from those brought in by the --nv option. The others are the X server libs, which are going to be container dependent.

Have you tried setting `export SINGULARITYENV_DISPLAY=${DISPLAY}`? What that will do is set the DISPLAY environment variable inside the container to the value it has on the host.
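
(For illustration, a hedged example of how the SINGULARITYENV_ prefix works; :0.0 is just a sample value.)

    # variables prefixed with SINGULARITYENV_ are injected into the container
    # with the prefix stripped
    export SINGULARITYENV_DISPLAY=:0.0
    singularity exec --nv pytorch.simg sh -c 'echo $DISPLAY'   # prints :0.0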

Xingyu-Lin commented 6 years ago

I am not sure what you mean by ${DISPLAY}. This variable is an empty string for me... I tried setting SINGULARITYENV_DISPLAY directly with that command and it does not work.

jmstover commented 6 years ago

Umm... okay... DISPLAY holds the display of the X server. For instance, on my laptop, the display is:

:0.0

Its layout is basically [host]:[display][.screen]. That tells the application which X server and display to send the graphical output to. When you have a VirtualGL setup, etc., you'll end up with a DISPLAY like :10.0 ... :0.0 is generally the monitor hooked up to the graphics card, even headless; it's the local display. :10.0 would be an offset display. Another user logged in at the same time could get :12.0, and so on.

The other displays will generally proxy through the local display depending on your setup. You can configure multiple displays in the X configuration, but I highly doubt that is the case here, so generally how this works is that you have the system automatically logging into the GUI. The default user is set up to allow connections to the local display from the local machine.

Each user that spawns off a new session is then assigned a DISPLAY that they write to, but it's rendered on display :0.0 .... which is the local display / hardware.
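
(A quick hedged illustration of that layout; the values are only examples.)

    # DISPLAY has the form [host]:display[.screen]
    echo $DISPLAY      # e.g. ":0.0"       -> local display 0, screen 0
                       #      ":10.0"      -> an offset display (VirtualGL, X forwarding, ...)
                       #      "remote:0.0" -> display 0 on another host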

Xingyu-Lin commented 6 years ago

Thanks for the explanation. I compiled the code at the top of this thread to get an executable. If I run the executable inside the container, I get an error saying that no device is detected and there is no EGL display. However, if I run the executable outside of the container, it is able to detect 4 GPUs.

Both inside and outside the container, DISPLAY is an empty string.

jmstover commented 6 years ago

Hrmm... Are you creating an EGL context as well (as in the original post)?

The only thing I can find for that is that nVidia says to link against libOpenGL.so and libEGL.so, but you have it linking against libGLX.so for that context.

What is the full singularity command you're using?

Xingyu-Lin commented 6 years ago

The singularity command I am using is just `singularity shell --nv chester/containers/ubuntu-16.04-lts-rl.img`

When I print out the shared libraries used by the executable inside the container using `ldd ./test`, here is what I get:

    linux-vdso.so.1 =>  (0x00007ffcfe9a8000)
    libEGL.so.1 => /.singularity.d/libs/libEGL.so.1 (0x00007fdc39b5d000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fdc397ce000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fdc39405000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fdc39201000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fdc38ef7000)
    libGLdispatch.so.0 => /.singularity.d/libs/libGLdispatch.so.0 (0x00007fdc38c29000)
    /lib64/ld-linux-x86-64.so.2 (0x0000561d48a0c000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fdc38a13000)

And if I do this outside of the container, here is what I get:

    linux-vdso.so.1 =>  (0x00007fffccda3000)
    libEGL.so.1 => /lib64/libEGL.so.1 (0x00007f34ddf3c000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f34ddc33000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f34dd870000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f34dd66c000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f34dd369000)
    libGLdispatch.so.0 => /lib64/libGLdispatch.so.0 (0x00007f34dd09b000)
    /lib64/ld-linux-x86-64.so.2 (0x000055f0ce115000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f34dce85000)

Outside the container, it works fine. To clarify, the source code used to compile the executable is the same as the original post.

I have compared the two libEGL.so.1 files that are used and they are identical...

jmstover commented 6 years ago

Okay, I built that and compiled it ... After an strace on the run, I needed to add `-B /usr/share/glvnd` as an option to bind mount that directory in.

There's a file named /usr/share/glvnd/egl_vendor.d/10_nvidia.json that it tried opening. That config file doesn't exist in the container, and its contents depend on what's installed on the host.
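
(Applied to the command from the original post, the workaround would look something like this; the bind paths are assumptions based on the host layout described above.)

    # hedged sketch: bind the host's glvnd vendor config into the container as well
    singularity exec --nv -B /usr/share/glvnd -B /link:/link2,/link3/link4 pytorch.simg ./train.sh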

To verify what your binary is looking for, run something like:

    strace ./egl 2>&1 | less

Then look for lines that contain egl_vendor.d.
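
(An equivalent, slightly narrower filter, as a hedged sketch rather than anything from the original run.)

    # only trace file-open syscalls and keep the lines that mention the vendor config
    strace -f -e trace=open,openat ./egl 2>&1 | grep egl_vendor.d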

$ /usr/local/singularity/2.6.0/bin/singularity exec --nv -B /usr/share/glvnd cdash/ ~/tmp/egl
Detected 1 devices
EGL eglDpy: 0x6446f0
minor, major: 4, 1
EGL numConfigs: 1
EGL eglCfg: 0xcaf339
EGL Ctx: 0x649001

Note: The cdash/ sandbox is the only one I had with X libraries... :/

Xingyu-Lin commented 6 years ago

Your solution solved my problem! And also thanks a lot for explaining how you find it!

SomeoneSerge commented 6 months ago

I believe (I only tested with apptainer and on NixOS) the (nvliblist-based) --nv still does not detect the host's glvnd configuration. Should we reopen the issue?
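
(Until that is resolved, a hedged workaround along the lines of the fix above; image.sif and ./egl are placeholder names.)

    # bind the host's glvnd vendor config manually, as with singularity earlier in the thread
    apptainer exec --nv -B /usr/share/glvnd image.sif ./egl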

DrDaveD commented 6 months ago

This git repository is closed. If you reproduce the problem with singularity-ce, open a new issue at https://github.com/sylabs/singularity; otherwise open a new one at https://github.com/apptainer/apptainer.