VirtualGL / virtualgl

Main VirtualGL repository
https://VirtualGL.org

Segmentation fault when using VirtualGL EGL backend with cgroup managed GPUs #157

Closed · timeu closed this issue 3 years ago

timeu commented 3 years ago

We have a SLURM cluster with several GPU nodes that each contain multiple GPUs (NVIDIA T100, V100 and RTX). The SLURM cluster is configured to constrain GPU access based on the user's resource request, using the cgroup devices subsystem. I installed the 3.x preview branch (2.6.80) of VirtualGL in a Singularity container.

When I ssh into the node as root without any SLURM allocation (so I have access to all 4 GPUs), the EGL backend works just fine.

Singularity> nvidia-smi
Mon Feb  8 21:06:51 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:00:06.0 Off |                    0 |
| N/A   29C    P0    25W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   29C    P0    25W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   30C    P0    26W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   31C    P0    25W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The node has 4 GPU cards installed (note: /dev/dri/card0 is the primary video display):

Singularity> ls -la /dev/dri/*
crw-rw----. 1 root irc 226,   0 Dec  4 13:08 /dev/dri/card0
crw-rw----. 1 root irc 226,   1 Dec  4 13:08 /dev/dri/card1
crw-rw----. 1 root irc 226,   2 Dec  4 13:08 /dev/dri/card2
crw-rw----. 1 root irc 226,   3 Dec  4 13:08 /dev/dri/card3
crw-rw----. 1 root irc 226,   4 Dec  4 13:08 /dev/dri/card4
crw-rw----. 1 root irc 226, 128 Dec  4 13:08 /dev/dri/renderD128
crw-rw----. 1 root irc 226, 129 Dec  4 13:08 /dev/dri/renderD129
crw-rw----. 1 root irc 226, 130 Dec  4 13:08 /dev/dri/renderD130
crw-rw----. 1 root irc 226, 131 Dec  4 13:08 /dev/dri/renderD131

Running eglinfo on the 4 cards works as expected:

Singularity> /opt/VirtualGL/bin/eglinfo /dev/dri/card1
device: /dev/dri/card1
EGL client APIs string: OpenGL_ES OpenGL
EGL vendor string: NVIDIA
EGL version string: 1.5
display EGL extensions:
    EGL_EXT_buffer_age, EGL_EXT_client_sync,
    ...
    EGL_KHR_platform_gbm, EGL_KHR_platform_wayland, EGL_KHR_platform_x11,
    EGL_MESA_platform_gbm, EGL_MESA_platform_surfaceless
EGL version: 1.5
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 32768 MB
    Total available memory: 32768 MB
    Currently available dedicated video memory: 32503 MB
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: Tesla V100-PCIE-32GB/PCIe/SSE2
OpenGL version string: 4.6.0 NVIDIA 455.23.05
....

I can run the OpenGL benchmark on all 4 cards just fine.

Next I try to submit an interactive SLURM job and request 1 GPU:

srun  -p g --gres=gpu:1 --pty bash

SLURM will restrict the GPU that I can access using the cgroup device subsystem:

Singularity> nvidia-smi
Mon Feb  8 21:16:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:00:06.0 Off |                    0 |
| N/A   29C    P0    25W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here I run into the first issue: which of the /dev/dri/cardX devices does the cgroup-allowed GPU map to? It seems that this feature request (autoegl) is somewhat related. My current workaround is to test all of the /dev/dri/cardX devices with eglinfo. The correct device (/dev/dri/card1) returns the output above; selecting any other device (e.g. /dev/dri/card2) returns a segmentation fault:

Singularity> /opt/VirtualGL/bin/eglinfo /dev/dri/card2
Segmentation fault
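
Scripted, this probe-all-devices workaround looks roughly like the following (a sketch; EGL_DEVICE is just an illustrative variable name):

# Try each DRM node and keep the first one that eglinfo can actually use;
# cgroup-denied nodes exit non-zero (or crash) and are skipped.
for DRM in /dev/dri/card*; do
  if /opt/VirtualGL/bin/eglinfo "$DRM" > /dev/null 2>&1; then
    EGL_DEVICE="$DRM"
    break
  fi
done
echo "Usable EGL device: $EGL_DEVICE"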

If I use the correct device, everything works fine (OpenGL benchmark, etc.). Next, I try to run another SLURM allocation using srun -p g --gres=gpu:1 --pty bash. SLURM will allocate the next available GPU card:

[root@clip-g2-3 tmp]# nvidia-smi
Mon Feb  8 22:05:03 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   29C    P0    25W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, now every /dev/dri/cardX device gives me a segmentation fault when I run eglinfo on it:

Singularity> /opt/VirtualGL/bin/eglinfo /dev/dri/card1
Segmentation fault
Singularity> /opt/VirtualGL/bin/eglinfo /dev/dri/card2
Segmentation fault
Singularity> /opt/VirtualGL/bin/eglinfo /dev/dri/card3
Segmentation fault
Singularity> /opt/VirtualGL/bin/eglinfo /dev/dri/card4
Segmentation fault

Interestingly, if I request all 4 GPUs with srun -p g --gres=gpu:4 --pty bash, all 4 devices work fine.

I guess one workaround is to not use the EGL backend and use the traditional GLX backend instead, but it would be nice if I could get this working with the EGL backend.

timeu commented 3 years ago

I installed the debuginfo packages and ran /opt/VirtualGL/bin/eglinfo /dev/dri/card2 through GDB. This is the backtrace:

(gdb) run /dev/dri/card2
Starting program: /opt/VirtualGL/bin/eglinfo /dev/dri/card2
Missing separate debuginfos, use: debuginfo-install glibc-2.17-323.el7_9.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff5d4fff0 in ?? () from /.singularity.d/libs/libEGL_nvidia.so.0
Missing separate debuginfos, use: debuginfo-install libX11-1.6.7-3.el7_9.x86_64 libXau-1.0.8-2.1.el7.x86_64 libXext-1.3.3-3.el7.x86_64 libdrm-2.4.97-2.el7.x86_64 libxcb-1.13-1.el7.x86_64
(gdb) backtrace
#0  0x00007ffff5d4fff0 in ?? () from /.singularity.d/libs/libEGL_nvidia.so.0
#1  0x0000000000401edc in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/VirtualGL-2.6.80/glxdemos/eglinfo.c:707
(gdb) frame 1
#1  0x0000000000401edc in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/VirtualGL-2.6.80/glxdemos/eglinfo.c:707
707              _eglQueryDeviceStringEXT(devices[i], EGL_DRM_DEVICE_FILE_EXT);

Edit:

To me it looks like this code, which iterates over all /dev/dri/card* devices and calls https://github.com/VirtualGL/virtualgl/blob/dev/glxdemos/eglinfo.c#L707, segfaults if a device is not accessible because access was denied by the cgroup device subsystem.

dcommander commented 3 years ago

The EGL back end feature already has a 60-hour funding deficit, and at the moment, I'm facing the prospect of borrowing against the entire 2021/2022 VirtualGL General Fund just to finish the 3.0 release, including documenting the EGL back end and fixing issues with it. I do not have the resources to dig into this issue and figure out how to reproduce it outside of a SLURM environment. Simply changing the permissions on /dev/dri/card0 doesn't reproduce the issue, I have no clue how to set up cgroups, I have no time to learn how to do that unless someone is paying for that time, and even if I could reproduce the issue, I have no clue how to work around it (assuming such is even possible.)

The more information you can give me, and the easier you can make it for me to repro the issue and understand potential workarounds, the more likely it is to get fixed. At the moment, I don't have enough to work with.

timeu commented 3 years ago

@dcommander: I understand. I will try to create a small reproducible case without SLURM (just using cgroups) and post more information as soon as I have something. It might also not be an issue in the VirtualGL code but rather a bug in NVIDIA's EGL library, because I tried to run https://github.com/KDAB/eglinfo/blob/master/main.cpp, which just enumerates the EGL backends, and it segfaults the same way (as far as I can tell, when calling eglGetPlatformDisplayEXT in libEGL). I will open an issue in https://github.com/NVIDIA/libglvnd and see if they have some information.

dcommander commented 3 years ago

OK, thanks. I strongly suspect that this is an nVidia issue, and frankly, I encountered several such issues already that had to be worked around in the EGL back end. Some couldn't be worked around, which is why the EGL back end can't be used with 415.xx. Unfortunately, there isn't another device-based EGL implementation that I can use to cross-check nVidia's. The AMDGPU driver pretends to support device-based EGL, but it doesn't actually work.

shanerade commented 3 years ago

How does one help fund the 60 hours?

dcommander commented 3 years ago

@shanerade contact me via e-mail (https://virtualgl.org/About/Contact) if you would like to make a large donation. Otherwise, there is a link for small donations on the landing page of https://virtualgl.org.

timeu commented 3 years ago

So I created a reproducible case for NVIDIA driver version 455.23.05. This is on CentOS Linux release 7.9.2009 (Core) with a 3.10.0-1127.19.1.el7.x86_64 kernel.

  1. Create a new cgroup to constrain the devices: mkdir /sys/fs/cgroup/devices/egl_debug
  2. Move your own process into the new cgroup: echo $$ > /sys/fs/cgroup/devices/egl_debug/tasks
  3. Remove access to the first NVIDIA GPU: echo "c 195:0 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny
  4. Check with nvidia-smi that the GPU is no longer visible
  5. Running eglinfo on any card will segfault: /opt/VirtualGL/bin/eglinfo /dev/dri/cardX (the steps are consolidated into a single snippet below)
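
Consolidated, the repro looks roughly like this (a sketch of the steps above; 195 is the char-device major of the /dev/nvidiaN nodes on this system, and the cgroup v1 paths assume the CentOS 7 defaults):

# 1. Create a devices cgroup and move the current shell into it
mkdir /sys/fs/cgroup/devices/egl_debug
echo $$ > /sys/fs/cgroup/devices/egl_debug/tasks

# 2. Deny access to the first NVIDIA GPU (char device 195:0)
echo "c 195:0 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny

# 3. The GPU should now be gone from nvidia-smi, and eglinfo segfaults
nvidia-smi
/opt/VirtualGL/bin/eglinfo /dev/dri/card1   # segfaults with driver 455.23.05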

FYI: On our test system, where we upgraded the NVIDIA driver to 460.32.03, I don't get a segfault; however, I get the error message Error: no EGL devices found.

dcommander commented 3 years ago

Reproduced, but I'm not sure what I can do about it. Given that the problem goes away with a newer driver, it was apparently a driver bug. I'm open to suggestions, but I personally have no ideas regarding how to work around this. If eglQueryDeviceStringEXT() will segfault on a particular device, then that device should not have been enumerated by eglQueryDevicesEXT(). It sounds like that is exactly the bug that nVidia fixed.

timeu commented 3 years ago

I am not sure it is really fixed in the newer driver version. The segfault doesn't happen, but as soon as access to one GPU is removed using cgroups, /opt/VirtualGL/bin/eglinfo on the EGL device that is still accessible returns Error: no EGL devices found. This could still be an NVIDIA driver issue.

In any case, we decided to use the traditional GLX backend instead of the EGL backend, because with EGL I also have the issue of selecting the correct /dev/dri/cardX device on an HPC system where the user might only have access to a subset of the GPUs.

dcommander commented 3 years ago

If you specify VGL_DISPLAY=egl (or pass -d egl to vglrun), VirtualGL will use the first EGL device it encounters. Thus, calling eglQueryDeviceStringEXT() isn't strictly necessary in that case, but eliminating that call when VGL_DISPLAY==egl wouldn't necessarily work around the bug. If the drivers are returning inaccessible devices from the call to eglQueryDevicesEXT(), then one of those inaccessible devices might be the first device returned.

However, let's try something.

--- a/glxdemos/eglinfo.c
+++ b/glxdemos/eglinfo.c
@@ -702,9 +702,22 @@ main(int argc, char *argv[])
       fprintf(stderr, "Error: eglQueryDeviceStringEXT() could not be loaded");
       return -1;
    }
+   _eglGetPlatformDisplayEXT =
+      (PFNEGLGETPLATFORMDISPLAYEXTPROC)eglGetProcAddress("eglGetPlatformDisplayEXT");
+   if (!_eglGetPlatformDisplayEXT) {
+      fprintf(stderr, "Error: eglGetPlatformDisplayEXT() could not be loaded\n");
+      return -1;
+   }
    for (i = 0; i < numDevices; i++) {
-      const char *devStr =
-         _eglQueryDeviceStringEXT(devices[i], EGL_DRM_DEVICE_FILE_EXT);
+      const char *devStr;
+
+      edpy = _eglGetPlatformDisplayEXT(EGL_PLATFORM_DEVICE_EXT, devices[i],
+                                       NULL);
+      if (!edpy || !eglInitialize(edpy, &major, &minor))
+         continue;
+      eglTerminate(edpy);
+      devStr = _eglQueryDeviceStringEXT(devices[i], EGL_DRM_DEVICE_FILE_EXT);
+      fprintf(stderr, "Device %d = %s\n", i, devStr);
       if (devStr && !strcmp(devStr, opts.displayName))
          break;
    }
@@ -713,12 +726,6 @@ main(int argc, char *argv[])
       free(devices);
       return -1;
    }
-   _eglGetPlatformDisplayEXT =
-      (PFNEGLGETPLATFORMDISPLAYEXTPROC)eglGetProcAddress("eglGetPlatformDisplayEXT");
-   if (!_eglGetPlatformDisplayEXT) {
-      fprintf(stderr, "Error: eglGetPlatformDisplayEXT() could not be loaded\n");
-      return -1;
-   }
    edpy = _eglGetPlatformDisplayEXT(EGL_PLATFORM_DEVICE_EXT, devices[i], NULL);
    if (!edpy) {
       fprintf(stderr, "Error: unable to open EGL display\n");

Unless I miss my guess, this should filter out the inaccessible devices.

dcommander commented 3 years ago

I can add that same code to the VGL faker, if it effectively works around the problem from your point of view. That should make VGL_DISPLAY=egl work properly for the case in which only one GPU is accessible. If multiple GPUs are accessible, then you'll still have to select one somehow, but you would have to do that with the GLX back end as well.
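
For example (a sketch; glxspheres64 just stands in for whatever OpenGL application is being run):

# Let VirtualGL pick the first EGL device it finds instead of naming a DRM node
vglrun -d egl /opt/VirtualGL/bin/glxspheres64

# Equivalent, using the environment variable
VGL_DISPLAY=egl vglrun /opt/VirtualGL/bin/glxspheres64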

timeu commented 3 years ago

@dcommander thanks for looking into this. I haven't really tested vglrun with cgroup-constrained GPUs, only eglinfo. Is the code path the same? I am also surprised that when running /opt/VirtualGL/bin/eglinfo on the /dev/dri/cardX device that should still be accessible, I still get Error: no EGL devices found. Does it also iterate over all EGL devices even if I specify a specific EGL device?

I can definitely try the patch and will also try vglrun.

dcommander commented 3 years ago

> @dcommander thanks for looking into this. I haven't really tested vglrun with cgroup-constrained GPUs, only eglinfo. Is the code path the same?

The VirtualGL Faker and eglinfo do not share the same literal code, but the algorithm they use to scan for an EGL device is the same. Thus, if we can make eglinfo work properly, then I can port the same changes into the faker.

> I am also surprised that when running /opt/VirtualGL/bin/eglinfo on the /dev/dri/cardX device that should still be accessible, I still get Error: no EGL devices found. Does it also iterate over all EGL devices even if I specify a specific EGL device?

Here is the relevant code:
https://github.com/VirtualGL/virtualgl/blob/dev/glxdemos/eglinfo.c#L678-L732
https://github.com/VirtualGL/virtualgl/blob/dev/server/faker.cpp#L198-L222

The answer to your question is multi-pronged:

Now as to why the newer driver isn't exposing the device that should be accessible, I'm not sure. That error message means that eglinfo iterated over all of the devices returned by eglQueryDevicesEXT() and did not find one that matches the name you specified.

At this point, I need you to test the patch and report on its behavior under various scenarios. I don't expect that the patch will be a full solution, but I need to see how the behavior changes in order to move forward. Unfortunately, I don't have a multi-GPU system at my disposal, so I can't fully reproduce all of the issues you are reporting. I can only reproduce the issue with a single device.

dcommander commented 3 years ago

Unfortunately, I am stuck without your help. I was able to successfully configure one of my machines to use two GPUs, an AMD Radeon Pro WX2100 using the amdgpu driver and an nVidia Quadro 600 using the nouveau driver (NOTE: I can't use the nVidia proprietary driver with this GPU because it's too old.) Both work with device-based EGL, but unfortunately, the cgroups trick doesn't work. I tried using devices 29:0 and 29:1, which correspond to /dev/fb0 and /dev/fb1, but both GPUs were still accessible. Since I only have one nVidia GPU that is supported by the current drivers, I can't reproduce the multi-GPU aspect of this issue, and that seems critical to solving the problem.

timeu commented 3 years ago

@dcommander Thanks for the detailed explanation. That makes things clearer. I am happy to test the patch on our HPC system and see how it behaves. I will try to do this today.

For HPC systems where users might request some number of GPUs on a node with multiple GPUs (we have nodes with 4 and 8 GPUs), the batch scheduler (in our case SLURM) uses cgroups to grant access to one (or more) GPUs on the node where the user's job is scheduled. Apart from the above issue (a segfault on old drivers and not finding the right EGL device on new drivers), the EGL backend/mode also forces us to figure out the correct EGL device for the GPU that the user has access to. I wasn't aware of the VGL_DISPLAY=egl solution, but that would be perfect for an HPC system (typically the user only requests a single GPU for the various visualization applications they use). Our current solution is to create a static /etc/X11/xorg.conf file that configures one screen for each of the GPUs on the node; when the user submits a job that requires OpenGL (typically using xpra), we start an X server beforehand with that configuration, and X11 uses the GPU it has access to (basically, it just works). It would be great if we could get the same behavior with the EGL backend (which seems to be what VGL_DISPLAY=egl already does).
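
Roughly, that flow looks like this (a sketch; the display number, config path, and the glxinfo client are only illustrative):

# Start the 3D X server with the static multi-screen xorg.conf; in practice it
# ends up driving only the GPU(s) that the job's cgroup allows it to open.
Xorg :1 -config /etc/X11/xorg.conf +extension GLX -noreset &

# Run the OpenGL application (inside the xpra session) through VirtualGL's
# GLX back end, pointing it at the 3D X server.
VGL_DISPLAY=:1.0 vglrun glxinfo | grep "OpenGL renderer"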

timeu commented 3 years ago

@dcommander: I re-compiled the dev branch with the above patch applied. Without any cgroup constraints on the GPUs, it works fine:

[root@stg-g1-0 virtualgl]# bin/eglinfo -B /dev/dri/card0
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
Device 3 = /dev/dri/card4
Device 4 = /dev/dri/card5
Device 5 = /dev/dri/card6
Device 6 = /dev/dri/card7
Device 7 = /dev/dri/card8
Error: invalid EGL device

[root@stg-g1-0 virtualgl]# bin/eglinfo -B /dev/dri/card1
Device 0 = /dev/dri/card1
device: /dev/dri/card1
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 12288 MB
    Total available memory: 12288 MB
    Currently available dedicated video memory: 12194 MB
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
OpenGL version string: 4.6.0 NVIDIA 460.32.03
OpenGL shading language version string: 4.60 NVIDIA

However, as soon as I constrain it to a single GPU, I get Error: no EGL devices found. I will try to add some print statements and see why it's bailing out, unless you have a better idea of how best to debug it (maybe using gdb?).

dcommander commented 3 years ago

Let's please focus on one thing at a time. I am first trying to work around the segfault in 455.xx. Then we can discuss how to work around the other issue in 460.xx.

timeu commented 3 years ago

@dcommander: Sorry for the confusion. So I tested your patch against the 455.23.05 driver version, and indeed the patch fixes the segfault:

Unpatched version:

[root@clip-g2-2]# /groups/it/uemit/virtualgl_orig/bin/eglinfo /dev/dri/card2 -B
Segmentation fault

Patched version:

[root@clip-g2-2]# /groups/it/uemit/virtualgl/bin/eglinfo /dev/dri/card2 -B
Device 1 = /dev/dri/card2
device: /dev/dri/card2
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 32768 MB
    Total available memory: 32768 MB
    Currently available dedicated video memory: 32503 MB
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: Tesla V100-PCIE-32GB/PCIe/SSE2
OpenGL version string: 4.6.0 NVIDIA 455.23.05
OpenGL shading language version string: 4.60 NVIDIA

dcommander commented 3 years ago

Can you please test it methodically? I really need to know how or if the patch works under the following scenarios:

  1. With both devices enabled, try selecting each device with eglinfo. Both should work.
  2. With the first device enabled, try selecting each device with eglinfo. The first should work, and the second should abort with Invalid EGL device.
  3. With the second device enabled, try selecting each device with eglinfo. The second should work, and the first should abort with Invalid EGL device.
  4. With both devices disabled, try selecting each device with eglinfo. Both should abort with Invalid EGL device.

Once we have verified the correct behavior with 455.xx in all scenarios above, then I will check in the patch, and we can move on to 460.xx.

timeu commented 3 years ago

The node has 4 GPUs, so I tested the following scenarios:

1.) No cgroup setup (all 4 GPUs accessible):

[root@clip-g2-2 ~]# nvidia-smi
Mon Mar  1 18:49:55 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:00:06.0 Off |                    0 |
| N/A   30C    P0    25W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   38C    P0    44W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   29C    P0    25W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   32C    P0    25W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

[root@clip-g2-2 bin]# for i in {1..4}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Device 0 = /dev/dri/card1
OpenGL renderer string: Tesla V100-PCIE-32GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
OpenGL renderer string: Tesla V100-PCIE-32GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
OpenGL renderer string: Tesla V100-PCIE-32GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
Device 3 = /dev/dri/card4
OpenGL renderer string: Tesla V100-PCIE-32GB/PCIe/SSE2
2.) Only first device enabled:

[root@clip-g2-2 bin]# echo $$ > /sys/fs/cgroup/devices/egl_debug/tasks
[root@clip-g2-2 bin]# echo "c 195:0 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.allow
[root@clip-g2-2 bin]# echo "c 195:1 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny
[root@clip-g2-2 bin]# echo "c 195:2 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny
[root@clip-g2-2 bin]# echo "c 195:3 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny

[root@clip-g2-2 bin]# nvidia-smi
Mon Mar  1 19:02:45 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:00:06.0 Off |                    0 |
| N/A   30C    P0    25W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

[root@clip-g2-2 bin]# for i in {1..4}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Device 0 = /dev/dri/card1
OpenGL renderer string: Tesla V100-PCIE-32GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Error: invalid EGL device
Device 0 = /dev/dri/card1
Error: invalid EGL device
Device 0 = /dev/dri/card1
Error: invalid EGL device

3.) Only second device enabled:

[root@clip-g2-2 bin]# echo "c 195:1 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.allow
[root@clip-g2-2 bin]# echo "c 195:0 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny

[root@clip-g2-2 bin]# nvidia-smi
Mon Mar  1 19:04:31 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   38C    P0    44W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

[root@clip-g2-2 bin]# for i in {1..4}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Device 1 = /dev/dri/card2
Error: invalid EGL device
Device 1 = /dev/dri/card2
OpenGL renderer string: Tesla V100-PCIE-32GB/PCIe/SSE2
Device 1 = /dev/dri/card2
Error: invalid EGL device
Device 1 = /dev/dri/card2
Error: invalid EGL device


4.) No device enabled:

[root@clip-g2-2 bin]# echo "c 195:1 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny

[root@clip-g2-2 bin]# nvidia-smi
No devices were found

[root@clip-g2-2 bin]# for i in {1..4}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Error: invalid EGL device
Error: invalid EGL device
Error: invalid EGL device
Error: invalid EGL device



So it seems that the patch works as intended. If I need to do any additional tests, let me know.

dcommander commented 3 years ago

Perfect. Now can you repeat the same analysis with 460.xx?

timeu commented 3 years ago

OK, here are the tests for the 460.xx driver (note: these are slightly different GPU nodes with 8 GPUs instead of 4 and a different GPU type, P100 vs. V100; however, I don't think this should make a difference for our tests):

1.) No cgroup setup (all 8 GPUs accessible, showing only the first 4):

[root@stg-g1-0 bin]# for i in {1..4}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Device 0 = /dev/dri/card1
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
Device 3 = /dev/dri/card4
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2

2.) Only first device enabled:

[root@stg-g1-0 bin]# echo $$ > /sys/fs/cgroup/devices/egl_debug/tasks
[root@stg-g1-0 bin]# for i in {0..7}; do echo "c 195:$i rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny; done
[root@stg-g1-0 bin]# echo "c 195:0 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.allow

[root@stg-g1-0 bin]# nvidia-smi
Mon Mar  1 19:43:21 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:06.0 Off |                    0 |
| N/A   26C    P0    24W / 250W |      0MiB / 12198MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

[root@stg-g1-0 bin]# for i in {1..4}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Error: no EGL devices found
Error: no EGL devices found
Error: no EGL devices found
Error: no EGL devices found

3.) Only second device enabled:

[root@stg-g1-0 bin]# echo "c 195:1 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.allow
[root@stg-g1-0 bin]# echo "c 195:0 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny

[root@stg-g1-0 bin]# for i in {1..4}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Error: no EGL devices found
Error: no EGL devices found
Error: no EGL devices found
Error: no EGL devices found

4.) No device enabled:

[root@stg-g1-0 bin]# for i in {0..7}; do echo "c 195:$i rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny; done

[root@stg-g1-0 bin]# nvidia-smi
No devices were found

[root@stg-g1-0 bin]# for i in {1..4}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Error: no EGL devices found
Error: no EGL devices found
Error: no EGL devices found
Error: no EGL devices found


5.) All devices enabled:

[root@stg-g1-0 bin]# for i in {0..7}; do echo "c 195:$i rwm" > /sys/fs/cgroup/devices/egl_debug/devices.allow; done

[root@stg-g1-0 bin]# for i in {1..8}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Device 0 = /dev/dri/card1
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
Device 3 = /dev/dri/card4
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
Device 3 = /dev/dri/card4
Device 4 = /dev/dri/card5
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
Device 3 = /dev/dri/card4
Device 4 = /dev/dri/card5
Device 5 = /dev/dri/card6
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
Device 3 = /dev/dri/card4
Device 4 = /dev/dri/card5
Device 5 = /dev/dri/card6
Device 6 = /dev/dri/card7
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
Device 3 = /dev/dri/card4
Device 4 = /dev/dri/card5
Device 5 = /dev/dri/card6
Device 6 = /dev/dri/card7
Device 7 = /dev/dri/card8
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2



So for some reason the behavior changed in the newer driver release. It seems that as soon as one card is disabled, nothing is returned.

dcommander commented 3 years ago

Thanks for testing. Your results suggest that nVidia either doesn't properly support cgroups or doesn't test them very well. In 455.xx, eglQueryDevicesEXT() returns devices that are inaccessible and segfaults when one of those devices is passed to eglQueryDeviceStringEXT(). In 460.xx, eglQueryDevicesEXT() returns no devices if only one of them is inaccessible.

I ordered a Quadro P620, which is the lowest-cost nVidia GPU that supports the latest drivers. Once that arrives later this week, I will have the ability to do multi-GPU testing in house with the latest nVidia proprietary drivers. I want to test various revisions of those drivers as far back as 390.xx and see exactly when the segfault was introduced. I also want to experiment with other ways of working around the issue. I'll keep you posted. Until I have a complete picture of the issue across all EGL-supporting driver revisions, I won't have a good sense of whether it makes sense to push the patch. It probably makes more sense to take up this issue with nVidia and try to get a proper fix from them.

timeu commented 3 years ago

@dcommander Thanks for all the effort. If you need me to do any more testing, let me know. In the meantime, I will try to open a ticket at https://github.com/NVIDIA/libglvnd and see whether NVIDIA can provide some more insight.

dcommander commented 3 years ago

That GitHub project is just for GLVND, not for the nVidia proprietary drivers. You probably need to go through nVidia's tech support channels or try posting here: https://forums.developer.nvidia.com/c/gpu-unix-graphics/linux/148.

dcommander commented 3 years ago

Results:

390.xx

Without patch: eglinfo fails with "Could not initialize EGL" if passed a cgroup-blacklisted device. More specifically, eglQueryDevicesEXT() returns all devices, including cgroup-blacklisted devices, and eglQueryDeviceStringEXT() works properly for all devices regardless of whether they are blacklisted. With patch: eglinfo fails with "Invalid EGL device" if passed a cgroup-blacklisted device.

418.xx-450.xx inclusive

Without patch: eglinfo segfaults if passed a cgroup-blacklisted device. More specifically, eglQueryDevicesEXT() returns all devices, including cgroup-blacklisted devices, and eglQueryDeviceStringEXT() segfaults when passed a cgroup-blacklisted device.

With patch: eglinfo fails with "Invalid EGL device" if passed a cgroup-blacklisted device.

460.xx

Without patch: eglinfo fails with "No EGL devices found" if any device is cgroup-blacklisted, regardless of which device is passed to eglinfo. More specifically, eglQueryDevicesEXT() returns no devices if any device is cgroup-blacklisted. With patch: The behavior is the same as without the patch.

Unfortunately, I don't see any way to make the EGL back end work with cgroups and 460.xx. Nothing in the 460.xx change log suggested why it broke, so my guess is that nVidia doesn't even test cgroups. Given that I was unable to blacklist regular /dev/fb* devices (including an nVidia GPU using nouveau), I have my doubts whether this is even supposed to work in any official capacity.

If your organization is willing to use 450.xx, then I am willing to push the patch. Otherwise, I would suggest pursuing the matter with nVidia.

dcommander commented 3 years ago

I have pushed the commit, since it at least allows cgroups to be used with the EGL back end and v450.xx and earlier. That's the limit of what I can do at the moment.

ehfd commented 3 years ago

I normally use the shell script below to choose devices, but I think this might be a problem with how the SLURM Generic RESource (GRES) management handles cgroups. Pyxis (https://github.com/NVIDIA/pyxis) might make things a bit better, or not.

From https://github.com/ehfd/docker-nvidia-egl-desktop/blob/main/bootstrap.sh

printf "3\nn\nx\n" | sudo /opt/VirtualGL/bin/vglserver_config

for DRM in /dev/dri/card*; do
  if /opt/VirtualGL/bin/eglinfo "$DRM"; then
    export VGL_DISPLAY="$DRM"
    break
  fi
done

dcommander commented 3 years ago

@ehfd The commit I pushed should have effectively worked around the problem for 450.xx and earlier, but the issue with 460.xx is that EGL returns no available devices if any device is cgroup-blacklisted. I fail to see how your solution works around that.