I installed the debuginfos and ran /opt/VirtualGL/bin/eglinfo /dev/dri/card2
through GDB. This is the backtrace:
(gdb) run /dev/dri/card2
Starting program: /opt/VirtualGL/bin/eglinfo /dev/dri/card2
Missing separate debuginfos, use: debuginfo-install glibc-2.17-323.el7_9.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff5d4fff0 in ?? () from /.singularity.d/libs/libEGL_nvidia.so.0
Missing separate debuginfos, use: debuginfo-install libX11-1.6.7-3.el7_9.x86_64 libXau-1.0.8-2.1.el7.x86_64 libXext-1.3.3-3.el7.x86_64 libdrm-2.4.97-2.el7.x86_64 libxcb-1.13-1.el7.x86_64
(gdb) backtrace
#0 0x00007ffff5d4fff0 in ?? () from /.singularity.d/libs/libEGL_nvidia.so.0
#1 0x0000000000401edc in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/VirtualGL-2.6.80/glxdemos/eglinfo.c:707
(gdb) frame 1
#1 0x0000000000401edc in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/VirtualGL-2.6.80/glxdemos/eglinfo.c:707
707 _eglQueryDeviceStringEXT(devices[i], EGL_DRM_DEVICE_FILE_EXT);
Edit:
To me it looks like this code, iterating over all /dev/dri/card* devices and calling https://github.com/VirtualGL/virtualgl/blob/dev/glxdemos/eglinfo.c#L707, will segfault if the device is not accessible, i.e. if access to it was denied via the cgroup device subsystem.
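For illustration, here is a minimal standalone sketch of that enumeration pattern (not the actual eglinfo code; it assumes the EGL_EXT_device_enumeration and EGL_EXT_device_drm extensions, caps the device count at 16, and would be linked with -lEGL):

/* Hedged sketch of the enumeration pattern that appears to trigger the crash:
 * query all EGL devices, then ask each one for its DRM device file.
 * Not the actual eglinfo.c code; extension checks are omitted for brevity. */
#include <stdio.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>

int main(void)
{
  PFNEGLQUERYDEVICESEXTPROC queryDevices =
    (PFNEGLQUERYDEVICESEXTPROC)eglGetProcAddress("eglQueryDevicesEXT");
  PFNEGLQUERYDEVICESTRINGEXTPROC queryDeviceString =
    (PFNEGLQUERYDEVICESTRINGEXTPROC)eglGetProcAddress("eglQueryDeviceStringEXT");
  EGLDeviceEXT devices[16];
  EGLint i, numDevices = 0;

  if (!queryDevices || !queryDeviceString) return 1;
  if (!queryDevices(16, devices, &numDevices)) return 1;

  for (i = 0; i < numDevices; i++) {
    /* With 455.xx, this call reportedly segfaults for a device whose
       /dev/dri/cardX node has been denied via the cgroup device subsystem. */
    const char *devStr =
      queryDeviceString(devices[i], EGL_DRM_DEVICE_FILE_EXT);
    printf("Device %d = %s\n", i, devStr ? devStr : "(null)");
  }
  return 0;
}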
The EGL back end feature already has a 60-hour funding deficit, and at the moment, I'm facing the prospect of borrowing against the entire 2021/2022 VirtualGL General Fund just to finish the 3.0 release, including documenting the EGL back end and fixing issues with it. I do not have the resources to dig into this issue and figure out how to reproduce it outside of a SLURM environment. Simply changing the permissions on /dev/dri/card0 doesn't reproduce the issue, I have no clue how to set up cgroups, I have no time to learn how to do that unless someone is paying for that time, and even if I could reproduce the issue, I have no clue how to work around it (assuming such is even possible.)
The more information you can give me, and the easier you can make it for me to repro the issue and understand potential workarounds, the more likely it is to get fixed. At the moment, I don't have enough to work with.
@dcommander: I understand.
I will try to create a reproducible small case without SLURM (just using cgroup) and post more information as soon as I have something.
It might also not be an issue with the VirtualGL code but rather a bug in NVIDIA's EGL library, because I tried to run https://github.com/KDAB/eglinfo/blob/master/main.cpp, which just enumerates the EGL back ends, and it segfaults the same way (as far as I can tell, when calling eglGetPlatformDisplayEXT in libEGL). I will open an issue in https://github.com/NVIDIA/libglvnd and see if they have some information.
OK, thanks. I strongly suspect that this is an nVidia issue, and frankly, I encountered several such issues already that had to be worked around in the EGL back end. Some couldn't be worked around, which is why the EGL back end can't be used with 415.xx. Unfortunately, there isn't another device-based EGL implementation that I can use to cross-check nVidia's. The AMDGPU driver pretends to support device-based EGL, but it doesn't actually work.
How does one help fund the 60 hours?
@shanerade contact me through e-mail (https://virtualgl.org/About/Contact) if you would like to make a large donation. Otherwise, there is a link for small donations on the landing page of https://virtualgl.org.
So I created a reproducible case for NVIDIA driver version 455.23.05, on CentOS Linux release 7.9.2009 (Core) with the 3.10.0-1127.19.1.el7.x86_64 kernel:
mkdir /sys/fs/cgroup/devices/egl_debug
echo $$ > /sys/fs/cgroup/devices/egl_debug/tasks
echo "c 195:0 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny
nvidia-smi now shows that the GPU is not visible, and running
/opt/VirtualGL/bin/eglinfo /dev/dri/cardX
on the corresponding device reproduces the segfault.
FYI:
On our test system, where we upgraded the NVIDIA driver to 460.32.03, I don't get any segfault; however, I get the error message: Error: no EGL devices found
Reproduced, but I'm not sure what I can do about it. Given that the problem goes away with a newer driver, it was apparently a driver bug. I'm open to suggestions, but I personally have no ideas regarding how to work around this. If eglQueryDeviceStringEXT() will segfault on a particular device, then that device should not have been enumerated by eglQueryDevicesEXT(). It sounds like that is exactly the bug that nVidia fixed.
I am not sure if it is really fixed in the newer driver version. The segfault doesn't happen anymore, but as soon as access to one GPU is removed using cgroups, running /opt/VirtualGL/bin/eglinfo on the EGL device that is still accessible returns Error: no EGL devices found. This could still be an NVIDIA driver issue.
In any case, we decided to use the traditional GLX back end instead of the EGL back end, because with EGL I also have the issue of selecting the correct /dev/dri/cardX device on an HPC system where the user might only have access to a subset of GPUs.
If you specify VGL_DISPLAY=egl (or pass -d egl to vglrun), VirtualGL will use the first EGL device it encounters. Thus, calling eglQueryDeviceStringEXT() isn't strictly necessary in that case, but eliminating that call when VGL_DISPLAY==egl wouldn't necessarily work around the bug. If the drivers are returning inaccessible devices from the call to eglQueryDevicesEXT(), then one of those inaccessible devices might be the first device returned.
However, let's try something.
--- a/glxdemos/eglinfo.c
+++ b/glxdemos/eglinfo.c
@@ -702,9 +702,22 @@ main(int argc, char *argv[])
fprintf(stderr, "Error: eglQueryDeviceStringEXT() could not be loaded");
return -1;
}
+ _eglGetPlatformDisplayEXT =
+ (PFNEGLGETPLATFORMDISPLAYEXTPROC)eglGetProcAddress("eglGetPlatformDisplayEXT");
+ if (!_eglGetPlatformDisplayEXT) {
+ fprintf(stderr, "Error: eglGetPlatformDisplayEXT() could not be loaded\n");
+ return -1;
+ }
for (i = 0; i < numDevices; i++) {
- const char *devStr =
- _eglQueryDeviceStringEXT(devices[i], EGL_DRM_DEVICE_FILE_EXT);
+ const char *devStr;
+
+ edpy = _eglGetPlatformDisplayEXT(EGL_PLATFORM_DEVICE_EXT, devices[i],
+ NULL);
+ if (!edpy || !eglInitialize(edpy, &major, &minor))
+ continue;
+ eglTerminate(edpy);
+ devStr = _eglQueryDeviceStringEXT(devices[i], EGL_DRM_DEVICE_FILE_EXT);
+ fprintf(stderr, "Device %d = %s\n", i, devStr);
if (devStr && !strcmp(devStr, opts.displayName))
break;
}
@@ -713,12 +726,6 @@ main(int argc, char *argv[])
free(devices);
return -1;
}
- _eglGetPlatformDisplayEXT =
- (PFNEGLGETPLATFORMDISPLAYEXTPROC)eglGetProcAddress("eglGetPlatformDisplayEXT");
- if (!_eglGetPlatformDisplayEXT) {
- fprintf(stderr, "Error: eglGetPlatformDisplayEXT() could not be loaded\n");
- return -1;
- }
edpy = _eglGetPlatformDisplayEXT(EGL_PLATFORM_DEVICE_EXT, devices[i], NULL);
if (!edpy) {
fprintf(stderr, "Error: unable to open EGL display\n");
Unless I miss my guess, this should filter out the inaccessible devices.
I can add that same code to the VGL faker, if it effectively works around the problem from your point of view. That should make VGL_DISPLAY=egl work properly for the case in which only one GPU is accessible. If multiple GPUs are accessible, then you'll still have to select one somehow, but you would have to do that with the GLX back end as well.
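To make the proposed faker change concrete, here is a rough, hypothetical sketch (not VirtualGL's actual code; the helper name and the fixed 16-device cap are assumptions) of how the same initialize-and-terminate probe from the patch could pick the first accessible device for VGL_DISPLAY=egl:

/* Hypothetical helper: return the first EGL device whose platform display
 * can actually be initialized, skipping devices that fail (e.g. because
 * cgroups denies access to the underlying DRM node). */
#include <EGL/egl.h>
#include <EGL/eglext.h>

EGLDeviceEXT firstUsableDevice(void)
{
  PFNEGLQUERYDEVICESEXTPROC queryDevices =
    (PFNEGLQUERYDEVICESEXTPROC)eglGetProcAddress("eglQueryDevicesEXT");
  PFNEGLGETPLATFORMDISPLAYEXTPROC getPlatformDisplay =
    (PFNEGLGETPLATFORMDISPLAYEXTPROC)eglGetProcAddress("eglGetPlatformDisplayEXT");
  EGLDeviceEXT devices[16];
  EGLint i, numDevices = 0, major, minor;

  if (!queryDevices || !getPlatformDisplay) return EGL_NO_DEVICE_EXT;
  if (!queryDevices(16, devices, &numDevices)) return EGL_NO_DEVICE_EXT;

  for (i = 0; i < numDevices; i++) {
    EGLDisplay edpy =
      getPlatformDisplay(EGL_PLATFORM_DEVICE_EXT, devices[i], NULL);
    if (!edpy || !eglInitialize(edpy, &major, &minor))
      continue;                 /* inaccessible device: skip it */
    eglTerminate(edpy);
    return devices[i];          /* first device that actually initializes */
  }
  return EGL_NO_DEVICE_EXT;
}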
@dcommander thanks for looking into this.
I haven't really tried vglrun with cgroup-constrained GPUs, only eglinfo. Is the code path the same?
I am also surprised that, when running /opt/VirtualGL/bin/eglinfo on the /dev/dri/cardX device that should still be accessible, I still get Error: no EGL devices found. Does it also iterate over all EGL devices even if I specify a particular EGL device?
I can definitely try the patch, and I will also try vglrun.
"@dcommander thanks for looking into this. I haven't really tried vglrun with cgroup-constrained GPUs, only eglinfo. Is the code path the same?"

The VirtualGL faker and eglinfo do not share the same literal code, but the algorithm they use to scan for an EGL device is the same. Thus, if we can make eglinfo work properly, then I can port the same changes into the faker.

"I am also surprised that, when running /opt/VirtualGL/bin/eglinfo on the /dev/dri/cardX device that should still be accessible, I still get Error: no EGL devices found. Does it also iterate over all EGL devices even if I specify a particular EGL device?"
Here is the relevant code:
https://github.com/VirtualGL/virtualgl/blob/dev/glxdemos/eglinfo.c#L678-L732
https://github.com/VirtualGL/virtualgl/blob/dev/server/faker.cpp#L198-L222
The answer to your question is multi-pronged:

In the faker, you can specify either a DRI device path or egl using VGL_DISPLAY or vglrun -d:
- If you specify a DRI device path, the faker iterates over the devices returned by eglQueryDevicesEXT() until it finds a device matching the name you specified. If it doesn't find one, it aborts with "Invalid EGL device."
- If you specify egl, then the faker chooses the first device returned by eglQueryDevicesEXT().

In eglinfo:
- eglinfo does not currently have an equivalent to VGL_DISPLAY=egl.

Now as to why the newer driver isn't exposing the device that should be accessible, I'm not sure. That error message means that eglinfo iterated over all of the devices returned by eglQueryDevicesEXT() and did not find one that matches the name you specified.
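For reference, the selection logic described above can be summarized roughly as follows (a simplified, hypothetical sketch rather than the literal faker or eglinfo source; the mapping of error messages to conditions reflects the outputs quoted later in this thread):

/* Hypothetical sketch of the device-selection logic.  displayName is either
 * "egl" (faker only) or a DRI device path such as "/dev/dri/card1". */
#include <stdio.h>
#include <string.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>

EGLDeviceEXT chooseDevice(const char *displayName)
{
  PFNEGLQUERYDEVICESEXTPROC queryDevices =
    (PFNEGLQUERYDEVICESEXTPROC)eglGetProcAddress("eglQueryDevicesEXT");
  PFNEGLQUERYDEVICESTRINGEXTPROC queryDeviceString =
    (PFNEGLQUERYDEVICESTRINGEXTPROC)eglGetProcAddress("eglQueryDeviceStringEXT");
  EGLDeviceEXT devices[16];
  EGLint i, numDevices = 0;

  if (!queryDevices || !queryDeviceString ||
      !queryDevices(16, devices, &numDevices) || numDevices < 1) {
    fprintf(stderr, "Error: no EGL devices found\n");
    return EGL_NO_DEVICE_EXT;
  }
  if (!strcmp(displayName, "egl"))
    return devices[0];          /* VGL_DISPLAY=egl: first enumerated device */

  for (i = 0; i < numDevices; i++) {
    const char *devStr =
      queryDeviceString(devices[i], EGL_DRM_DEVICE_FILE_EXT);
    if (devStr && !strcmp(devStr, displayName))
      return devices[i];        /* matched the requested /dev/dri/cardX */
  }
  fprintf(stderr, "Error: invalid EGL device\n");
  return EGL_NO_DEVICE_EXT;
}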
At this point, I need you to test the patch and report on its behavior under various scenarios. I don't expect that the patch will be a full solution, but I need to see how the behavior changes in order to move forward. Unfortunately, I don't have a multi-GPU system at my disposal, so I can't fully reproduce all of the issues you are reporting. I can only reproduce the issue with a single device.
Unfortunately, I am stuck without your help. I was able to successfully configure one of my machines to use two GPUs, an AMD Radeon Pro WX2100 using the amdgpu driver and an nVidia Quadro 600 using the nouveau driver (NOTE: I can't use the nVidia proprietary driver with this GPU because it's too old.) Both work with device-based EGL, but unfortunately, the cgroups trick doesn't work. I tried using devices 29:0 and 29:1, which correspond to /dev/fb0 and /dev/fb1, but the GPUs were both still accessible. Since I only have one nVidia GPU that is supported by the current drivers, I can't reproduce the multi-GPU aspect of this issue, and that seems critical to solving the problem.
@dcommander Thanks for the detailed explanation. That makes things clearer. I am happy to test the patch on our HPC system and see how it behaves. I will try to do this today.
For HPC systems where users might request a number of GPUs on a node with multiple GPUs (we have nodes with 4 and 8 GPUs), the batch scheduler (in our case SLURM) uses cgroups to grant access to one (or more) GPUs on the node where the user's job is scheduled. Apart from the above issue (the segfault with old drivers and not finding the right EGL device with the new drivers), the EGL back end also forces us to figure out the right EGL device for the GPU that the user has access to. I wasn't aware of the VGL_DISPLAY=egl solution, but that would be perfect for an HPC system (typically the user will only request a single GPU for the various visualization applications that they use).
Our current solution is to create a static /etc/X11/xorg.conf file that configures one screen for each of the GPUs on the node. When the user submits a job that requires OpenGL (typically using xpra), we start an X server beforehand using that configuration, and X11 will use the GPU it has access to (basically, it just works). It would be great if we could get the same behavior with the EGL back end (it seems that is what VGL_DISPLAY=egl already does).
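For illustration only, here is a minimal sketch of the kind of static xorg.conf described above. The BusID values are taken from the nvidia-smi output quoted later in this thread, and the section layout and options (e.g. UseDisplayDevice) are assumptions for the sketch, not a copy of the actual file:

Section "ServerLayout"
    Identifier "Layout0"
    Screen 0 "Screen0"
    Screen 1 "Screen1" RightOf "Screen0"
    # ...one Screen entry per GPU on the node
EndSection

# GPU 0 (Bus-Id 00000000:00:06.0 in the nvidia-smi output later in this thread)
Section "Device"
    Identifier "Device0"
    Driver     "nvidia"
    BusID      "PCI:0:6:0"
EndSection

Section "Screen"
    Identifier "Screen0"
    Device     "Device0"
    Option     "UseDisplayDevice" "None"
EndSection

# GPU 1 (Bus-Id 00000000:00:07.0), and so on for the remaining GPUs
Section "Device"
    Identifier "Device1"
    Driver     "nvidia"
    BusID      "PCI:0:7:0"
EndSection

Section "Screen"
    Identifier "Screen1"
    Device     "Device1"
    Option     "UseDisplayDevice" "None"
EndSection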
@dcommander : I re-compiled the dev branch with the above patch applied. Without cgroup-constrained GPUs, it works fine:
[root@stg-g1-0 virtualgl]# bin/eglinfo -B /dev/dri/card0
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
Device 3 = /dev/dri/card4
Device 4 = /dev/dri/card5
Device 5 = /dev/dri/card6
Device 6 = /dev/dri/card7
Device 7 = /dev/dri/card8
Error: invalid EGL device
[root@stg-g1-0 virtualgl]# bin/eglinfo -B /dev/dri/card1
Device 0 = /dev/dri/card1
device: /dev/dri/card1
Memory info (GL_NVX_gpu_memory_info):
Dedicated video memory: 12288 MB
Total available memory: 12288 MB
Currently available dedicated video memory: 12194 MB
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
OpenGL version string: 4.6.0 NVIDIA 460.32.03
OpenGL shading language version string: 4.60 NVIDIA
However, as soon as I constrain access to a single GPU, I get Error: no EGL devices found.
I will try to add some print statements and see why it's bailing out, unless you have a better idea of how best to debug it (maybe using GDB?).
Let's please focus on one thing at a time. I am first trying to work around the segfault in 455.xx. Then we can discuss how to work around the other issue in 460.xx.
@dcommander: Sorry for the confusion. So I tested your patch against the 455.23.05 driver version and indeed the patch fixes the segfault error:
Unpatched version:
[root@clip-g2-2]# /groups/it/uemit/virtualgl_orig/bin/eglinfo /dev/dri/card2 -B
Segmentation fault
Patched version:
[root@clip-g2-2]# /groups/it/uemit/virtualgl/bin/eglinfo /dev/dri/card2 -B
Device 1 = /dev/dri/card2
device: /dev/dri/card2
Memory info (GL_NVX_gpu_memory_info):
Dedicated video memory: 32768 MB
Total available memory: 32768 MB
Currently available dedicated video memory: 32503 MB
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: Tesla V100-PCIE-32GB/PCIe/SSE2
OpenGL version string: 4.6.0 NVIDIA 455.23.05
OpenGL shading language version string: 4.60 NVIDIA
Can you please test it methodically? I really need to know how or if the patch works under the following scenarios:

1. Both devices enabled, passing each device to eglinfo. Both should work.
2. Only the first device enabled, passing each device to eglinfo. The first should work, and the second should abort with Invalid EGL device.
3. Only the second device enabled, passing each device to eglinfo. The second should work, and the first should abort with Invalid EGL device.
4. No devices enabled, passing each device to eglinfo. Both should abort with Invalid EGL device.

Once we have verified the correct behavior with 455.xx in all scenarios above, then I will check in the patch, and we can move on to 460.xx.
The node has 4 GPUs, so I tested following scenarios:
1.) No cgroup setup (all 4 GPUs accessible):
[root@clip-g2-2 ~]# nvidia-smi
Mon Mar 1 18:49:55 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:00:06.0 Off | 0 |
| N/A 30C P0 25W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... On | 00000000:00:07.0 Off | 0 |
| N/A 38C P0 44W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-PCIE... On | 00000000:00:08.0 Off | 0 |
| N/A 29C P0 25W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-PCIE... On | 00000000:00:09.0 Off | 0 |
| N/A 32C P0 25W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[root@clip-g2-2 bin]# for i in {1..4}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Device 0 = /dev/dri/card1
OpenGL renderer string: Tesla V100-PCIE-32GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
OpenGL renderer string: Tesla V100-PCIE-32GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
OpenGL renderer string: Tesla V100-PCIE-32GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
Device 3 = /dev/dri/card4
OpenGL renderer string: Tesla V100-PCIE-32GB/PCIe/SSE2
2.) Only first device enabled:
[root@clip-g2-2 bin]# echo $$ > /sys/fs/cgroup/devices/egl_debug/tasks
[root@clip-g2-2 bin]# echo "c 195:0 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.allow
[root@clip-g2-2 bin]# echo "c 195:1 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny
[root@clip-g2-2 bin]# echo "c 195:2 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny
[root@clip-g2-2 bin]# echo "c 195:3 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny
[root@clip-g2-2 bin]# nvidia-smi
Mon Mar 1 19:02:45 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:00:06.0 Off | 0 |
| N/A 30C P0 25W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[root@clip-g2-2 bin]# for i in {1..4}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Device 0 = /dev/dri/card1
OpenGL renderer string: Tesla V100-PCIE-32GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Error: invalid EGL device
Device 0 = /dev/dri/card1
Error: invalid EGL device
Device 0 = /dev/dri/card1
Error: invalid EGL device
3.) Only second device enabled
[root@clip-g2-2 bin]# echo "c 195:1 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.allow
[root@clip-g2-2 bin]# echo "c 195:0 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny
[root@clip-g2-2 bin]# nvidia-smi
Mon Mar 1 19:04:31 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:00:07.0 Off | 0 |
| N/A 38C P0 44W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[root@clip-g2-2 bin]# for i in {1..4}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Device 1 = /dev/dri/card2
Error: invalid EGL device
Device 1 = /dev/dri/card2
OpenGL renderer string: Tesla V100-PCIE-32GB/PCIe/SSE2
Device 1 = /dev/dri/card2
Error: invalid EGL device
Device 1 = /dev/dri/card2
Error: invalid EGL device
4.) No device enabled
[root@clip-g2-2 bin]# echo "c 195:1 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny
[root@clip-g2-2 bin]# nvidia-smi
No devices were found
[root@clip-g2-2 bin]# for i in {1..4}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Error: invalid EGL device
Error: invalid EGL device
Error: invalid EGL device
Error: invalid EGL device
So it seems that the patch works as intended. If I need to do any additional tests, let me know
Perfect. Now can you repeat the same analysis with 460.xx?
Ok, here are the tests for the 460.xx driver (note these are slightly different GPU nodes, with 8 GPUs instead of 4 and a different GPU type, P100 vs. V100; however, I don't think this should make a difference for our tests):
1.) No cgroup setup (all 8 GPUs accessible, showing only the first 4):
[root@stg-g1-0 bin]# for i in {1..4}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Device 0 = /dev/dri/card1
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
Device 3 = /dev/dri/card4
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
2.) Only first device enabled:
[root@stg-g1-0 bin]# echo $$ > /sys/fs/cgroup/devices/egl_debug/tasks
[root@stg-g1-0 bin]# for i in {0..7}; do echo "c 195:$i rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny; done
[root@stg-g1-0 bin]# echo "c 195:0 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.allow
[root@stg-g1-0 bin]# nvidia-smi
Mon Mar 1 19:43:21 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:00:06.0 Off | 0 |
| N/A 26C P0 24W / 250W | 0MiB / 12198MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[root@stg-g1-0 bin]# for i in {1..4}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Error: no EGL devices found
Error: no EGL devices found
Error: no EGL devices found
Error: no EGL devices found
3.) Only second device enabled:
[root@stg-g1-0 bin]# echo "c 195:1 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.allow
[root@stg-g1-0 bin]# echo "c 195:0 rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny
[root@stg-g1-0 bin]# for i in {1..4}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Error: no EGL devices found
Error: no EGL devices found
Error: no EGL devices found
Error: no EGL devices found
4.) No device enabled:
[root@stg-g1-0 bin]# for i in {0..7}; do echo "c 195:$i rwm" > /sys/fs/cgroup/devices/egl_debug/devices.deny; done
[root@stg-g1-0 bin]# nvidia-smi
No devices were found
[root@stg-g1-0 bin]# for i in {1..4}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Error: no EGL devices found
Error: no EGL devices found
Error: no EGL devices found
Error: no EGL devices found
5.) All devices enabled:
[root@stg-g1-0 bin]# for i in {0..7}; do echo "c 195:$i rwm" > /sys/fs/cgroup/devices/egl_debug/devices.allow; done
[root@stg-g1-0 bin]# for i in {1..8}; do ./eglinfo -B /dev/dri/card$i | grep Tesla; done
Device 0 = /dev/dri/card1
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
Device 3 = /dev/dri/card4
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
Device 3 = /dev/dri/card4
Device 4 = /dev/dri/card5
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
Device 3 = /dev/dri/card4
Device 4 = /dev/dri/card5
Device 5 = /dev/dri/card6
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
Device 3 = /dev/dri/card4
Device 4 = /dev/dri/card5
Device 5 = /dev/dri/card6
Device 6 = /dev/dri/card7
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
Device 0 = /dev/dri/card1
Device 1 = /dev/dri/card2
Device 2 = /dev/dri/card3
Device 3 = /dev/dri/card4
Device 4 = /dev/dri/card5
Device 5 = /dev/dri/card6
Device 6 = /dev/dri/card7
Device 7 = /dev/dri/card8
OpenGL renderer string: Tesla P100-PCIE-12GB/PCIe/SSE2
So for some reason the behavior changed in the newer driver release. It seems that as soon as 1 card is not enabled, nothing is returned.
Thanks for testing. Your results suggest that nVidia either doesn't properly support cgroups or doesn't test them very well. In 455.xx, eglQueryDevicesEXT() returns devices that are inaccessible, and eglQueryDeviceStringEXT() segfaults when one of those devices is passed to it. In 460.xx, eglQueryDevicesEXT() returns no devices if even one of them is inaccessible.
I ordered a Quadro P620, which is the lowest-cost nVidia GPU that supports the latest drivers. Once that arrives later this week, I will have the ability to do multi-GPU testing in house with the latest nVidia proprietary drivers. I want to test various revisions of those drivers as far back as 390.xx and see exactly when the segfault was introduced. I also want to experiment with other ways of working around the issue. I'll keep you posted. Until I have a complete picture of the issue across all EGL-supporting driver revisions, I won't have a good sense of whether it makes sense to push the patch. It probably makes more sense to take up this issue with nVidia and try to get a proper fix from them.
@dcommander Thanks for all the effort. If you need me to do any more testing, let me know. In the meantime I will try to open a ticket at https://github.com/NVIDIA/libglvnd and see whether NVIDIA can provide some more insight.
That GitHub project is just for GLVND, not for the nVidia proprietary drivers. You probably need to go through nVidia's tech support channels or try posting here: https://forums.developer.nvidia.com/c/gpu-unix-graphics/linux/148.
Results:

450.xx and earlier:
Without patch: eglinfo fails with "Could not initialize EGL" if passed a cgroup-blacklisted device. More specifically, eglQueryDevicesEXT() returns all devices, including cgroup-blacklisted devices, and eglQueryDeviceStringEXT() works properly for all devices regardless of whether they are blacklisted.
With patch: eglinfo fails with "Invalid EGL device" if passed a cgroup-blacklisted device.

455.xx:
Without patch: eglinfo segfaults if passed a cgroup-blacklisted device. More specifically, eglQueryDevicesEXT() returns all devices, including cgroup-blacklisted devices, and eglQueryDeviceStringEXT() segfaults when passed a cgroup-blacklisted device.
With patch: eglinfo fails with "Invalid EGL device" if passed a cgroup-blacklisted device.

460.xx:
Without patch: eglinfo fails with "No EGL devices found" if any device is cgroup-blacklisted, regardless of which device is passed to eglinfo. More specifically, eglQueryDevicesEXT() returns no devices if any device is cgroup-blacklisted.
With patch: The behavior is the same as without the patch.
Unfortunately, I don't see any way to make the EGL back end work with cgroups and 460.xx. Nothing in the 460.xx change log suggested why it broke, so my guess is that nVidia doesn't even test cgroups. Given that I was unable to blacklist regular /dev/fb* devices (including an nVidia GPU using nouveau), I have my doubts whether this is even supposed to work in any official capacity.
If your organization is willing to use 450.xx, then I am willing to push the patch. Otherwise, I would suggest pursuing the matter with nVidia.
I have pushed the commit, since it at least allows cgroups to be used with the EGL back end and v450.xx and earlier. That's the limit of what I can do at the moment.
I normally use the shell script below to choose devices, but I think this might be a problem with how the SLURM Generic RESource (GRES) management handles cgroups. Pyxis (https://github.com/NVIDIA/pyxis) might make things a bit better, or not.
From https://github.com/ehfd/docker-nvidia-egl-desktop/blob/main/bootstrap.sh
printf "3\nn\nx\n" | sudo /opt/VirtualGL/bin/vglserver_config
for DRM in /dev/dri/card*; do
if /opt/VirtualGL/bin/eglinfo "$DRM"; then
export VGL_DISPLAY="$DRM"
break
fi
done
@ehfd The commit I pushed should have effectively worked around the problem for 450.xx and earlier, but the issue with 460.xx is that EGL returns no available devices if any device is cgroup-blacklisted. I fail to see how your solution works around that.
We have a SLURM cluster with several GPU nodes with multiple GPUs (NVIDIA P100, V100, and RTX). The SLURM cluster is configured to constrain GPU access based on the user's resource request, using the cgroup devices subsystem. I installed the 3.x preview branch (2.6.80) of VirtualGL in a Singularity container.
When I ssh into the node as root without any SLURM allocation (so I have access to all 4 GPUs), the EGL back end works just fine.
The node has 4 cards installed (note: /dev/dri/card0 is the primary video display). Running eglinfo on the 4 cards works as expected.
I can run the OpenGL benchmark on all 4 cards just fine.
Next I try to submit an interactive SLURM job and request 1 GPU:
SLURM will restrict the GPU that I can access using the cgroup device subsystem:
Here I run into the first issue: which of the /dev/dri/cardX devices does the cgroup-allowed device map to? It seems that this feature request is somewhat related (autoegl). My current workaround is to test all of the /dev/dri/cardX devices with eglinfo. The correct device (/dev/dri/card1) returns the above output; if I select another device (e.g. /dev/dri/card2), it returns a segmentation fault.
If I use the correct device, everything works fine (OpenGL benchmark etc). Next I try to run another SLURM allocation using srun -p g --gres=gpu:1 --pty bash. SLURM will allocate the next available GPU card. However, now any of the /dev/dri/cardX devices will give me a segmentation fault when I run eglinfo on it.
Interestingly, if I request all 4 GPUs with srun -p g --gres=gpu:4 --pty bash, all 4 devices work fine.
I guess one workaround is to not use the EGL back end but use the traditional GLX back end, but it would be nice if I could get this working using the EGL back end.