VirtualGL / virtualgl

Main VirtualGL repository
https://VirtualGL.org

EGL back end: firefox not displaying correctly under VirtualGL and TurboVNC #216

Closed richardnixonshead closed 1 year ago

richardnixonshead commented 2 years ago

We run a Slurm cluster for HPC, using RHEL7. We're trying to enable an app that requires Firefox to run on the cluster. We don't run any dedicated 2d or 3d XServers - it is all on demand via TurboVNC and VirtualGL.

# rpm -qa | grep 'firefox\|VirtualGL\|turbovnc'
firefox-91.12.0-2.el7_9.x86_64
turbovnc-3.0.1-20220815.x86_64
VirtualGL-3.0.80-20221020.x86_64

# ps -ef | grep scrosby
scrosby 263378 263373 0 13:16 ? 00:00:00 /bin/bash /var/spool/slurm/job11681/slurm_script
scrosby 263500 1 2 13:16 ? 00:00:27 /opt/TurboVNC/bin/Xvnc :1 -desktop TurboVNC: thespian-gpgpu001.hpc.unimelb.edu.au:1 (scrosby) -auth /home/scrosby/.Xauthority -geometry 800x600 -depth 24 -rfbwait 120000 -rfbauth vnc.passwd -x509cert /home/scrosby/.vnc/x509_cert.pem -x509key /home/scrosby/.vnc/x509_private.pem -rfbport 5901 -fp catalogue:/etc/X11/fontpath.d -deferupdate 1 -dridir /usr/lib64/dri -registrydir /usr/lib64/xorg -idletimeout 0
scrosby 263526 263378 0 13:16 ? 00:00:00 bash /home/scrosby/ondemand/data/sys/dashboard/batch_connect/sys/cryosparc/output/24606fdc-7f03-4a24-80a4-90d255a185c2/script.sh
scrosby 263556 1 0 13:16 ? 00:00:00 dbus-launch --autolaunch 9ac174082e47417987cbb516e69dc378 --binary-syntax --close-stderr
scrosby 263558 1 0 13:16 ? 00:00:00 /usr/bin/dbus-daemon --fork --print-pid 6 --print-address 8 --session
scrosby 263566 1 0 13:16 ? 00:00:00 /usr/lib64/xfce4/xfconf/xfconfd
scrosby 263585 263526 0 13:16 ? 00:00:00 xfce4-session

When we launch e.g. glxgears via vglrun in this configuration, it works fine

$ vglrun +v glxgears
[VGL] Shared memory segment ID for vglconfig: 360472
[VGL] VirtualGL v3.0.80 64-bit (Build 20221020)
[VGL] Opening EGL device /dev/dri/card1
[VGL] Using pixel buffer objects for readback (BGR --> BGRA)
[VGL] ERROR: in readback--
[VGL] 288: Window has been deleted by window manager

I have attached the eglinfo output for /dev/dri/card1 as well.

eglinfo.txt

But Firefox (and Chromium) refuses to draw correctly. When run using vglrun +v, it doesn't look like it ever even tries to access the EGL device.

Attached is a screenshot of what Firefox looks like: [screenshot: firefox]

$ vglrun +v firefox
[VGL] Shared memory segment ID for vglconfig: 360477
[VGL] VirtualGL v3.0.80 64-bit (Build 20221020)
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.35156)
[GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.35156) |[1][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.351611)
[GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.35156) |[1][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.351611) |[2][GFX1-]: More than 2 GPUs detected via PCI, secondary GPU is arbitrary (t=0.351646)
[GFX1-]: More than 2 GPUs detected via PCI, secondary GPU is arbitrary
[Parent 273211, IPC I/O Parent] WARNING: Failed to launch tab subprocess: file /builddir/build/BUILD/firefox-91.12.0/ipc/glue/GeckoChildProcessHost.cpp:731
[Parent 273211, IPC I/O Parent] WARNING: Failed to launch tab subprocess: file /builddir/build/BUILD/firefox-91.12.0/ipc/glue/GeckoChildProcessHost.cpp:731

Can anyone help us debug this issue?

dcommander commented 2 years ago

It appears that Firefox may be trying to use the EGL_MESA_query_driver extension even though VirtualGL doesn't report it as being available. (That's a real pet peeve of mine, BTW. Software should never assume availability of an OpenGL, EGL, or GLX extension without checking, but a lot of it unfortunately does.) I'll see if I can reproduce the issue and interpose that extension if necessary.

dcommander commented 2 years ago

Firefox reports the same error regarding eglGetDisplayDriverName() when started on the local display without VirtualGL, and it appears that nVidia doesn't even support the EGL_MESA_query_driver extension. Thus, I'm not sure what's wrong. I tried the same Firefox version on my CentOS 7.9 machine with VirtualGL 3.1 evolving and a Quadro P620 with the latest nVidia drivers. It reports the eglGetDisplayDriverName() error, as you observed, but otherwise works fine.

dcommander commented 1 year ago

Can you confirm which GPU/driver revision you are using?

dcommander commented 1 year ago

@richardnixonshead Unless you can provide me with the requested information or any other information that might help me reproduce the issue, then I will have no choice but to close this bug report. I can't fix issues that I can't reproduce. Please confirm which GPU/driver revision you are using.

dcommander commented 1 year ago

Also, I notice that you are using 3.1 alpha. Please re-test with the latest 3.1 post-beta pre-release: https://virtualgl.org/DeveloperInfo/PreReleases

A timely response is appreciated, as I am trying to wrap up bug reports and release 3.1 ASAP.

richardnixonshead commented 1 year ago

We're using Nvidia A100s with driver 515.65.01.

I'll change to the latest 3.1, and let you know if anything has changed

dcommander commented 1 year ago

I'm seeing a less severe but probably related issue. I am attempting to nail it down.

richardnixonshead commented 1 year ago

It seems I can overcome the issue by setting

Exec=env MOZ_DISABLE_CONTENT_SANDBOX=1 firefox %u

in the Firefox desktop file, but without it, with the new VirtualGL, it still fails to launch.

dcommander commented 1 year ago

Since there are several open issues regarding Firefox and Chrome, I am thoroughly testing multiple versions to try and nail down the correct recipes. Stand by.

dcommander commented 1 year ago

https://github.com/VirtualGL/virtualgl/issues/228#issuecomment-1458759288

documents everything I know about VirtualGL, Firefox, and Chrome as of today, and the application recipes will soon be updated accordingly. I observed that all versions of Firefox < v94 require MOZ_DISABLE_CONTENT_SANDBOX=1 when using the GLX back end. However, I didn't observe any need for MOZ_DISABLE_CONTENT_SANDBOX=1 with the EGL back end, so your observations still don't fully jibe with mine.

dcommander commented 1 year ago

@richardnixonshead Can you help me reconcile your observations with mine? What you said above suggests that MOZ_DISABLE_CONTENT_SANDBOX=1 is necessary on your system when running Firefox v91 with the EGL back end, but my testing shows that that environment variable is only necessary when running Firefox v91 (actually any version < v94) with the GLX back end. I am running CentOS 7.9, so it should be a similar computing environment to yours. When I don't set MOZ_DISABLE_CONTENT_SANDBOX=1 with Firefox < 94 and the GLX back end, I get a VirtualGL error (ERROR: Could not open display :0) because the content sandbox prevents VirtualGL from opening a connection to the 3D X server, and the WebGL tab subsequently crashes. Since the EGL back end doesn't open a connection to the 3D X server, it works for me without MOZ_DISABLE_CONTENT_SANDBOX=1.

Also, please test the latest 3.1 post-beta pre-release (https://virtualgl.org/DeveloperInfo/PreReleases), as https://github.com/VirtualGL/virtualgl/commit/7a2d5c7d6312893c018d65f497ad3c3b6d9fb108 may have changed the behavior you observed (hopefully for the better.)

I'm mainly just trying to figure out whether I should modify the documentation to recommend always setting MOZ_DISABLE_CONTENT_SANDBOX=1 with Firefox < v94 or whether I should keep the current suggestion of setting that environment variable only when using the GLX back end with Firefox < v94.
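
As a rough illustration of the GLX/EGL distinction discussed above, the following sketch classifies a VGL_DISPLAY value. The function name `vgl_backend` is hypothetical and not part of VirtualGL. It assumes that a DRI device path such as /dev/dri/card1 selects the EGL back end and that anything else (including an unset value, treated here as :0) names a 3D X server and therefore the GLX back end. Other values that VirtualGL may accept are not handled.

```shell
#!/bin/sh
# Hypothetical helper (not part of VirtualGL): guess which back end a
# given VGL_DISPLAY value selects. Assumption: only /dev/dri/* paths
# mean EGL; every other value is treated as an X display (GLX).
vgl_backend() {
    case "${1:-:0}" in
        /dev/dri/*) echo egl ;;
        *)          echo glx ;;
    esac
}

vgl_backend /dev/dri/card1
vgl_backend :0
```

A launcher script could call `vgl_backend "$VGL_DISPLAY"` to decide whether the content sandbox is likely to block a connection to a 3D X server.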

dcommander commented 1 year ago

@richardnixonshead A timely response is appreciated. The VirtualGL 3.1 release is overdue, and I need to understand the differences in our observations so I can modify the documentation if necessary.

richardnixonshead commented 1 year ago

No luck I'm afraid with the 3.1 release. glxgears still working with EGL, but Firefox is still not working.

For completeness

[scrosby@thespian-gpgpu002 ~]$ export VGL_DISPLAY=/dev/dri/card1
[scrosby@thespian-gpgpu002 ~]$ /opt/VirtualGL/bin/eglinfo $VGL_DISPLAY
device: /dev/dri/card1
EGL client APIs string: OpenGL_ES OpenGL
EGL vendor string: NVIDIA
EGL version string: 1.5
display EGL extensions:
EGL_EXT_buffer_age, EGL_EXT_client_sync, EGL_EXT_create_context_robustness, EGL_EXT_image_dma_buf_import, EGL_EXT_image_dma_buf_import_modifiers, EGL_EXT_output_base, EGL_EXT_output_drm, EGL_EXT_present_opaque, EGL_EXT_protected_content, EGL_EXT_stream_acquire_mode, EGL_EXT_stream_consumer_egloutput, EGL_EXT_sync_reuse, EGL_IMG_context_priority, EGL_KHR_config_attribs, ...

[scrosby@thespian-gpgpu002 ~]$ ls -la /dev/dri/
total 0
drwxr-xr-x 2 root root 220 Jan 6 14:01 .
drwxr-xr-x 23 root root 3620 Mar 13 10:03 ..
crw-rw-rw- 1 root root 226, 0 Jan 6 14:01 card0
crw-rw-rw- 1 root root 226, 1 Jan 6 14:01 card1
crw-rw-rw- 1 root root 226, 2 Jan 6 14:01 card2
crw-rw-rw- 1 root root 226, 3 Jan 6 14:01 card3
crw-rw-rw- 1 root root 226, 4 Jan 6 14:01 card4
crw-rw-rw- 1 root root 226, 128 Jan 6 14:01 renderD128
crw-rw-rw- 1 root root 226, 129 Jan 6 14:01 renderD129
crw-rw-rw- 1 root root 226, 130 Jan 6 14:01 renderD130
crw-rw-rw- 1 root root 226, 131 Jan 6 14:01 renderD131

I note a lot of errors in the Firefox output when I start it, mentioning errors with more than 2 GPUs. Could that be the difference between your and my setup?

[scrosby@thespian-gpgpu002 ~]$ vglrun -c proxy firefox
[VGL] NOTICE: Automatically setting VGL_CLIENT environment variable to
[VGL] 127.0.0.1, the IP address of your SSH client.
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.598618)
[GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.598618) |[1][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.598662)
[GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.598618) |[1][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.598662) |[2][GFX1-]: More than 2 GPUs detected via PCI, secondary GPU is arbitrary (t=0.59869)
[GFX1-]: More than 2 GPUs detected via PCI, secondary GPU is arbitrary
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.598618) |[1][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.598662) |[2][GFX1-]: More than 2 GPUs detected via PCI, secondary GPU is arbitrary (t=0.59869) |[3][GFX1-]: Failed GL context creation for WebRender: 0 (t=1.07011)
[GFX1-]: Failed GL context creation for WebRender: 0
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.598618) |[1][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.598662) |[2][GFX1-]: More than 2 GPUs detected via PCI, secondary GPU is arbitrary (t=0.59869) |[3][GFX1-]: Failed GL context creation for WebRender: 0 (t=1.07011) |[4][GFX1-]: FEATURE_FAILURE_WEBRENDER_INITIALIZE_UNSPECIFIED (t=1.07014)
[GFX1-]: FEATURE_FAILURE_WEBRENDER_INITIALIZE_UNSPECIFIED
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.598618) |[1][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.598662) |[2][GFX1-]: More than 2 GPUs detected via PCI, secondary GPU is arbitrary (t=0.59869) |[3][GFX1-]: Failed GL context creation for WebRender: 0 (t=1.07011) |[4][GFX1-]: FEATURE_FAILURE_WEBRENDER_INITIALIZE_UNSPECIFIED (t=1.07014) |[5][GFX1-]: Failed to connect WebRenderBridgeChild. isParent=true (t=1.0702)
[GFX1-]: Failed to connect WebRenderBridgeChild. isParent=true
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.598618) |[1][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.598662) |[2][GFX1-]: More than 2 GPUs detected via PCI, secondary GPU is arbitrary (t=0.59869) |[3][GFX1-]: Failed GL context creation for WebRender: 0 (t=1.07011) |[4][GFX1-]: FEATURE_FAILURE_WEBRENDER_INITIALIZE_UNSPECIFIED (t=1.07014) |[5][GFX1-]: Failed to connect WebRenderBridgeChild. isParent=true (t=1.0702) |[6][GFX1-]: Fallback WR to SW-WR (t=1.0703)
[GFX1-]: Fallback WR to SW-WR

I'm running on a node with 4xNvidia A100 GPUs in it.

dcommander commented 1 year ago

Please confirm which build you are using. There is no "3.1 release" yet. I assume you are using the latest 3.1 post-beta pre-release build from https://virtualgl.org/DeveloperInfo/PreReleases? I have tested this on a multi-GPU system, so that isn't the problem.

Most of those errors are expected and should be innocuous with Firefox v91. (Refer to https://github.com/VirtualGL/virtualgl/issues/228#issuecomment-1458759288.) Firefox v80-v106 can't use EGL/X11 with nVidia GPUs, because it looks for the EGL_MESA_query_driver extension, which isn't available in nVidia's EGL stack. That's why it prints all of those errors about "libEGL missing eglGetDisplayDriverName." However, because of https://github.com/VirtualGL/virtualgl/commit/7a2d5c7d6312893c018d65f497ad3c3b6d9fb108, it should now gracefully fall back to using GLX, as it did with VirtualGL 3.0.2. I think the only relevant error is "Failed GL context creation for WebRender: 0". In my testing with various versions of Firefox, I never saw that error.
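
Because each "Crash Annotation" line repeats every earlier message, the one relevant error is easy to miss. The sketch below is one possible way to condense such a log. It keeps only the standalone `[GFX1-]:` lines, drops the libEGL message described above as expected, and removes duplicates. The function name `gfx_summary` is hypothetical, and the sample input is modeled on lines quoted in this thread, assuming one message per line.

```shell
#!/bin/sh
# Hypothetical helper: condense Firefox graphics messages read from stdin.
gfx_summary() {
    # 1. Keep only standalone [GFX1-] lines (skips cumulative annotations).
    # 2. Drop the expected eglGetDisplayDriverName message.
    # 3. Print each remaining message once, preserving order.
    grep '^\[GFX1-]:' \
        | grep -v 'libEGL missing eglGetDisplayDriverName' \
        | awk '!seen[$0]++'
}

# Sample input modeled on the log lines quoted in this thread.
gfx_summary <<'EOF'
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.598618)
[GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName
[GFX1-]: More than 2 GPUs detected via PCI, secondary GPU is arbitrary
[GFX1-]: Failed GL context creation for WebRender: 0
[GFX1-]: Failed GL context creation for WebRender: 0
EOF
```

In practice the function would be fed from something like `vglrun +v firefox 2>&1 | gfx_summary`, assuming Firefox writes these messages to stderr or stdout.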

I notice one thing that's a bit curious, though. When you ran vglrun +v glxgears, VirtualGL printed Opening EGL device /dev/dri/card1, implying that VGL_DISPLAY was set to /dev/dri/card1. However, it didn't print that message when you ran vglrun +v firefox. If your machine is set up only to use the EGL back end, then that could explain the failure. Make sure that VGL_DISPLAY is properly set before running Firefox with VirtualGL, or explicitly pass -d /dev/dri/card1 to vglrun.
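
A minimal sketch of that suggestion follows. It builds the vglrun command line with an explicit -d argument so that the device does not depend on the environment being inherited correctly. The helper names `vgl_device` and `vgl_firefox_cmd` are hypothetical, and the fallback to /dev/dri/card1 is an assumption taken from the device shown in this report. The command is only printed, not executed, so it can be inspected first.

```shell
#!/bin/sh
# Hypothetical helper: print the device vglrun should use. Falls back to
# /dev/dri/card1 (the device from this report) when VGL_DISPLAY is unset
# or empty.
vgl_device() {
    printf '%s\n' "${VGL_DISPLAY:-/dev/dri/card1}"
}

# Hypothetical helper: print (not run) the explicit vglrun command line.
vgl_firefox_cmd() {
    printf 'vglrun -d %s +v firefox\n' "$(vgl_device)"
}

vgl_firefox_cmd
```

If the printed command shows the expected device, it can be run as-is, and the +v output should then include the "Opening EGL device" line.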

richardnixonshead commented 1 year ago

[scrosby@thespian-gpgpu002 ~]$ rpm -qa VirtualGL
VirtualGL-3.0.91-20230312.x86_64

[scrosby@thespian-gpgpu002 ~]$ vglrun -c proxy +v glxgears
[VGL] NOTICE: Automatically setting VGL_CLIENT environment variable to
[VGL] 127.0.0.1, the IP address of your SSH client.
[VGL] Shared memory segment ID for vglconfig: 32824
[VGL] VirtualGL v3.0.91 64-bit (Build 20230312)
[VGL] Opening EGL device /dev/dri/card1
[VGL] Using pixel buffer objects for readback (BGR --> BGRA)
[VGL] ERROR: in readback--
[VGL] 288: Window has been deleted by window manager
[VGL] Shared memory segment ID for vglconfig: 32828

[scrosby@thespian-gpgpu002 ~]$ vglrun -c proxy +v firefox
[VGL] NOTICE: Automatically setting VGL_CLIENT environment variable to
[VGL] 127.0.0.1, the IP address of your SSH client.
[VGL] Shared memory segment ID for vglconfig: 32829
[VGL] VirtualGL v3.0.91 64-bit (Build 20230312)
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.497049)
[GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.497049) |[1][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.497089)
[GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.497049) |[1][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.497089) |[2][GFX1-]: More than 2 GPUs detected via PCI, secondary GPU is arbitrary (t=0.497111)
[GFX1-]: More than 2 GPUs detected via PCI, secondary GPU is arbitrary
[VGL] NOTICE: Replacing dlopen("libGL.so.1") with dlopen("libvglfaker.so")
[VGL] Opening EGL device /dev/dri/card1
[VGL] Using pixel buffer objects for readback (BGRA --> BGRA)
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.497049) |[1][GFX1-]: glxtest: libEGL missing eglGetDisplayDriverName (t=0.497089) |[2][GFX1-]: More than 2 GPUs detected via PCI, secondary GPU is arbitrary (t=0.497111) |[3][GFX1-]: Unrecognized feature ACCELERATED_CANVAS2D (t=1.76241)
[GFX1-]: Unrecognized feature ACCELERATED_CANVAS2D

I've attached what it looks like currently with the new version

[screenshot: firefox]

dcommander commented 1 year ago

Something else is odd. vglrun seems to be automatically setting VGL_CLIENT to 127.0.0.1, which it wouldn't do unless you were in an SSH session. Why would there be an SSH session from 127.0.0.1? That message about VGL_CLIENT wasn't in your original report.

dcommander commented 1 year ago

Also, you never explained how the EGL back end is enabled. Are you explicitly setting VGL_DISPLAY?

richardnixonshead commented 1 year ago

The nodes are part of a Slurm HPC cluster. We set the VGL_DISPLAY variable in jobs so that users can use the EGL backend if required.

# This sets VGL_DISPLAY if running on a GPU node
if [ -x /usr/bin/nvidia-smi ] && [ "z${CUDA_VISIBLE_DEVICES}z" != "zz" ]; then
    # Get first GPU available to user
    # Strip first 4 characters and convert to lower case
    GPUID=$(/usr/bin/nvidia-smi --query-gpu=gpu_bus_id --format=csv,noheader | head -n1 | sed 's/^....//' | tr '[:upper:]' '[:lower:]')
    DRIDEVICE=""
    if [ -d /sys/bus/pci/devices/${GPUID}/drm ]; then
        DRIDEVICE=$(basename /sys/bus/pci/devices/${GPUID}/drm/card*)
    fi
    if [ "z${DRIDEVICE}z" != "zz" ]; then
        echo export VGL_DISPLAY="/dev/dri/${DRIDEVICE}"
    fi
fi
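
To make the bus ID transformation in the script above concrete, here is a small worked example using the same sed and tr pipeline. The sample bus ID is a made-up value in the 8-digit-domain format that nvidia-smi prints, and `to_sysfs_pci_id` is a hypothetical name. Stripping four characters leaves the 4-digit domain form used under /sys/bus/pci/devices.

```shell
#!/bin/sh
# Hypothetical sample value in the format printed by
# nvidia-smi --query-gpu=gpu_bus_id --format=csv,noheader
sample_bus_id="00000000:3B:00.0"

# Same pipeline as the script above: drop the first 4 characters,
# then convert to lower case to match sysfs naming.
to_sysfs_pci_id() {
    printf '%s\n' "$1" | sed 's/^....//' | tr '[:upper:]' '[:lower:]'
}

to_sysfs_pci_id "$sample_bus_id"   # prints 0000:3b:00.0
```

The result is the directory name the script then looks up under /sys/bus/pci/devices to find the matching card* entry.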

dcommander commented 1 year ago

To clarify, does MOZ_DISABLE_CONTENT_SANDBOX=1 still work around the issue from your point of view?

richardnixonshead commented 1 year ago

Yep. Still working with that env variable.

Tomorrow morning I'll try a different VNC server (currently using TurboVNC). I'm happy if you want to just release the new VirtualGL without this issue. EGL for everything but Firefox seems to be working great for us - MATLAB, Paraview etc are all fine.

dcommander commented 1 year ago

If it works with that environment variable, then it is just a documentation issue. I have already observed and documented that setting the same environment variable is necessary when using the GLX back end with Firefox < v94. I just need to document that it is necessary with the EGL back end as well. Even though I don't observe that on my configuration, setting the environment variable does no harm.
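
A launcher could apply this rule automatically. The sketch below assumes that `firefox --version` prints a string such as "Mozilla Firefox 91.12.0" and that the workaround applies to every major version below 94, regardless of back end, as concluded above. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: extract the major version from a string such as
# "Mozilla Firefox 91.12.0".
firefox_major() {
    printf '%s\n' "$1" | sed -n 's/^.*Firefox \([0-9][0-9]*\)\..*$/\1/p'
}

# Hypothetical helper: succeeds (exit status 0) when the version string
# names a Firefox release older than v94.
needs_sandbox_workaround() {
    major=$(firefox_major "$1")
    [ -n "$major" ] && [ "$major" -lt 94 ]
}

if needs_sandbox_workaround "Mozilla Firefox 91.12.0"; then
    echo "export MOZ_DISABLE_CONTENT_SANDBOX=1"
fi
```

In a real wrapper, the hard-coded string would be replaced with `"$(firefox --version)"`, and the export would be followed by the vglrun invocation.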

dcommander commented 1 year ago

Note also: Firefox v107 and later are much better behaved in general. You should be able to install the official binaries for v110 on RHEL 7 and use that instead of v91.

dcommander commented 1 year ago

Some googling suggests that this issue might be due to Firefox looking for a 32-bit-depth visual and failing to find one, which may point to a logic issue with the GLXFBConfig-to-visual mapping in the EGL back end. That would also explain why the issue is only reproducible on a Tesla and not a Quadro.

Can you post the output of the following commands, all run from inside the TurboVNC session?

vglrun /opt/VirtualGL/bin/glxinfo
vglrun /opt/VirtualGL/bin/glxinfo -c
vglrun /opt/VirtualGL/bin/eglxinfo

dcommander commented 1 year ago

Closing this issue as Documented, but please get back to me with the glxinfo/eglxinfo output. I am mainly interested in pursuing that angle in case it reveals an oversight in the EGL back end that may affect other applications.