Closed tim-rex closed 6 months ago
I've narrowed this down to the x11 platform report specifically.
Running eglinfo -p wayland executes without error, as does every other platform (android, gbm, surfaceless).
The issue occurs when specifying x11 or when not specifying any platform at all.
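A per-platform check like the one described above can be scripted. The following is a minimal sketch, assuming eglinfo is on the PATH and accepts the -p values named in this thread; the helper names run_and_classify and probe_platforms are hypothetical and not part of eglinfo or EGL.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] with the given arguments and report how it ended.
 * Returns the exit status (0..255) on a normal exit, the negated signal
 * number (for example -6 for SIGABRT on Linux) if the child was killed
 * by a signal, and -1000 if the child could not be forked or waited for. */
int run_and_classify(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);
    fflush(stdout);               /* keep parent output ahead of the child's */
    pid_t pid = fork();
    if (pid < 0)
        return -1000;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);               /* only reached if exec failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1000;
    if (WIFSIGNALED(status))
        return -WTERMSIG(status);
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    return -1000;
}

/* Probe each platform mentioned in this thread with `eglinfo -p <platform>`. */
void probe_platforms(void)
{
    const char *platforms[] = { "x11", "wayland", "gbm", "android", "surfaceless" };
    for (size_t i = 0; i < sizeof platforms / sizeof platforms[0]; i++) {
        char *argv[] = { "eglinfo", "-p", (char *)platforms[i], NULL };
        int r = run_and_classify(argv);
        if (r < 0 && r != -1000)
            printf("%-12s killed by signal %d\n", platforms[i], -r);
        else
            printf("%-12s exit status %d\n", platforms[i], r);
    }
}
```

Calling probe_platforms() from a small main() prints one line per platform. A "killed by signal" line marks the platforms worth a closer look in gdb.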
There is a known issue in the NVIDIA driver where trying to use EGL_KHR_platform_x11 and EGL_EXT_platform_device (or equivalent) in the same process causes it to trip over itself. Without any arguments, it looks like eglinfo would try to do exactly that.
But if I'm reading the source code right (which admittedly is not a given), then if you run eglinfo -p x11, it looks like it would restrict itself to only using EGL_KHR_platform_x11.
I have some additional data points for this issue. It seems to only occur when I have a dual GPU configuration under Wayland.
Specifically:
The issue does not present when I have amdgpu blacklisted. It also does not present when using nouveau (with or without amdgpu).
It seems to occur only with nvidia + amdgpu under GNOME Wayland. I've since changed my setup a little to try to mitigate some of these issues, and the seg fault has now changed slightly.
The issue originally reported only occurred when querying for X11, but now it only occurs when querying for GBM support.
While pulling logs to demonstrate the primary device, I just noticed that there is an issue with GBM surface creation during GNOME startup:
gnome-shell[1890]: Running GNOME Shell (using mutter 45.1) as a Wayland display server
gnome-shell[1890]: Made thread 'KMS thread' realtime scheduled
gnome-shell[1890]: Device '/dev/dri/card1' prefers shadow buffer
gnome-shell[1890]: Added device '/dev/dri/card1' (nvidia-drm) using atomic mode setting.
gnome-shell[1890]: Device '/dev/dri/card0' prefers shadow buffer
gnome-shell[1890]: Added device '/dev/dri/card0' (amdgpu) using atomic mode setting.
gnome-shell[1890]: Created gbm renderer for '/dev/dri/card1'
gnome-shell[1890]: Created gbm renderer for '/dev/dri/card0'
gnome-shell[1890]: GPU /dev/dri/card0 selected primary given udev rule
gnome-shell[1890]: Obtained a high priority EGL context
gnome-shell[1890]: Obtained a high priority EGL context
gnome-shell[1890]: Secondary GPU initialization failed (Failed to create gbm_surface: Function not implemented). Falling back to GPU-less mode instead, so the secondary monitor may be slow to update.
Note: I have a udev rule to prefer amdgpu as the primary interface, as this is the only way I can get both devices to coexist peacefully. If I prefer nvidia as the primary interface I only get a usable display through the nvidia card, but eglinfo still produces the seg fault when querying for GBM support.
Finally, I can run eglinfo without seg faulting if I set __EGL_VENDOR_LIBRARY_FILENAMES to the Mesa vendor file (as I've seen suggested for other issues), but other issues then present (not surprisingly, I expect).
Probably unrelated: I wonder if the failure to create a gbm surface would explain some of the other issues I'm having when targeting the RTX 960 (maybe? I don't know much about gbm).
Failed to import supplied dmabufs: Arguments are inconsistent (for example, a valid context requires buffers not supplied by a
Running with __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/50_mesa.json reports libEGL debug: EGL user error 0x3004 (EGL_BAD_ATTRIBUTE) in eglGetPlatformDisplay
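For anyone who wants to reproduce the Mesa-only run from a test harness rather than a shell, here is a minimal sketch. It assumes the ICD path quoted above exists on the system; the helper names select_egl_vendor and exec_eglinfo_with_mesa are hypothetical. The variable has to be set before libEGL is first loaded, which is why the sketch sets it and then execs a fresh eglinfo.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Point libglvnd at a single vendor ICD file for this process and any
 * process it later execs. Returns 0 on success, -1 on failure. */
int select_egl_vendor(const char *icd_json_path)
{
    assert(icd_json_path != NULL);
    return setenv("__EGL_VENDOR_LIBRARY_FILENAMES", icd_json_path, 1);
}

/* Replace the current process with eglinfo, restricted to the Mesa ICD
 * path quoted in this thread. Only returns (with -1) if something failed. */
int exec_eglinfo_with_mesa(void)
{
    if (select_egl_vendor("/usr/share/glvnd/egl_vendor.d/50_mesa.json") != 0)
        return -1;
    char *argv[] = { "eglinfo", NULL };
    execvp(argv[0], argv);
    return -1;
}
```

This only changes which vendor library is loaded. It does not address the EGL_BAD_ATTRIBUTE error reported above.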
I'm not certain what I did to cause the behaviour to change (between failing for gbm vs x11); I've not been able to narrow that down yet.
I've confirmed my egl-wayland is v1.1.13
I've attached the output of inxi if that's helpful. inxi.txt
Here's the list of modules reported when the core dumps
Module libxshmfence.so.1 from rpm libxshmfence-1.3-13.fc39.x86_64
Module libxcb-sync.so.1 from rpm libxcb-1.13.1-12.fc39.x86_64
Module libxcb-present.so.0 from rpm libxcb-1.13.1-12.fc39.x86_64
Module libxcb-dri3.so.0 from rpm libxcb-1.13.1-12.fc39.x86_64
Module libxcb-xfixes.so.0 from rpm libxcb-1.13.1-12.fc39.x86_64
Module libxcb-dri2.so.0 from rpm libxcb-1.13.1-12.fc39.x86_64
Module libX11-xcb.so.1 from rpm libX11-1.8.7-1.fc39.x86_64
Module libglapi.so.0 from rpm mesa-23.2.1-2.fc39.x86_64
Module libEGL_mesa.so.0 from rpm mesa-23.2.1-2.fc39.x86_64
Module libXau.so.6 from rpm libXau-1.0.11-3.fc39.x86_64
Module libxcb.so.1 from rpm libxcb-1.13.1-12.fc39.x86_64
Module libxcb-randr.so.0 from rpm libxcb-1.13.1-12.fc39.x86_64
Module libexpat.so.1 from rpm expat-2.5.0-3.fc39.x86_64
Module libgbm.so.1 from rpm mesa-23.2.1-2.fc39.x86_64
Module libnvidia-egl-gbm.so.1 from rpm egl-gbm-1.1.0-5.fc39.x86_64
Module libffi.so.8 from rpm libffi-3.4.4-4.fc39.x86_64
Module libdrm.so.2 from rpm libdrm-2.4.117-1.fc39.x86_64
Module libwayland-client.so.0 from rpm wayland-1.22.0-2.fc39.x86_64
Module libwayland-server.so.0 from rpm wayland-1.22.0-2.fc39.x86_64
Module libnvidia-egl-wayland.so.1 from rpm egl-wayland-1.1.13-1.fc39.x86_64
Module libGLdispatch.so.0 from rpm libglvnd-1.7.0-1.fc39.x86_64
Module libEGL.so.1 from rpm libglvnd-1.7.0-1.fc39.x86_64
Module eglinfo from rpm mesa-demos-9.0.0-3.fc39.x86_64
Running eglinfo in a debug session, here's the stack frame when it bails
(gdb) bt
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007ffff7e438a3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2 0x00007ffff7df18ee in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007ffff7dd98ff in __GI_abort () at abort.c:79
#4 0x00007ffff7dda7d0 in __libc_message (fmt=fmt@entry=0x7ffff7f5756a "%s\n") at ../sysdeps/posix/libc_fatal.c:150
#5 0x00007ffff7e4d795 in malloc_printerr (str=str@entry=0x7ffff7f54fa7 "corrupted size vs. prev_size") at malloc.c:5765
#6 0x00007ffff7e4e1b6 in unlink_chunk (p=<optimized out>, av=<optimized out>) at malloc.c:1610
#7 0x00007ffff7e4e343 in malloc_consolidate (av=0x7ffff7f8bac0 <main_arena>) at malloc.c:4869
#8 0x00007ffff7e4f4c5 in _int_free_maybe_consolidate (av=0x7ffff7f8bac0 <main_arena>, size=<optimized out>) at malloc.c:4772
#9 0x00007ffff7e4f7ee in _int_free_maybe_consolidate (size=<optimized out>, av=<optimized out>) at malloc.c:4695
#10 0x00007ffff7e4f9da in _int_free (av=<optimized out>, p=p@entry=0x5555557fa600, have_lock=<optimized out>, have_lock@entry=0) at malloc.c:4639
#11 0x00007ffff7e523ce in __GI___libc_free (mem=0x5555557fa610) at malloc.c:3391
#12 0x00007ffff5d330d6 in ?? () from /lib64/libnvidia-eglcore.so.535.129.03
#13 0x00007ffff5ca795a in ?? () from /lib64/libnvidia-eglcore.so.535.129.03
#14 0x00007ffff5a3b1dd in ?? () from /lib64/libnvidia-eglcore.so.535.129.03
#15 0x00007ffff5d6200b in ?? () from /lib64/libnvidia-eglcore.so.535.129.03
#16 0x00007ffff5d4135d in ?? () from /lib64/libnvidia-eglcore.so.535.129.03
#17 0x00007ffff7841f21 in ?? () from /lib64/libEGL_nvidia.so.0
#18 0x00007ffff7841f69 in ?? () from /lib64/libEGL_nvidia.so.0
#19 0x00007ffff7834caa in ?? () from /lib64/libEGL_nvidia.so.0
#20 0x00007ffff784329c in ?? () from /lib64/libEGL_nvidia.so.0
#21 0x00007ffff7848976 in ?? () from /lib64/libEGL_nvidia.so.0
#22 0x0000555555560543 in doOneDisplay (d=0x555555637330, name=<optimized out>, opts=...) at ../src/egl/opengl/eglinfo.c:616
#23 0x0000555555558720 in main (argc=<optimized out>, argv=<optimized out>) at ../src/egl/opengl/eglinfo.c:850
The call site for frame 22 is line 616 in /usr/src/debug/mesa-demos-9.0.0-3.fc39.x86_64/redhat-linux-build/../src/egl/opengl/eglinfo.c
614 if (ctx) {
615 if (doOneContext(d, ctx, "OpenGL ES profile", version, opts) == 0)
616 if (!eglDestroyContext(d, ctx))
617 return 1;
618 }
I'm able to isolate a different crash if I call eglinfo -p gbm -a gl (or by specifying any other API via the -a switch).
This execution is able to create an EGL context (core profile 4.6).
The crash occurs upon eglTerminate() triggering eGbmTerminateHook().
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007ffff7e438a3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2 0x00007ffff7df18ee in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007ffff7dd98ff in __GI_abort () at abort.c:79
#4 0x00007ffff7dda7d0 in __libc_message (fmt=fmt@entry=0x7ffff7f5756a "%s\n") at ../sysdeps/posix/libc_fatal.c:150
#5 0x00007ffff7e4d795 in malloc_printerr (str=str@entry=0x7ffff7f55043 "free(): invalid size") at malloc.c:5765
#6 0x00007ffff7e4fa9c in _int_free (av=<optimized out>, p=p@entry=0x516de0, have_lock=have_lock@entry=0) at malloc.c:4504
#7 0x00007ffff7e523ce in __GI___libc_free (mem=0x516df0) at malloc.c:3391
#8 0x00007ffff78458b4 in ?? () from /lib64/libEGL_nvidia.so.0
#9 0x00007ffff7834c4c in ?? () from /lib64/libEGL_nvidia.so.0
#10 0x00007ffff78346ba in ?? () from /lib64/libEGL_nvidia.so.0
#11 0x00007ffff78348c5 in ?? () from /lib64/libEGL_nvidia.so.0
#12 0x00007ffff784916b in ?? () from /lib64/libEGL_nvidia.so.0
#13 0x00007ffff7fbbb26 in eGbmTerminateHook (dpy=<optimized out>) at ../src/gbm-display.c:293
#14 0x00007ffff78abc40 in ?? () from /lib64/libEGL_nvidia.so.0
#15 0x00007ffff7849147 in ?? () from /lib64/libEGL_nvidia.so.0
#16 0x000000000040371e in doOneDisplay (d=0x501390, name=0x430139 "GBM", opts=...) at ../src/egl/opengl/eglinfo.c:681
#17 0x0000000000403f39 in main (argc=5, argv=0x7fffffffdd08) at ../src/egl/opengl/eglinfo.c:902
Ignore the eglinfo.c line numbers; I built this locally with additional debugging code.
So, to summarise: there are two sources of crash.
eglinfo -p gbm appears to first create an EGL_OPENGL_API context, which succeeds. It then destroys that context and proceeds to create a subsequent EGL_OPENGL_ES_API context, which crashes during eglCreateContext.
eglinfo -p gbm -a glcore also succeeds in creating and using the context, but fails during eglTerminate(), somewhere behind eGbmTerminateHook, with the stack trace reported in this comment.
Please advise if there are any further details I can provide.
Just for fun: while trying to get a minimal repro, I'm seeing the crash behaviour change depending on whether or not I have called eglQueryString(EGL_NO_DISPLAY, EGL_EXTENSIONS); ahead of eglGetPlatformDisplayEXT().
If I query for EGL_EXTENSIONS ahead of eglGetPlatformDisplay, I can see the relevant nVidia modules are loaded
Downloading separate debug info for /lib64/libEGL_nvidia.so.0
Downloading separate debug info for /lib64/libnvidia-glsi.so.535.129.03
Downloading separate debug info for /lib64/libnvidia-eglcore.so.535.129.03
Downloading separate debug info for /lib64/libnvidia-glvkspirv.so.535.129.03
eglGetPlatformDisplayEXT has address f7f9c640
Downloading separate debug info for /usr/lib64/gbm/nvidia-drm_gbm.so
The context will be created successfully. Exiting the process immediately after context creation will produce free(): invalid size
If I instead choose not to query for EGL_EXTENSIONS, this is where eglCreateContext encounters corrupted size vs. prev_size
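The client-extension query mentioned above returns a single space-separated string, so a minimal repro usually needs a small helper to test it before picking a platform. The sketch below is one way to do that in plain C. The name has_extension is hypothetical, and the list is assumed to come from eglQueryString(EGL_NO_DISPLAY, EGL_EXTENSIONS). It matches whole tokens, because a plain substring search can report a false hit when one extension name is a prefix of another.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if `name` appears as a whole token in the space-separated
 * extension list `list`, otherwise 0. A NULL list (for example when the
 * query itself failed) is treated as an empty list. */
int has_extension(const char *list, const char *name)
{
    assert(name != NULL && name[0] != '\0');
    if (list == NULL)
        return 0;

    size_t len = strlen(name);
    for (const char *p = list; (p = strstr(p, name)) != NULL; p += len) {
        /* A real match starts at the list start or after a space ... */
        int starts_token = (p == list) || (p[-1] == ' ');
        /* ... and ends at a space or at the end of the list. */
        int ends_token = (p[len] == ' ') || (p[len] == '\0');
        if (starts_token && ends_token)
            return 1;
    }
    return 0;
}
```

In a repro, one would call it as has_extension(exts, "EGL_KHR_platform_x11") or has_extension(exts, "EGL_EXT_platform_device") before calling eglGetPlatformDisplayEXT() for that platform.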
No more updates on this now. Apologies for the noisy thread.
This seems related.
Also, off-topic, but that eglinfo crash did end up being a silly bug in the NVIDIA driver. It will be fixed in the next release.
Yeah, that should be fixed in the latest driver. Are you installing libnvidia-egl-gbm.so from the main driver package or does your distro package it separately from the GitHub mirror https://github.com/NVIDIA/egl-gbm?
Someone else actually opened a PR with the exact same fix on the public copy https://github.com/NVIDIA/egl-gbm/pull/3
It appears to be packaged separately, direct from the GitHub mirror.
/usr/lib64/libnvidia-egl-gbm.so.1 is provided by egl-gbm-1.1.0-5.fc39.x86_64
Name : egl-gbm
Version : 1.1.0
Release : 5.fc39
Architecture : x86_64
Size : 33 k
Source : egl-gbm-1.1.0-5.fc39.src.rpm
Repository : @System
From repo : fedora-modular
Summary : Nvidia egl gbm libary
URL : https://github.com/NVIDIA/egl-gbm
License : MIT
Description : Nvidia egl gbm libary
ldd shows it to link against the following
linux-vdso.so.1 (0x00007ffd0316f000)
libgbm.so.1 => /lib64/libgbm.so.1 (0x00007f714a40b000)
libdrm.so.2 => /lib64/libdrm.so.2 (0x00007f714a3f4000)
libc.so.6 => /lib64/libc.so.6 (0x00007f714a212000)
libwayland-server.so.0 => /lib64/libwayland-server.so.0 (0x00007f714a1fb000)
libexpat.so.1 => /lib64/libexpat.so.1 (0x00007f714a1d0000)
libxcb-randr.so.0 => /lib64/libxcb-randr.so.0 (0x00007f714a1bc000)
libm.so.6 => /lib64/libm.so.6 (0x00007f714a0db000)
/lib64/ld-linux-x86-64.so.2 (0x00007f714a440000)
libffi.so.8 => /lib64/libffi.so.8 (0x00007f714a0cb000)
libxcb.so.1 => /lib64/libxcb.so.1 (0x00007f714a0a0000)
libXau.so.6 => /lib64/libXau.so.6 (0x00007f714a09a000)
I've just tried building the PR and it seems to solve the issue for both crash scenarios. eglinfo runs clean.
Thanks @erik-kz
Unfortunately this did not seem to help the probably unrelated issues mentioned in https://github.com/NVIDIA/egl-wayland/issues/93#issuecomment-1817784724
I'll keep working away on those independently and raise a separate issue when I have something more concrete/succinct.
Can confirm this appears to be resolved with v545.29.06. I think we can close this issue.
eglinfo is segfaulting
Fedora Linux 39 (Workstation Edition)
Linux 6.5.11-300.fc39.x86_64
GNOME Version 45.1
nVidia Driver version 535.129.03
Output of eglinfo attached eglinfo.txt
Coredump attached 12601.core.gz
Backtrace follows