NVIDIA / egl-wayland

The EGLStream-based Wayland external platform
MIT License

eglinfo seg fault (Fedora 39, Gnome 45.1) #93

Closed tim-rex closed 6 months ago

tim-rex commented 7 months ago

eglinfo is segfaulting

Fedora Linux 39 (Workstation Edition)
Linux 6.5.11-300.fc39.x86_64
GNOME Version 45.1
nVidia Driver version 535.129.03

Output of eglinfo attached eglinfo.txt

Coredump attached 12601.core.gz

Backtrace follows

Missing separate debuginfos, use: dnf debuginfo-install egl-gbm-1.1.0-5.fc39.x86_64 egl-wayland-1.1.13-1.fc39.x86_64 expat-2.5.0-3.fc39.x86_64 glibc-2.38-10.fc39.x86_64 libX11-xcb-1.8.7-1.fc39.x86_64 libXau-1.0.11-3.fc39.x86_64 libdrm-2.4.117-1.fc39.x86_64 libffi-3.4.4-4.fc39.x86_64 libgcc-13.2.1-4.fc39.x86_64 libglvnd-egl-1.7.0-1.fc39.x86_64 libwayland-client-1.22.0-2.fc39.x86_64 libwayland-server-1.22.0-2.fc39.x86_64 libxcb-1.13.1-12.fc39.x86_64 libxshmfence-1.3-13.fc39.x86_64 mesa-libEGL-23.2.1-2.fc39.x86_64 mesa-
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/eglinfo'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
44            return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;                                                                                       
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007fc8ea0e68a3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007fc8ea0948ee in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007fc8ea07c8ff in __GI_abort () at abort.c:79
#4  0x00007fc8ea07d7d0 in __libc_message (fmt=fmt@entry=0x7fc8ea1fa56a "%s\n") at ../sysdeps/posix/libc_fatal.c:150
#5  0x00007fc8ea0f0795 in malloc_printerr (str=str@entry=0x7fc8ea1f7fa7 "corrupted size vs. prev_size") at malloc.c:5765
#6  0x00007fc8ea0f11b6 in unlink_chunk (p=<optimized out>, av=<optimized out>) at malloc.c:1610
#7  0x00007fc8ea0f1343 in malloc_consolidate (av=0x7fc8ea22eac0 <main_arena>) at malloc.c:4869
#8  0x00007fc8ea0f24c5 in _int_free_maybe_consolidate (av=0x7fc8ea22eac0 <main_arena>, size=<optimized out>) at malloc.c:4772
#9  0x00007fc8ea0f27ee in _int_free_maybe_consolidate (size=<optimized out>, av=<optimized out>) at malloc.c:4695
#10 0x00007fc8ea0f29da in _int_free (av=<optimized out>, p=p@entry=0x55d3d86dc900, have_lock=<optimized out>, have_lock@entry=0) at malloc.c:4639
#11 0x00007fc8ea0f53ce in __GI___libc_free (mem=0x55d3d86dc910) at malloc.c:3391
#12 0x00007fc8e7f330d6 in ?? () from /lib64/libnvidia-eglcore.so.535.129.03
#13 0x00007fc8e7ea795a in ?? () from /lib64/libnvidia-eglcore.so.535.129.03
#14 0x00007fc8e7c3b1dd in ?? () from /lib64/libnvidia-eglcore.so.535.129.03
#15 0x00007fc8e7f6200b in ?? () from /lib64/libnvidia-eglcore.so.535.129.03
#16 0x00007fc8e7f4135d in ?? () from /lib64/libnvidia-eglcore.so.535.129.03
#17 0x00007fc8e9a41f21 in ?? () from /lib64/libEGL_nvidia.so.0
#18 0x00007fc8e9a41f69 in ?? () from /lib64/libEGL_nvidia.so.0
#19 0x00007fc8e9a34caa in ?? () from /lib64/libEGL_nvidia.so.0
#20 0x00007fc8e9a4329c in ?? () from /lib64/libEGL_nvidia.so.0
#21 0x00007fc8e9a48976 in ?? () from /lib64/libEGL_nvidia.so.0
#22 0x000055d3d706d543 in doOneDisplay (d=0x55d3d8519330, name=<optimized out>, opts=...) at ../src/egl/opengl/eglinfo.c:616
#23 0x000055d3d7065720 in main (argc=<optimized out>, argv=<optimized out>) at ../src/egl/opengl/eglinfo.c:850
tim-rex commented 7 months ago

I've narrowed this down to the x11 platform report specifically. Running eglinfo -p wayland executes without error, as does every other platform (android, gbm, surfaceless).

The issue occurs when specifying x11, or when not specifying any platform at all.
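
A rough way to repeat this per-platform check is to run eglinfo once per platform, each in its own process, and record how every run ends. The sketch below is illustrative only: `check_platforms` and the `EGLINFO` override are hypothetical names, not part of mesa-demos, and it assumes `eglinfo -p <platform>` behaves as described above.

```shell
# Hypothetical helper: run eglinfo once per platform and summarise each result.
# EGLINFO is an assumed override for this sketch; it defaults to plain "eglinfo".
check_platforms() {
    local cmd="${EGLINFO:-eglinfo}"
    local p status
    for p in "$@"; do
        # Capture the exit status without tripping "set -e" in a calling script.
        if "$cmd" -p "$p" >/dev/null 2>&1; then
            status=0
        else
            status=$?
        fi
        if [ "$status" -eq 0 ]; then
            echo "$p: ok"
        elif [ "$status" -gt 128 ]; then
            # The shell reports death by signal N as exit status 128 + N.
            echo "$p: killed by signal $((status - 128))"
        else
            echo "$p: exit $status"
        fi
    done
}
```

For example, `check_platforms x11 wayland gbm surfaceless android`. A status of 134 corresponds to signal 6 (SIGABRT), which is what the backtrace above shows.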

kbrenneman commented 7 months ago

There is a known issue in the NVIDIA driver where trying to use EGL_KHR_platform_x11 and EGL_EXT_platform_device (or equivalent) in the same process causes it to trip over itself. Without any arguments, it looks like eglinfo would try to do exactly that.

But if I'm reading the source code right (which admittedly is not a given), then if you run eglinfo -p x11, it looks like it would restrict itself to only using EGL_KHR_platform_x11.

tim-rex commented 7 months ago

I have some additional data points for this issue. It seems to only occur when I have a dual GPU configuration under Wayland.

Specifically..

The issue does not present when I have amdgpu blacklisted. It also does not present when using nouveau (with or without amdgpu).

It seems to occur only with nvidia + amdgpu under Gnome Wayland.. I've since changed my setup a little to try and mitigate some of these issues, and the seg fault has now changed slightly..

The issue originally reported only occurred when querying for X11, but now it only occurs when querying for GBM support.

I just noticed now while pulling logs to demonstrate the primary device, there is an issue with GBM surface creation during Gnome startup.

gnome-shell[1890]: Running GNOME Shell (using mutter 45.1) as a Wayland display server
gnome-shell[1890]: Made thread 'KMS thread' realtime scheduled
gnome-shell[1890]: Device '/dev/dri/card1' prefers shadow buffer
gnome-shell[1890]: Added device '/dev/dri/card1' (nvidia-drm) using atomic mode setting.
gnome-shell[1890]: Device '/dev/dri/card0' prefers shadow buffer
gnome-shell[1890]: Added device '/dev/dri/card0' (amdgpu) using atomic mode setting.
gnome-shell[1890]: Created gbm renderer for '/dev/dri/card1'
gnome-shell[1890]: Created gbm renderer for '/dev/dri/card0'
gnome-shell[1890]: GPU /dev/dri/card0 selected primary given udev rule
gnome-shell[1890]: Obtained a high priority EGL context
gnome-shell[1890]: Obtained a high priority EGL context
gnome-shell[1890]: Secondary GPU initialization failed (Failed to create gbm_surface: Function not implemented). Falling back to GPU-less mode instead, so the secondary monitor may be slow to update.

Note: I have a udev rule to prefer amdgpu as the primary interface, as this is the only way I can get both devices to coexist peacefully. If I prefer nvidia as the primary interface I only get a usable display through the nvidia card, but eglinfo still produces the seg fault when querying for GBM support.

Finally.. I can run eglinfo without seg faulting if I set __EGL_VENDOR_LIBRARY_FILENAMES to mesa (as I've seen suggested for other issues).. but other issues present (not surprisingly, I expect).
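
For reference, libglvnd documents this variable as `__EGL_VENDOR_LIBRARY_FILENAMES`: a colon-separated list of vendor JSON files that replaces the normal search directories. A minimal sketch of the Mesa-only run follows. The helper name `with_mesa_egl`, the `MESA_EGL_ICD` override, and the default JSON path are assumptions for illustration (the path is where Fedora's Mesa packages typically install the file), so adjust them for your system.

```shell
# Hypothetical helper: run a command with libglvnd restricted to one EGL vendor file.
# The default path is an assumption; check /usr/share/glvnd/egl_vendor.d/ on your system.
with_mesa_egl() {
    local icd="${MESA_EGL_ICD:-/usr/share/glvnd/egl_vendor.d/50_mesa.json}"
    if [ ! -r "$icd" ]; then
        echo "vendor file not readable: $icd" >&2
        return 1
    fi
    # The variable is set only for this one command, not for the whole shell.
    __EGL_VENDOR_LIBRARY_FILENAMES="$icd" "$@"
}
```

Usage would look like `with_mesa_egl eglinfo -p gbm`. Scoping the variable to a single command keeps the rest of the session on the default vendor selection.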

Probably unrelated, but I wonder if the failure to create a gbm surface would explain some of the other issues I'm having when targeting the RTX 960 (Maybe? I don't know much about gbm)..

As for what I did to cause the behaviour to change (between failing for gbm vs x11) I'm not certain, I've not been able to narrow that down yet.

I've confirmed my egl-wayland is v1.1.13

I've attached the output of inxi if that's helpful. inxi.txt

Here's the list of modules reported when the core dumps

Module libxshmfence.so.1 from rpm libxshmfence-1.3-13.fc39.x86_64
Module libxcb-sync.so.1 from rpm libxcb-1.13.1-12.fc39.x86_64
Module libxcb-present.so.0 from rpm libxcb-1.13.1-12.fc39.x86_64
Module libxcb-dri3.so.0 from rpm libxcb-1.13.1-12.fc39.x86_64
Module libxcb-xfixes.so.0 from rpm libxcb-1.13.1-12.fc39.x86_64
Module libxcb-dri2.so.0 from rpm libxcb-1.13.1-12.fc39.x86_64
Module libX11-xcb.so.1 from rpm libX11-1.8.7-1.fc39.x86_64
Module libglapi.so.0 from rpm mesa-23.2.1-2.fc39.x86_64
Module libEGL_mesa.so.0 from rpm mesa-23.2.1-2.fc39.x86_64
Module libXau.so.6 from rpm libXau-1.0.11-3.fc39.x86_64
Module libxcb.so.1 from rpm libxcb-1.13.1-12.fc39.x86_64
Module libxcb-randr.so.0 from rpm libxcb-1.13.1-12.fc39.x86_64
Module libexpat.so.1 from rpm expat-2.5.0-3.fc39.x86_64
Module libgbm.so.1 from rpm mesa-23.2.1-2.fc39.x86_64
Module libnvidia-egl-gbm.so.1 from rpm egl-gbm-1.1.0-5.fc39.x86_64
Module libffi.so.8 from rpm libffi-3.4.4-4.fc39.x86_64
Module libdrm.so.2 from rpm libdrm-2.4.117-1.fc39.x86_64
Module libwayland-client.so.0 from rpm wayland-1.22.0-2.fc39.x86_64
Module libwayland-server.so.0 from rpm wayland-1.22.0-2.fc39.x86_64
Module libnvidia-egl-wayland.so.1 from rpm egl-wayland-1.1.13-1.fc39.x86_64
Module libGLdispatch.so.0 from rpm libglvnd-1.7.0-1.fc39.x86_64
Module libEGL.so.1 from rpm libglvnd-1.7.0-1.fc39.x86_64
Module eglinfo from rpm mesa-demos-9.0.0-3.fc39.x86_64
tim-rex commented 7 months ago

Running eglinfo in a debug session, here's the stack frame when it bails

(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007ffff7e438a3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007ffff7df18ee in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007ffff7dd98ff in __GI_abort () at abort.c:79
#4  0x00007ffff7dda7d0 in __libc_message (fmt=fmt@entry=0x7ffff7f5756a "%s\n") at ../sysdeps/posix/libc_fatal.c:150
#5  0x00007ffff7e4d795 in malloc_printerr (str=str@entry=0x7ffff7f54fa7 "corrupted size vs. prev_size") at malloc.c:5765
#6  0x00007ffff7e4e1b6 in unlink_chunk (p=<optimized out>, av=<optimized out>) at malloc.c:1610
#7  0x00007ffff7e4e343 in malloc_consolidate (av=0x7ffff7f8bac0 <main_arena>) at malloc.c:4869
#8  0x00007ffff7e4f4c5 in _int_free_maybe_consolidate (av=0x7ffff7f8bac0 <main_arena>, size=<optimized out>) at malloc.c:4772
#9  0x00007ffff7e4f7ee in _int_free_maybe_consolidate (size=<optimized out>, av=<optimized out>) at malloc.c:4695
#10 0x00007ffff7e4f9da in _int_free (av=<optimized out>, p=p@entry=0x5555557fa600, have_lock=<optimized out>, have_lock@entry=0) at malloc.c:4639
#11 0x00007ffff7e523ce in __GI___libc_free (mem=0x5555557fa610) at malloc.c:3391
#12 0x00007ffff5d330d6 in ?? () from /lib64/libnvidia-eglcore.so.535.129.03
#13 0x00007ffff5ca795a in ?? () from /lib64/libnvidia-eglcore.so.535.129.03
#14 0x00007ffff5a3b1dd in ?? () from /lib64/libnvidia-eglcore.so.535.129.03
#15 0x00007ffff5d6200b in ?? () from /lib64/libnvidia-eglcore.so.535.129.03
#16 0x00007ffff5d4135d in ?? () from /lib64/libnvidia-eglcore.so.535.129.03
#17 0x00007ffff7841f21 in ?? () from /lib64/libEGL_nvidia.so.0
#18 0x00007ffff7841f69 in ?? () from /lib64/libEGL_nvidia.so.0
#19 0x00007ffff7834caa in ?? () from /lib64/libEGL_nvidia.so.0
#20 0x00007ffff784329c in ?? () from /lib64/libEGL_nvidia.so.0
#21 0x00007ffff7848976 in ?? () from /lib64/libEGL_nvidia.so.0
#22 0x0000555555560543 in doOneDisplay (d=0x555555637330, name=<optimized out>, opts=...) at ../src/egl/opengl/eglinfo.c:616
#23 0x0000555555558720 in main (argc=<optimized out>, argv=<optimized out>) at ../src/egl/opengl/eglinfo.c:850

The call site for frame 22 is line 616 in /usr/src/debug/mesa-demos-9.0.0-3.fc39.x86_64/redhat-linux-build/../src/egl/opengl/eglinfo.c

614          if (ctx) {
615             if (doOneContext(d, ctx, "OpenGL ES profile", version, opts) == 0)
616                if (!eglDestroyContext(d, ctx))
617                   return 1;
618          }
tim-rex commented 7 months ago

I'm able to isolate a different crash if I call eglinfo -p gbm -a gl (or by specifying any other api via the -a switch)

This execution is able to create an EGL context (core profile 4.6). The crash occurs upon eglTerminate() triggering eGbmTerminateHook().

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007ffff7e438a3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007ffff7df18ee in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007ffff7dd98ff in __GI_abort () at abort.c:79
#4  0x00007ffff7dda7d0 in __libc_message (fmt=fmt@entry=0x7ffff7f5756a "%s\n") at ../sysdeps/posix/libc_fatal.c:150
#5  0x00007ffff7e4d795 in malloc_printerr (str=str@entry=0x7ffff7f55043 "free(): invalid size") at malloc.c:5765
#6  0x00007ffff7e4fa9c in _int_free (av=<optimized out>, p=p@entry=0x516de0, have_lock=have_lock@entry=0) at malloc.c:4504
#7  0x00007ffff7e523ce in __GI___libc_free (mem=0x516df0) at malloc.c:3391
#8  0x00007ffff78458b4 in ?? () from /lib64/libEGL_nvidia.so.0
#9  0x00007ffff7834c4c in ?? () from /lib64/libEGL_nvidia.so.0
#10 0x00007ffff78346ba in ?? () from /lib64/libEGL_nvidia.so.0
#11 0x00007ffff78348c5 in ?? () from /lib64/libEGL_nvidia.so.0
#12 0x00007ffff784916b in ?? () from /lib64/libEGL_nvidia.so.0
#13 0x00007ffff7fbbb26 in eGbmTerminateHook (dpy=<optimized out>) at ../src/gbm-display.c:293
#14 0x00007ffff78abc40 in ?? () from /lib64/libEGL_nvidia.so.0
#15 0x00007ffff7849147 in ?? () from /lib64/libEGL_nvidia.so.0
#16 0x000000000040371e in doOneDisplay (d=0x501390, name=0x430139 "GBM", opts=...) at ../src/egl/opengl/eglinfo.c:681
#17 0x0000000000403f39 in main (argc=5, argv=0x7fffffffdd08) at ../src/egl/opengl/eglinfo.c:902

Ignore the line numbers in eglinfo.c; I've built this locally myself with additional debugging code.

So, to summarise.. There are two sources of crash.

  1. The initially reported crash with eglinfo -p gbm appears to first create an EGL_OPENGL_API context, which succeeds. It then destroys that context and proceeds to create a subsequent EGL_OPENGL_ES_API context, which crashes during eglCreateContext.
  2. A different crash occurs with eglinfo -p gbm -a glcore, where it also succeeds in creating and using the context, but fails during eglTerminate() somewhere behind eGbmTerminateHook, with the stack trace reported in this comment.
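
Both cases can be captured non-interactively by running eglinfo under gdb in batch mode. This is a sketch only: `bt_of` and the `GDB` override are hypothetical names for illustration, and the eglinfo arguments are the ones quoted in the list above.

```shell
# Hypothetical helper: run a command under gdb and print a backtrace where it stops.
# GDB is an assumed override for this sketch; it defaults to plain "gdb".
bt_of() {
    # -batch exits once the commands finish; --args passes the rest to the program.
    "${GDB:-gdb}" -batch -ex run -ex bt --args "$@"
}
```

`bt_of eglinfo -p gbm` and `bt_of eglinfo -p gbm -a glcore` would then each print one backtrace. If the program exits normally, gdb reports that there is no stack instead.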

Please advise if there are any further details I can provide.

tim-rex commented 7 months ago

Just for fun.. while trying to get a minimal repro I'm seeing the crash behaviour change depending on whether or not I have called eglQueryString(EGL_NO_DISPLAY, EGL_EXTENSIONS); ahead of eglGetPlatformDisplayEXT()

If I query for EGL_EXTENSIONS ahead of eglGetPlatformDisplay, I can see the relevant nVidia modules are loaded

Downloading separate debug info for /lib64/libEGL_nvidia.so.0
Downloading separate debug info for /lib64/libnvidia-glsi.so.535.129.03                                                                                                                                                
Downloading separate debug info for /lib64/libnvidia-eglcore.so.535.129.03                                                                                                                                             
Downloading separate debug info for /lib64/libnvidia-glvkspirv.so.535.129.03                                                                                                                                           
eglGetPlatformDisplayEXT has address f7f9c640                                                                                                                                                                          
Downloading separate debug info for /usr/lib64/gbm/nvidia-drm_gbm.so

The context will be created successfully. Exiting the process immediately after context creation will produce free(): invalid size

If I instead choose not to query for EGL_EXTENSIONS, this is where eglCreateContext encounters corrupted size vs. prev_size

No more updates on this now. Apologies for the noisy thread.

tim-rex commented 7 months ago

This seems related.

Also, off-topic, but that eglinfo crash did end up being a silly bug in the NVIDIA driver. It will be fixed in the next release.

erik-kz commented 7 months ago

Yeah, that should be fixed in the latest driver. Are you installing libnvidia-egl-gbm.so from the main driver package or does your distro package it separately from the GitHub mirror https://github.com/NVIDIA/egl-gbm?

Someone else actually opened a PR with the exact same fix on the public copy https://github.com/NVIDIA/egl-gbm/pull/3

tim-rex commented 7 months ago

It appears to be packaged separately, direct from the GitHub mirror.

/usr/lib64/libnvidia-egl-gbm.so.1 provided by egl-gbm-1.1.0-5.fc39.x86_64

Name         : egl-gbm
Version      : 1.1.0
Release      : 5.fc39
Architecture : x86_64
Size         : 33 k
Source       : egl-gbm-1.1.0-5.fc39.src.rpm
Repository   : @System
From repo    : fedora-modular
Summary      : Nvidia egl gbm libary
URL          : https://github.com/NVIDIA/egl-gbm
License      : MIT
Description  : Nvidia egl gbm libary

ldd shows it to link against the following

    linux-vdso.so.1 (0x00007ffd0316f000)
    libgbm.so.1 => /lib64/libgbm.so.1 (0x00007f714a40b000)
    libdrm.so.2 => /lib64/libdrm.so.2 (0x00007f714a3f4000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f714a212000)
    libwayland-server.so.0 => /lib64/libwayland-server.so.0 (0x00007f714a1fb000)
    libexpat.so.1 => /lib64/libexpat.so.1 (0x00007f714a1d0000)
    libxcb-randr.so.0 => /lib64/libxcb-randr.so.0 (0x00007f714a1bc000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f714a0db000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f714a440000)
    libffi.so.8 => /lib64/libffi.so.8 (0x00007f714a0cb000)
    libxcb.so.1 => /lib64/libxcb.so.1 (0x00007f714a0a0000)
    libXau.so.6 => /lib64/libXau.so.6 (0x00007f714a09a000)

I've just tried building the PR and it seems to solve the issue for both crash scenarios. eglinfo runs clean

Thanks @erik-kz

tim-rex commented 7 months ago

Unfortunately this did not seem to help the probably unrelated issues mentioned in https://github.com/NVIDIA/egl-wayland/issues/93#issuecomment-1817784724

I'll keep working away on those independently and raise a separate issue when I have something more concrete/succinct.

tim-rex commented 6 months ago

Can confirm this appears to be resolved with v545.29.06. I think we can close this issue.