NVIDIA / egl-wayland

The EGLStream-based Wayland external platform
MIT License

failed to lock pthread mutex #27

Closed r3pek closed 2 years ago

r3pek commented 4 years ago

Hi!

I'm hitting a segfault while trying to open Evolution (mail client) under Wayland. I reported it on the Fedora Bugzilla [1], and since it looks like they can't track it down there, we decided to report it here.

Here's the stack trace:

(gdb) bt full
#0  0x00007ffff3a9c625 in raise () at /lib64/libc.so.6
#1  0x00007ffff3a858d9 in abort () at /lib64/libc.so.6
#2  0x00007ffff3a857a9 in _nl_load_domain.cold () at /lib64/libc.so.6
#3  0x00007ffff3a94a66 in annobin_assert.c_end () at /lib64/libc.so.6
#4  0x00007fffe4109b7d in wlExternalApiLock () at ../src/wayland-thread.c:87
        __PRETTY_FUNCTION__ = "wlExternalApiLock"
#5  0x00007fffe410e4ab in wlEglGetInternalHandleExport (dpy=0x5555566dad60, type=13233, handle=0x5555566dad60) at ../src/wayland-eglhandle.c:146
#6  0x00007fffd65574ef in  () at /lib64/libEGL_nvidia.so.0
#7  0x00007fffd64deeeb in  () at /lib64/libEGL_nvidia.so.0
#8  0x00007fffe410b752 in wl_eglstream_display_bind (data=data@entry=0x5555566cc5c0, wlDisplay=wlDisplay@entry=0x55555649b360, eglDisplay=eglDisplay@entry=0x5555566dad60)
    at ../src/wayland-eglstream-server.c:311
        wlStreamDpy = 0x555556b69f90
        exts = 0x0
        env = 0x0
#9  0x00007fffe410a355 in wlEglBindDisplaysHook (data=0x5555566cc5c0, dpy=0x5555566dad60, nativeDpy=0x55555649b360) at ../src/wayland-egldisplay.c:87
        res = 0
#10 0x00007fffd65533f3 in  () at /lib64/libEGL_nvidia.so.0
#11 0x00007fffd64db775 in  () at /lib64/libEGL_nvidia.so.0
#12 0x00007ffff20f5b11 in WS::Instance::initialize(void*) () at /lib64/libWPEBackend-fdo-1.0.so.1
#13 0x00007ffff49c7bf6 in WebKit::WebProcessPool::platformInitializeWebProcess(WebKit::WebProcessProxy const&, WebKit::WebProcessCreationParameters&) (this=this@entry=0x7fffe42ee000, process=
    ..., parameters=...) at ../Source/WebKit/UIProcess/glib/WebProcessPoolGLib.cpp:119
#14 0x00007ffff489adfa in WebKit::WebProcessPool::initializeNewWebProcess(WebKit::WebProcessProxy&, WebKit::WebsiteDataStore*, WebKit::WebProcessProxy::IsPrewarmed)
    (this=<optimized out>, process=..., websiteDataStore=0x7fffe42e4000, isPrewarmed=WebKit::WebProcessProxy::IsPrewarmed::No) at ../Source/WebKit/UIProcess/WebProcessPool.cpp:1044
        initializationActivity = {m_ref = std::unique_ptr<WebKit::ProcessThrottler::Activity<(WebKit::ProcessThrottler::ActivityType)0>> = {get() = 0x0}}
        parameters = <snip here>

If you need any more information, I'll gladly help.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1842473

erik-kz commented 4 years ago

Sorry for the slow response, and thank you very much for reporting the issue. I believe the problem is that we're calling eglQueryString from wl_eglstream_display_bind while holding the external API lock, which leads to a recursive acquire. However, I'm still trying to figure out why this only seems to affect WebKit. I'll investigate a bit more and try to get a patch out as soon as possible.
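The recursive acquire described above can be reproduced in isolation. The sketch below assumes the lock behaves like an error-checking pthread mutex (PTHREAD_MUTEX_ERRORCHECK); second_lock_result is a hypothetical helper written for illustration, not a function from egl-wayland.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper (not part of egl-wayland): locks an error-checking
 * mutex twice from the same thread and returns the result of the second
 * attempt. */
int second_lock_result(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    int rc;
    int second;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&mutex, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_lock(&mutex);     /* first acquire succeeds */
    assert(rc == 0);
    second = pthread_mutex_lock(&mutex); /* re-acquire by the current owner */

    /* Release the single successful acquire and clean up. */
    pthread_mutex_unlock(&mutex);
    pthread_mutex_destroy(&mutex);
    (void)rc; /* silences an unused warning when NDEBUG disables assert */
    return second;
}
```

With an error-checking mutex the second pthread_mutex_lock returns EDEADLK instead of blocking forever. A caller that asserts on any non-zero result would abort at that point, much like the assertion at wayland-thread.c:87 in the trace.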

erik-kz commented 3 years ago

This should be fixed by 9558ec02d0f7bbf30dc1f9ee4c0b06c9b0c49afe

ghost commented 2 years ago

:/

[ammako@arch ~]$ gdb evolution
GNU gdb (GDB) 11.1
Reading symbols from evolution...
(No debugging symbols found in evolution)
(gdb) run
Starting program: /usr/bin/evolution 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7fffeb8fa640 (LWP 2587)]
[New Thread 0x7fffeb0e5640 (LWP 2588)]
[New Thread 0x7fffea8c2640 (LWP 2589)]
[New Thread 0x7fffea0bf640 (LWP 2590)]
[New Thread 0x7fffe8fca640 (LWP 2592)]
[New Thread 0x7fffd3fff640 (LWP 2595)]
[Detaching after fork from child process 2596]
[New Thread 0x7fffd37fe640 (LWP 2598)]
[New Thread 0x7fffd2ffd640 (LWP 2599)]
[New Thread 0x7fffd27fc640 (LWP 2600)]
evolution: ../egl-wayland/src/wayland-thread.c:87: wlExternalApiLock: Assertion `!"failed to lock pthread mutex"' failed.

Thread 1 "evolution" received signal SIGABRT, Aborted.
0x00007ffff6dafd22 in raise () from /usr/lib/libc.so.6
(gdb) bt full
#0  0x00007ffff6dafd22 in raise () at /usr/lib/libc.so.6
#1  0x00007ffff6d99862 in abort () at /usr/lib/libc.so.6
#2  0x00007ffff6d99747 in _nl_load_domain.cold () at /usr/lib/libc.so.6
#3  0x00007ffff6da8616 in  () at /usr/lib/libc.so.6
#4  0x00007fffe852791c in  () at /usr/lib/libnvidia-egl-wayland.so.1
#5  0x00007fffe85282da in  () at /usr/lib/libnvidia-egl-wayland.so.1
#6  0x00007fffe8529178 in  () at /usr/lib/libnvidia-egl-wayland.so.1
#7  0x00007fffe8160195 in  () at /usr/lib/libEGL_nvidia.so.0
#8  0x00007fffe8102382 in  () at /usr/lib/libEGL_nvidia.so.0
#9  0x00007fffe8527ba4 in  () at /usr/lib/libnvidia-egl-wayland.so.1
#10 0x00007fffe8165649 in  () at /usr/lib/libEGL_nvidia.so.0
#11 0x00007fffe8104cfa in  () at /usr/lib/libEGL_nvidia.so.0
#12 0x00007fffefa5d5bb in wpe_fdo_initialize_for_egl_display ()
    at /usr/lib/libWPEBackend-fdo-1.0.so.1
#13 0x00007ffff3d25569 in  () at /usr/lib/libwebkit2gtk-4.0.so.37
#14 0x00007ffff3d30ccc in  () at /usr/lib/libwebkit2gtk-4.0.so.37
#15 0x00007ffff3d30d56 in  () at /usr/lib/libwebkit2gtk-4.0.so.37
#16 0x00007ffff3d3714a in  () at /usr/lib/libwebkit2gtk-4.0.so.37
#17 0x00007ffff3bbe924 in  () at /usr/lib/libwebkit2gtk-4.0.so.37
#18 0x00007ffff3c595f5 in  () at /usr/lib/libwebkit2gtk-4.0.so.37
#19 0x00007ffff723184a in g_type_create_instance ()
    at /usr/lib/libgobject-2.0.so.0
#20 0x00007ffff7219306 in  () at /usr/lib/libgobject-2.0.so.0
#21 0x00007ffff721a79b in g_object_new_valist () at /usr/lib/libgobject-2.0.so.0
#22 0x00007ffff3c52815 in webkit_settings_new_with_settings () at /usr/lib/libwebkit2gtk-4.0.so.37
#23 0x00007ffff712da38 in e_web_view_get_default_webkit_settings () at /usr/lib/evolution/libevolution-util.so
#24 0x00007ffff712db14 in  () at /usr/lib/evolution/libevolution-util.so
#25 0x00007ffff7219258 in  () at /usr/lib/libgobject-2.0.so.0
#26 0x00007ffff721a79b in g_object_new_valist () at /usr/lib/libgobject-2.0.so.0
#27 0x00007ffff721acfa in g_object_new () at /usr/lib/libgobject-2.0.so.0
#28 0x00007fffe92e2e4a in  () at /usr/lib/evolution/libevolution-mail.so
#29 0x00007ffff72193ef in  () at /usr/lib/libgobject-2.0.so.0
#30 0x00007ffff721a79b in g_object_new_valist () at /usr/lib/libgobject-2.0.so.0
#31 0x00007ffff721acfa in g_object_new () at /usr/lib/libgobject-2.0.so.0
#32 0x00007fffe9517980 in  () at /usr/lib/evolution/modules/module-mail.so
#33 0x00007ffff72193ef in  () at /usr/lib/libgobject-2.0.so.0
#34 0x00007ffff721a79b in g_object_new_valist () at /usr/lib/libgobject-2.0.so.0
#35 0x00007ffff721acfa in g_object_new () at /usr/lib/libgobject-2.0.so.0
#36 0x00007ffff7faa9ba in  () at /usr/lib/evolution/libevolution-shell.so
#37 0x00007fffe951b8c9 in  () at /usr/lib/evolution/modules/module-mail.so
#38 0x00007ffff72193ef in  () at /usr/lib/libgobject-2.0.so.0
#39 0x00007ffff721a79b in g_object_new_valist () at /usr/lib/libgobject-2.0.so.0
#40 0x00007ffff721acfa in g_object_new () at /usr/lib/libgobject-2.0.so.0
#41 0x00007ffff7fad2d9 in  () at /usr/lib/evolution/libevolution-shell.so
#42 0x00007ffff7facbf5 in e_shell_window_get_shell_view () at /usr/lib/evolution/libevolution-shell.so
#43 0x00007ffff7fadc07 in e_shell_window_set_active_view () at /usr/lib/evolution/libevolution-shell.so
#44 0x00007ffff7218fdf in  () at /usr/lib/libgobject-2.0.so.0
#45 0x00007ffff721ae8c in g_object_setv () at /usr/lib/libgobject-2.0.so.0
#46 0x00007ffff721af6c in g_object_set_property () at /usr/lib/libgobject-2.0.so.0
#47 0x00007ffff733c615 in  () at /usr/lib/libgio-2.0.so.0
#48 0x00007ffff733cf6a in g_settings_bind_with_mapping () at /usr/lib/libgio-2.0.so.0
#49 0x00007ffff733d58b in g_settings_bind () at /usr/lib/libgio-2.0.so.0
#50 0x00007ffff7fb05e2 in e_shell_window_private_constructed () at /usr/lib/evolution/libevolution-shell.so
#51 0x00007ffff7fac4ff in  () at /usr/lib/evolution/libevolution-shell.so
#52 0x00007ffff72193ef in  () at /usr/lib/libgobject-2.0.so.0
#53 0x00007ffff721a79b in g_object_new_valist () at /usr/lib/libgobject-2.0.so.0
#54 0x00007ffff721acfa in g_object_new () at /usr/lib/libgobject-2.0.so.0
#55 0x00007ffff7fac5fb in e_shell_window_new () at /usr/lib/evolution/libevolution-shell.so
#56 0x00007ffff7f9bac9 in e_shell_create_shell_window () at /usr/lib/evolution/libevolution-shell.so
#57 0x0000555555558940 in  ()
#58 0x00007ffff7d363e5 in g_main_context_dispatch () at /usr/lib/libglib-2.0.so.0
#59 0x00007ffff7d8a749 in  () at /usr/lib/libglib-2.0.so.0
#60 0x00007ffff7d35a63 in g_main_loop_run () at /usr/lib/libglib-2.0.so.0
#61 0x00007ffff770b86f in gtk_main () at /usr/lib/libgtk-3.so.0
#62 0x00005555555586b3 in main ()
[ammako@arch ~]$ pacman -Qs nvidia
local/egl-wayland 1:1.1.9-1
    EGLStream-based Wayland external platform
local/lib32-nvidia-utils-beta 495.29.05-1
    NVIDIA drivers utilities (32-bit, beta version)
local/lib32-opencl-nvidia-beta 495.29.05-1
    OpenCL implemention for NVIDIA (32-bit, beta version)
local/libvdpau 1.4-1
    Nvidia VDPAU library
local/nvidia-beta-dkms 495.29.05-1
    NVIDIA driver sources for linux (beta version)
local/nvidia-settings-beta 495.29.05-1
    Tool for configuring the NVIDIA graphics driver (beta version)
local/nvidia-utils-beta 495.29.05-1
    NVIDIA drivers utilities (beta version)
local/opencl-nvidia-beta 495.29.05-1
    OpenCL implemention for NVIDIA (beta version)

I've set up an alias, evolution='GDK_BACKEND=x11 evolution', which is good enough for now, but it would be nice to be able to run it natively on Wayland eventually.

ghost commented 2 years ago

Works totally fine with egl-wayland 1.1.7 and the 470.74 drivers, by the way.

erik-kz commented 2 years ago

Thanks for catching this. Looking at the backtrace with debug symbols, https://github.com/NVIDIA/egl-wayland/commit/6c12c934f82b0944805b2690390499de3b2fa859 appears to have caused the regression. Re-opening the issue.

andrewathalye commented 2 years ago

Yep, can confirm this regression on Gentoo as well using 1.9-1.

andrewathalye commented 2 years ago

If you disable the assertion and add some debug logging, it looks like the lock/unlock process fails twice: after wlEglAcquireDisplay and after wlEglReleaseDisplay, respectively. After the initial Evolution launch, no further mutex issues occur in my limited testing, and Wayland works fully. This is definitely not a "fix", but rather a simple workaround for Evolution specifically. I fear there might be some sort of race condition causing the mutex to get unlocked by the wrong thread.
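For comparison, a single thread can produce one failed lock and one failed unlock without any race. The sketch below assumes an error-checking pthread mutex and a caller that ignores the failed nested lock, as happens once the assertion is disabled. nested_result and nested_lock_sequence are hypothetical names used only for illustration, and whether this sequence is what the attached log shows is not established here.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Hypothetical result record, for illustration only. */
struct nested_result {
    int inner_lock;   /* result of the nested lock attempt */
    int inner_unlock; /* result of the nested unlock */
    int outer_unlock; /* result of the outer unlock */
};

/* Hypothetical helper: outer lock, nested lock, nested unlock, outer unlock,
 * all on one thread, with every failure ignored. */
struct nested_result nested_lock_sequence(void)
{
    struct nested_result r = {0, 0, 0};
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&mutex, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&mutex);                    /* outer acquire succeeds */
    r.inner_lock = pthread_mutex_lock(&mutex);     /* nested acquire fails */
    r.inner_unlock = pthread_mutex_unlock(&mutex); /* releases the outer hold */
    r.outer_unlock = pthread_mutex_unlock(&mutex); /* nothing left to release */

    pthread_mutex_destroy(&mutex);
    return r;
}
```

The nested lock reports EDEADLK, the nested unlock succeeds and releases the only real hold, and the outer unlock then reports EPERM because the thread no longer owns the mutex. This does not rule out a cross-thread race; it only shows that a recursive acquire alone is enough to produce both failures.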

log.txt

This is a full log of evolution starting and closing. Sorry about not including anything else useful, I'm not super familiar with C debugging.

3kinox commented 2 years ago

I have the same bug with gnome-boxes on Fedora 35. GDK_BACKEND=x11 also works around the issue here.

cubanismo commented 2 years ago

Thanks for the log, that's helpful.

paveloom commented 2 years ago

Can confirm the regression on Fedora 35 + GNOME + Wayland + Nvidia.

gunnarhj commented 2 years ago

Issue confirmed on Ubuntu's development version as well (yelp in my case).

hbkfabio commented 2 years ago

Can confirm this too on Fedora 35 with a GNOME Wayland session and the NVIDIA driver.

ananthb commented 2 years ago

Same issue on Arch Linux with egl-wayland (1:1.1.9+2+gdaab854-1) and nvidia (495.44-9).

ghost commented 2 years ago

We know. 1.1.9 is broken. That's it. We don't need to get pinged via email every time someone runs into the known issue.

Please refrain from commenting unless you have useful information to share towards fixing the issue, and be patient.

tannisroot commented 2 years ago

Our app is affected by this, and we are tempted to include the GDK_BACKEND=x11 workaround in the next release. Would that be a good idea, or should we just wait for a fix, if one is coming soon-ish (within a few weeks or so)?

ghost commented 2 years ago

Depends how you implement it. Check for NVIDIA first. If the user is running NVIDIA, check whether the session is X11 or Wayland. If it's Wayland, check the egl-wayland version, and only if it's 1.1.9 have the code apply the workaround. That way the workaround is only applied on the specific driver and package versions that are problematic, while leaving everything else unaffected.
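The checks above could be combined along these lines. This is a minimal sketch, not a tested recipe: needs_x11_workaround and apply_x11_workaround are hypothetical names, and the sketch assumes the caller has already detected the driver vendor, the session type (for example from the XDG_SESSION_TYPE environment variable), and the upstream egl-wayland version string. How to detect those is distro-specific and left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns true only for the combination reported as
 * broken in this thread (NVIDIA driver, Wayland session, egl-wayland 1.1.9).
 * All three inputs are assumed to be detected by the caller. */
bool needs_x11_workaround(bool nvidia_driver,
                          const char *session_type,
                          const char *egl_wayland_version)
{
    if (!nvidia_driver) {
        return false;
    }
    if (session_type == NULL || strcmp(session_type, "wayland") != 0) {
        return false;
    }
    if (egl_wayland_version == NULL) {
        return false;
    }
    /* Assumes a bare upstream version string with no distro epoch or
     * revision attached. */
    return strcmp(egl_wayland_version, "1.1.9") == 0;
}

/* Hypothetical helper: must run before the toolkit is initialised.
 * overwrite = 0 keeps any GDK_BACKEND value the user set explicitly. */
int apply_x11_workaround(void)
{
    return setenv("GDK_BACKEND", "x11", 0);
}
```

An application would call apply_x11_workaround() early in main only when needs_x11_workaround(...) returns true. One caveat: distributions may ship the fix under a 1.1.9 package revision, so a check against the bare upstream version can still apply the workaround on a build that is already fixed.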

erik-kz commented 2 years ago

I have a fix prepared and just got sign-off on internal code review. I'll upload it to this GitHub repo early next week.

erik-kz commented 2 years ago

Fixed by 582b2d345abaa0e313cf16c902e602084ea59551

MrTomRod commented 2 years ago

Fix confirmed! Thanks a lot! This is how to install this version of egl-wayland (1.1.9-3) on Fedora 35:

sudo dnf update --enablerepo=updates-testing egl-wayland

andrewathalye commented 2 years ago

Confirmed on Gentoo as well. >=1.1.8 is currently hardmasked due to Plasma issues, but if you have 1.1.9 installed and want to apply this fix, the attached patch, pthread_mutex.txt (rename it to .patch), can be placed in /etc/portage/patches/gui-libs/egl-wayland/. Note that you'll probably want to remove it after Gentoo updates (most likely whenever 1.2.0 / 1.1.10 is released here).

gunnarhj commented 2 years ago

On Debian testing and unstable, as well as Ubuntu's development version (the upcoming 22.04), version 1:1.1.9-1.1 with the fix applied is available:

sudo apt update
sudo apt install libnvidia-egl-wayland1