ValveSoftware / gamescope

SteamOS session compositing window manager

gamescope crashes on steamdeck when pulling up the keyboard (steam+x) on dwarf fortress #724

Open kelvie opened 1 year ago

kelvie commented 1 year ago

More details (2 core dumps attached to the steam-for-linux ticket):

https://github.com/ValveSoftware/SteamOS/issues/945

Here's a stack trace https://gist.github.com/kelvie/8ccffb3bddf53c6bbf7618295b789b96

It seems it's happening to more than just me:

https://old.reddit.com/r/SteamDeck/comments/zv5sys/anyone_else_getting_a_full_system_crash_when/

kelvie commented 1 year ago

Looks like it's in wlroots? assertion=0x56484c13b028 "wl_resource_instance_of(resource, &wl_surface_interface, &surface_implementation)", file=0x56484c137ce7 "types/wlr_surface.c", line=612, function=0x56484c13c160 "wlr_surface_from_resource") at assert.c:101

Using whatever debug symbols I can find from debuginfod.elfutils.org:

Using host libthread_db library "/usr/lib/libthread_db.so.1".
Core was generated by `gamescope --generate-drm-mode fixed --xwayland-count 2 -w 1280 -h 800 --default'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
44            return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
[Current thread is 1 (Thread 0x7f01a77fe6c0 (LWP 1233))]
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007f01c24f96b3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007f01c24a9958 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f01c249353d in __GI_abort () at abort.c:79
#4  0x00007f01c249345c in __assert_fail_base
    (fmt=0x7f01c260da50 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x56484c13b028 "wl_resource_instance_of(resource, &wl_surface_interface, &surface_implementation)", file=0x56484c137ce7 "types/wlr_surface.c", line=612, function=<optimized out>) at assert.c:92
#5  0x00007f01c24a2486 in __GI___assert_fail
    (assertion=0x56484c13b028 "wl_resource_instance_of(resource, &wl_surface_interface, &surface_implementation)", file=0x56484c137ce7 "types/wlr_surface.c", line=612, function=0x56484c13c160 "wlr_surface_from_resource") at assert.c:101
#6  0x000056484c0c2d7a in  ()
#7  0x000056484c097eec in  ()
#8  0x000056484c09bbc7 in  ()
#9  0x00007f01c28382f3 in std::execute_native_thread_routine(void*) (__p=0x56484f261230) at /usr/src/debug/gcc/libstdc++-v3/src/c++11/thread.cc:82
#10 0x00007f01c24f78fd in start_thread (arg=<optimized out>) at pthread_create.c:442
#11 0x00007f01c2579a60 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb)

kelvie commented 1 year ago

I can trigger this pretty reliably, so if you can provide the debug symbols or a new gamescope binary/libraries (for Steam Deck version 3.4.2), I can dig deeper.

kelvie commented 1 year ago

And looking at the code, it seems like the only function in wlroots that calls it like this is wlr_surface_from_resource, which is called in 3 places:

gamescope_xwayland_server_t::set_wl_id
gamescope_xwayland_server_t::handle_override_window_content
gamescope_tearing_get_tearing_control

kelvie commented 1 year ago

If I had to guess it'd be the handle_override_window_content call 😁

Looks like it gets called in VkLayer_FROG_gamescope_wsi.cpp. I'd have to learn more about what these layers do, but presumably it's getting called with a non-surface resource (and this is happening during the creation of the keyboard overlay over the xwayland window that runs the game?)

kelvie commented 1 year ago

Reading the handle_override_window_content code, it's probably not that -- the only place I see it being called, it creates a new surface, checks for NULL, then passes it into handle_override_window_content, so it's probably a good surface.

Perhaps it's set_wl_id, which is called here:

https://github.com/Plagman/gamescope/blob/f863708a1f06ca0bb7b35f7b34d5abe4961eddbb/src/steamcompmgr.cpp#L3717

Maybe we need to check if the surface is valid before doing this?

This is being called in an X11 message handler:

https://github.com/Plagman/gamescope/blob/f863708a1f06ca0bb7b35f7b34d5abe4961eddbb/src/steamcompmgr.cpp#L3846

kelvie commented 1 year ago

Are there build instructions for how to build gamescope so that it works on steamOS? I tried building it myself from master, and SteamOS didn't launch for whatever reason. If I can reproduce the crash with debug symbols, I can dig deeper into this.

kelvie commented 1 year ago

Ah, I was building the wrong branch. I built it from jupiter/3.4, and reproduced this inside GDB.

(gdb) bt
#0  0x00007f3c35d6e64c in  () at /usr/lib/libc.so.6
#1  0x00007f3c35d1e958 in raise () at /usr/lib/libc.so.6
#2  0x00007f3c35d0853d in abort () at /usr/lib/libc.so.6
#3  0x00007f3c35d0845c in  () at /usr/lib/libc.so.6
#4  0x00007f3c35d17486 in  () at /usr/lib/libc.so.6
#5  0x00005634211e1cbf in wlr_surface_from_resource (resource=0x56342245db20) at ../subprojects/wlroots/types/wlr_surface.c:612
#6  0x000056342115f4ee in gamescope_xwayland_server_t::set_wl_id(wlserver_x11_surface_info*, unsigned int)
    (this=0x563422841190, surf=0x7f3c140a87c8, id=80) at ../src/wlserver.cpp:1296
#7  0x000056342113d054 in handle_wl_surface_id(xwayland_ctx_t*, win*, uint32_t) (ctx=0x7f3c14000f30, w=0x7f3c140a86b0, surfaceID=80)
    at ../src/steamcompmgr.cpp:3675
#8  0x000056342113d52f in handle_client_message(xwayland_ctx_t*, XClientMessageEvent*) (ctx=0x7f3c14000f30, ev=0x7f3c1bffe920)
    at ../src/steamcompmgr.cpp:3803
#9  0x00005634211413a8 in dispatch_x11(xwayland_ctx_t*) (ctx=0x7f3c14000f30) at ../src/steamcompmgr.cpp:4978
#10 0x000056342114396a in steamcompmgr_main(int, char**) (argc=28, argv=0x7ffc926e1a78) at ../src/steamcompmgr.cpp:5567
#11 0x000056342115b7d9 in steamCompMgrThreadRun(int, char**) (argc=28, argv=0x7ffc926e1a78) at ../src/main.cpp:602
#12 0x000056342115bf21 in std::__invoke_impl<void, void (*)(int, char**), int, char**>(std::__invoke_other, void (*&&)(int, char**), int&&, char**&&) (__f=@0x5634227fbd18: 0x56342115b79f <steamCompMgrThreadRun(int, char**)>) at /usr/include/c++/12.2.0/bits/invoke.h:61
#13 0x000056342115be62 in std::__invoke<void (*)(int, char**), int, char**>(void (*&&)(int, char**), int&&, char**&&)
     (__fn=@0x5634227fbd18: 0x56342115b79f <steamCompMgrThreadRun(int, char**)>) at /usr/include/c++/12.2.0/bits/invoke.h:96
#14 0x000056342115bd95 in std::thread::_Invoker<std::tuple<void (*)(int, char**), int, char**> >::_M_invoke<0ul, 1ul, 2ul>(std::_Index_tuple<0ul, 1ul, 2ul>) (this=0x5634227fbd08) at /usr/include/c++/12.2.0/bits/std_thread.h:252
#15 0x000056342115bd32 in std::thread::_Invoker<std::tuple<void (*)(int, char**), int, char**> >::operator()() (this=0x5634227fbd08)
    at /usr/include/c++/12.2.0/bits/std_thread.h:259
#16 0x000056342115bd16 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(int, char**), int, char**> > >::_M_run()
    (this=0x5634227fbd00) at /usr/include/c++/12.2.0/bits/std_thread.h:210
#17 0x00007f3c360ad2f3 in std::execute_native_thread_routine(void*) (__p=0x5634227fbd00) at /usr/src/debug/gcc/libstdc++-v3/src/c++11/thread.cc:82
#18 0x00007f3c35d6c8fd in  () at /usr/lib/libc.so.6
#19 0x00007f3c35deea60 in  () at /usr/lib/libc.so.6

Looks like the resource passed is:

(gdb) up 5
#5  0x00005634211e1cbf in wlr_surface_from_resource (resource=0x56342245db20) at ../subprojects/wlroots/types/wlr_surface.c:612
612     ../subprojects/wlroots/types/wlr_surface.c: No such file or directory.
(gdb) print resource
$1 = (struct wl_resource *) 0x56342245db20
(gdb) print *resource
$3 = {object = {interface = 0x5634212a0f20 <gamescope_surface_tearing_control_v1_interface>, 
    implementation = 0x5634212a1060 <surface_tearing_control_impl>, id = 80}, destroy = 0x0, link = {prev = 0x0, next = 0x0}, destroy_signal = {
    listener_list = {prev = 0x56342245db50, next = 0x56342245db50}}, client = 0x5634225ef820, data = 0x5634217dddb0}
(gdb) 

This is the check that gets asserted (and aborts):

https://chromium.googlesource.com/external/wayland/wayland/+/refs/heads/1.5/src/wayland-server.c#627

From the code, it looks like it's expecting a wl_surface_interface resource with surface_implementation, but what it got is a gamescope_surface_tearing_control_v1_interface resource with surface_tearing_control_impl.
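
To make the failing check concrete, here is a minimal sketch of what an instance-of test like this compares. Everything in it is a stand-in: `Interface`, `Object`, `Resource`, the `k...` constants and `try_surface_from_resource` are hypothetical names for illustration, not the libwayland or wlroots definitions, and the real check is simplified here to plain pointer comparison.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified mirrors of the structs visible in the gdb dump.
struct Interface { const char *name; };
struct Object {
    const Interface *iface;      // which protocol interface the object speaks
    const void *implementation;  // which handler table was attached to it
    uint32_t id;                 // per-client object ID (80 in the dump)
};
struct Resource { Object object; };

// Stand-ins for the two interface/implementation pairs named in the trace.
static const Interface kSurfaceInterface{"wl_surface"};
static const Interface kTearingInterface{"gamescope_surface_tearing_control_v1"};
static const int kSurfaceImpl = 0;  // only the address matters
static const int kTearingImpl = 0;  // distinct object, so a distinct address

// Assumed shape of the check: both the interface and the implementation
// pointer must match what the caller expects.
bool resource_instance_of(const Resource &r, const Interface *iface,
                          const void *impl) {
    return r.object.iface == iface && r.object.implementation == impl;
}

// A non-asserting variant: returns nullptr instead of aborting when the
// resource is not a surface, so a caller could skip a stale ID.
const Resource *try_surface_from_resource(const Resource *r) {
    if (r == nullptr) {
        return nullptr;
    }
    if (!resource_instance_of(*r, &kSurfaceInterface, &kSurfaceImpl)) {
        return nullptr;
    }
    return r;
}
```

With a resource built like the one in the dump (tearing-control interface and implementation, id 80), `try_surface_from_resource` returns `nullptr`, whereas the asserting version aborts the whole compositor. Whether silently skipping is the right behaviour for gamescope is a separate question; this only shows why the assertion fires.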

Digging into the handler, it seems this reacts to the WL_SURFACE_ID message, which leads to this:

https://gitlab.freedesktop.org/xorg/xserver/-/issues/1157

I guess this is what's happening, and it appears it was fixed by https://gitlab.freedesktop.org/xorg/xserver/-/merge_requests/976 which was merged 3 months ago?

kelvie commented 1 year ago

Oh, looking at that issue, it was reported by one of the gamescope devs, @emersion -- is this what we're seeing here?

kelvie commented 1 year ago

OK, and even if we had the latest xserver, it looks like all it does is implement a new protocol. We would still need to deprecate the use of WL_SURFACE_ID in gamescope to avoid this race.

misyltoad commented 1 year ago

I am assuming you have Decky Loader or something installed on your Steam Deck which is the culprit that triggers this behaviour.

misyltoad commented 1 year ago

We should move to the new system in Gamescope, but the pieces have just landed.

kelvie commented 1 year ago

@Joshua-Ashton I don't have decky loader or any of that nonsense, this is pretty stock. I do have one of the official keyboard themes though.

misyltoad commented 1 year ago

Hmmm, that's interesting then. We should try and ship the fix in Gamescope soon either way.

kelvie commented 1 year ago

Thank you. Is there a workaround, like "wait for the keyboard to close for 5 seconds" or something? Presumably this is the keyboard window getting destroyed so fast that its ID is getting reused, right? (Peeking at the CPU in top while SSH'd in, popping up the keyboard seems to use a lot of CPU.)
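
To spell out the ID-reuse guess, here is a small model of the suspected ordering. It is only a sketch under assumptions: `ObjectTable`, `Kind` and `replay_race` are made-up names, the exact sequence of events is a guess, and the only facts taken from the trace are the ID (80) and the type of object that ended up behind it.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>

// Hypothetical kinds of per-client objects relevant to this crash.
enum class Kind { Surface, TearingControl };

// Toy model of one client's object-ID table: IDs become free on destroy
// and may be handed out again to a completely different kind of object.
struct ObjectTable {
    std::unordered_map<uint32_t, Kind> live;

    void create(uint32_t id, Kind kind) { live[id] = kind; }
    void destroy(uint32_t id) { live.erase(id); }

    std::optional<Kind> lookup(uint32_t id) const {
        auto it = live.find(id);
        if (it == live.end()) {
            return std::nullopt;
        }
        return it->second;
    }
};

// Replays the suspected race and reports what the ID resolves to by the
// time the X11-side message is finally handled.
std::optional<Kind> replay_race() {
    ObjectTable table;
    table.create(80, Kind::Surface);         // keyboard surface gets ID 80
    const uint32_t queued_id = 80;           // WL_SURFACE_ID message in flight
    table.destroy(80);                       // surface destroyed first
    table.create(80, Kind::TearingControl);  // ID 80 reused for another object
    return table.lookup(queued_id);          // handler resolves the stale ID
}
```

In this model `replay_race()` yields `Kind::TearingControl`: the handler asked for a surface by number and got whatever currently owns that number, which matches the resource printed in gdb.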

kelvie commented 1 year ago

If anyone else (like me) wants to play dwarf fortress on their steam deck for however much time off they have left, I put together a hack to make this happen less: https://github.com/kelvie/gamescope/releases/tag/jupiter-3.4-kelvie

kelvie commented 1 year ago

Any progress with the fix? Just as a side note, my patch has not encountered the keyboard crash even once in many more hours of playing and popping the keyboard up and down, so it may be worth pushing it in the interim as it seems like this bug affects a lot of other people.

uramer commented 1 year ago

This issue has made the on-screen keyboard completely unusable in games for me, which in turn has made many games unplayable. So yes, it would be great to have even a hacky workaround fix.

kelvie commented 1 year ago

I think I see the changes have landed on master -- is there any indication when this will land in a steam deck update? I do understand there are a bunch of upstream dependencies that need to be sorted first.