WayfireWM / wayfire

A modular and extensible wayland compositor
https://wayfire.org/
MIT License
2.35k stars, 174 forks

segfault when displays power off from being idle too long #2005

Closed k9spud closed 6 months ago

k9spud commented 10 months ago

I have a dual monitor system with two discrete AMD RX 550 video cards. When I leave my system idle for a long time, the monitors automatically power down. When this happens, Wayfire crashes with a segmentation fault. Here's the output of "wayfire -d" when the last monitor powers down and wayfire segfaults:

II 04-11-23 22:56:43.392 - [backend/drm/drm.c:1335] Scanning DRM connector 70 on /dev/dri/card0
II 04-11-23 22:56:43.393 - [backend/drm/drm.c:1419] 'HDMI-A-1' disconnected
II 04-11-23 22:56:43.393 - [wayfire-0.8.0/src/core/output-layout.cpp:1088] remove output: HDMI-A-1
EE 04-11-23 22:56:43.393 - [wayfire-0.8.0/src/core/output-layout.cpp:471] disabling output: HDMI-A-1
II 04-11-23 22:56:43.395 - [wayfire-0.8.0/src/core/output-layout.cpp:144] transfer views from HDMI-A-1 -> HDMI-A-2
II 04-11-23 22:56:43.400 - [backend/drm/drm.c:590] connector HDMI-A-1: Turning off
EE 04-11-23 22:56:43.447 - [wayfire-0.8.0/src/main.cpp:134] Fatal error: Segmentation fault
EE 04-11-23 22:56:43.475 - #1  _start ??:?
EE 04-11-23 22:56:43.484 - #2  __sigaction ??:?
EE 04-11-23 22:56:43.496 - #3  wf::wl_surface_to_wayfire_view(wl_resource*) ??:?
EE 04-11-23 22:56:43.504 - #4  std::_Function_handler<void (void*), wayfire_foreign_toplevel::init_request_handlers()::{lambda(void*)#3}>::_M_invoke(std::_Any_data const&, void*&&) ??:?
EE 04-11-23 22:56:43.517 - #5  wf::wl_listener_wrapper::emit(void*) ??:?
EE 04-11-23 22:56:43.531 - #6  wl_signal_emit_mutable ??:?
EE 04-11-23 22:56:43.540 - #7  wlr_export_dmabuf_manager_v1_create ??:?
EE 04-11-23 22:56:43.547 - #8  ffi_prep_go_closure ??:?
EE 04-11-23 22:56:43.554 - #9  ffi_closure_free ??:?
EE 04-11-23 22:56:43.562 - #10 ffi_call ??:?
EE 04-11-23 22:56:43.569 - #11 wl_event_loop_get_destroy_listener ??:?
EE 04-11-23 22:56:43.577 - #12 wl_client_destroy ??:?
EE 04-11-23 22:56:43.584 - #13 wl_event_loop_dispatch ??:?
EE 04-11-23 22:56:43.600 - #14 wl_display_run ??:?
EE 04-11-23 22:56:43.612 - #15 main ??:?
EE 04-11-23 22:56:43.630 - #16 __libc_init_first ??:?
EE 04-11-23 22:56:43.648 - #17 __libc_start_main ??:?
EE 04-11-23 22:56:43.671 - #18 _start ??:?

I am using Wayfire 0.8.0. These are the plugins enabled:

plugins = \
  follow-focus \
  alpha \
  autostart \
  command \
  expo \
  fast-switcher \
  foreign-toplevel \
  grid \
  gtk-shell \
  idle \
  move \
  decoration \
  oswitch \
  resize \
  switcher \
  vswitch \
  wayfire-shell \
  window-rules \
  wm-actions

k9spud commented 10 months ago

The same seg fault crash seems to sometimes happen when I manually turn off one of my monitors and then turn it back on.

If I have "vblank_mode=0 glxgears" running and displaying on the monitor that I turn off and back on, the likelihood of the crash occurring increases noticeably.

One of my monitors is actually a TV (Insignia 20" 1920x1080). It always takes a while for the TV firmware to boot back up when powering back on, and this seems to increase the likelihood of getting Wayfire to segfault when powering the TV back on.

My other monitor is a Dell E2318HR 20" computer monitor. It boots much more quickly when powering it on. The seg fault crash does occasionally occur on this monitor, but the crash is a lot harder to get.

ammen99 commented 10 months ago

It would be nice if you could compile Wayfire with address sanitizer and attach a stacktrace from ASAN (it will generally have much more information).

k9spud commented 10 months ago

I originally thought the problem occurs when the monitor is powering down. But actually, using the glxgears trick to quickly trigger the error instead of waiting around forever for idle power down, I'm seeing Wayfire is still running okay when the TV monitor is manually powered off. Wayfire is still running fine on the second monitor (that's still on), up until the TV monitor is manually told to turn back on. That's when Wayfire crashes (immediately) while the TV monitor is booting up.

It's like something in Wayfire thinks everything is good to go on the monitor powering up, but in reality Wayfire should wait longer before doing anything with the monitor powering up because that monitor isn't really ready to operate while the TV's firmware is still booting up the display.

AddressSanitizer:DEADLYSIGNAL
=================================================================
==9273==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000100 (pc 0x56150dd8d5b3 bp 0x0ffcace23110 sp 0x7ffd56d26a00 T0)
==9273==The signal is caused by a READ memory access.
==9273==Hint: address points to the zero page.
    #0 0x56150dd8d5b3 in wf::wl_surface_to_wayfire_view(wl_resource*) (/usr/bin/wayfire+0x3375b3)
    #1 0x7fe5447461c5 in std::_Function_handler<void (void*), wayfire_foreign_toplevel::init_request_handlers()::{lambda(void*)#5}>::_M_invoke(std::_Any_data const&, void*&&) (/usr/lib64/wayfire/libforeign-toplevel.so+0x1b1c5)
    #2 0x56150db5fd31 in wf::wl_listener_wrapper::emit(void*) (/usr/bin/wayfire+0x109d31)
    #3 0x7fe56c1f347b in wl_signal_emit_mutable (/usr/lib64/libwayland-server.so.0+0xa47b)
    #4 0x7fe56c15be2d  (/usr/lib64/libwlroots.so.11+0x7ce2d)
    #5 0x7fe56c8afb1d  (/usr/lib64/libffi.so.8+0x7b1d)
    #6 0x7fe56c8aec9a  (/usr/lib64/libffi.so.8+0x6c9a)
    #7 0x7fe56c8af525 in ffi_call (/usr/lib64/libffi.so.8+0x7525)
    #8 0x7fe56c1f7820  (/usr/lib64/libwayland-server.so.0+0xe820)
    #9 0x7fe56c1f24fd  (/usr/lib64/libwayland-server.so.0+0x94fd)
    #10 0x7fe56c1f5651 in wl_event_loop_dispatch (/usr/lib64/libwayland-server.so.0+0xc651)
    #11 0x7fe56c1f2dd4 in wl_display_run (/usr/lib64/libwayland-server.so.0+0x9dd4)
    #12 0x56150db30dd0 in main (/usr/bin/wayfire+0xdadd0)
    #13 0x7fe56b84ca59  (/lib64/libc.so.6+0x23a59)
    #14 0x7fe56b84cb24 in __libc_start_main (/lib64/libc.so.6+0x23b24)
    #15 0x56150db38d50 in _start (/usr/bin/wayfire+0xe2d50)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/usr/bin/wayfire+0x3375b3) in wf::wl_surface_to_wayfire_view(wl_resource*)
==9273==ABORTING
(EE) failed to read Wayland events: Connection reset by peer
A connection to the bus can't be made
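
One way to read the ASAN report above: a READ fault at a small address such as 0x100, which ASAN flags as being in the zero page, is what typically shows up when code reads a member through a null pointer, because the faulting address is then just the member's offset. The sketch below only illustrates that arithmetic. `FakeSurface` and `address_of_data_read` are hypothetical names, and the layout was chosen so the offset comes out as 0x100; the trace does not say which structure or field was actually involved.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout, chosen only so that `data` sits at offset 0x100.
// This is NOT a real wlroots or Wayfire structure.
struct FakeSurface {
    unsigned char other_fields[0x100];
    void *data;
};

static_assert(offsetof(FakeSurface, data) == 0x100,
              "layout assumption for this illustration");

// Address a read of `base->data` would touch, computed without
// dereferencing anything.
std::uintptr_t address_of_data_read(std::uintptr_t base) {
    return base + offsetof(FakeSurface, data);
}
```

With a null base, `address_of_data_read(0)` evaluates to 0x100, the same shape of address as in the report. That is consistent with a null pointer reaching `wf::wl_surface_to_wayfire_view`, but it does not prove it.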

soreau commented 7 months ago

@k9spud I am wondering if this might be fixed by 3dd16716 because I notice you have idle and not cube in your plugin list. Care to give it a try?

piaccarino commented 6 months ago

If I'm reading https://github.com/WayfireWM/wayfire/commit/3dd16716778735c87fc4ee75eade28cea6757111 correctly, it was merged to master on 2024/01/10. I still have this issue just about every morning on version 0.9.0-4f0cc550 (Feb 12 2024, branch 'master'), also with an AMD GPU (RX 6700 XT). I'm going to check if there are any amdgpu power settings that would prevent going into this state as a workaround. Returning from DPMS works fine when I step away for less than an hour; I'm not sure about longer than that. I did have this issue on rare occasions with Gnome/Wayland, but I don't remember the last time that happened. I will look to see if they figured something out. My debug log looks basically the same, but let me know if I can provide more information.

ammen99 commented 6 months ago

I have a very far-fetched idea as to what could be happening. The following patch might fix it (but somehow I doubt it, because these situations shouldn't happen at all):

diff --git a/src/view/layer-shell/layer-shell.cpp b/src/view/layer-shell/layer-shell.cpp
index 1a24f29b..ea14f6c2 100644
--- a/src/view/layer-shell/layer-shell.cpp
+++ b/src/view/layer-shell/layer-shell.cpp
@@ -447,6 +447,7 @@ std::shared_ptr<wayfire_layer_shell_view> wayfire_layer_shell_view::create(wlr_l

 void wayfire_layer_shell_view::handle_destroy()
 {
+    lsurface->data = nullptr;
     this->lsurface = nullptr;
     on_map.disconnect();
     on_unmap.disconnect();

Otherwise, I am not really sure what could be causing this. The stacktraces from earlier seem to have been generated in release mode, so it would be great if someone could reproduce this with Wayfire compiled with debug symbols and address sanitizer at the same time, to get an idea of exactly where the failure is happening.
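
For readers unfamiliar with the pattern, the idea behind the patch above can be sketched roughly as follows. This is a minimal illustration, not Wayfire's actual code: `Surface`, `View`, and `surface_to_view` are hypothetical stand-ins, and the real types and lookup logic may differ. The point is only that a surface carries a `data` back-pointer to its view, and that clearing it on destroy keeps later lookups from handing out a dangling pointer.

```cpp
#include <cassert>

// Hypothetical stand-ins; NOT the real wlroots/Wayfire types.
struct Surface {
    void *data = nullptr; // back-pointer to the owning view, if any
};

struct View {
    Surface *surface;

    explicit View(Surface *s) : surface(s) {
        assert(s != nullptr);
        surface->data = this; // register the back-pointer
    }

    // Same idea as the one-line patch: clear the back-pointer
    // before forgetting about the surface.
    void handle_destroy() {
        if (surface) {
            surface->data = nullptr;
            surface = nullptr;
        }
    }
};

// Lookup that tolerates a missing surface or a surface whose view is gone.
View *surface_to_view(Surface *s) {
    if (!s) {
        return nullptr;
    }
    return static_cast<View *>(s->data);
}
```

In this sketch, a lookup made after `handle_destroy()` returns `nullptr` instead of a stale `View*`, so callers can check for null rather than crash. Whether this is the path that actually fails here is exactly what a debug build with ASAN would show.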

soreau commented 6 months ago

Also check dmesg after the problem occurs. Maybe the GPU is resetting, in which case a driver upgrade might help.

piaccarino commented 6 months ago

No resets in dmesg or anything out of the ordinary. I have a feeling it is the AMD ultra low power state "feature" that can also cause black screens on return from sleep and hibernate.

I will compile and run whatever you would like to see but I'm going to need some help with the flags for meson with debugging and asan.

soreau commented 6 months ago

@piaccarino Thanks. You will want to set the meson option -Db_sanitize=address,undefined. Then redirect Wayfire's output to a file, with something like wayfire &> ~/wayfire.log. After the problem happens, there should be a backtrace at the end of the log if Wayfire crashed. Upload the log somewhere and post the link here.

piaccarino commented 6 months ago

https://prcl.dev/so1xntxiqlb20ob I just realized that I forgot to add the line to layer-shell.cpp. Going to build again and log.

soreau commented 6 months ago

A few things I found with this new backtrace: 1) Apparently this is a problem with the animation plugin, so it probably will not happen without the animation plugin loaded. 2) I think I was able to reproduce the crash with the wayland backend by clicking the x button to close the window while an animation is happening. 3) Does this patch help at all?

soreau commented 6 months ago

Actually, even though the previous patch may help, after discussing on IRC, @ammen99 came up with this more appropriate patch that will likely be merged, since we can easily reproduce the problem now. Thanks for the backtrace, it helped a great deal.

piaccarino commented 6 months ago

All patches applied, I'll let you know how it goes! Actually just the last one because I stashed the first two and did a clean build, which I imagine is alright. Let me know if you want the earlier ones in or not.

ammen99 commented 6 months ago

Yes, you should need only the last one :)

piaccarino commented 6 months ago

Was able to wake the screens this morning with keyboard input and swaylock-effects was waiting for password as expected. Log shows a successful transition where it previously faulted. I'll have at least one more in the next few hours but otherwise I'll give it a few more days with logging to see if anything else pops out.

II 14-02-24 23:48:28.758 - [backend/drm/drm.c:786] connector DP-1: Turning off
II 14-02-24 23:48:28.864 - [backend/drm/drm.c:786] connector DP-2: Turning off
EE 15-02-24 04:41:56.572 - [src/view/xwayland/xwayland-unmanaged-view.hpp:137] new unmanaged xwayland surface (null) class: (null) instance: (null)
EE 15-02-24 04:41:56.665 - [src/view/xwayland.cpp:71] new xwayland surface notificationtoasts_2_desktop class: steam instance: steamwebhelper
II 15-02-24 06:17:00.547 - [backend/drm/drm.c:782] connector DP-1: Modesetting with 5120x1440 @ 120.000 Hz
II 15-02-24 06:17:00.618 - [backend/drm/drm.c:782] connector DP-2: Modesetting with 5120x1440 @ 120.000 Hz
II 15-02-24 06:17:03.193 - [backend/drm/drm.c:1544] Scanning DRM connector 113 on /dev/dri/card1
II 15-02-24 06:17:03.193 - [backend/drm/drm.c:1631] 'DP-1' disconnected
II 15-02-24 06:17:03.193 - [src/core/output-layout.cpp:1177] remove output: DP-1
EE 15-02-24 06:17:03.193 - [src/core/output-layout.cpp:514] disabling output: DP-1
II 15-02-24 06:17:03.211 - [src/core/output-layout.cpp:145] transfer views from DP-1 -> DP-2
II 15-02-24 06:17:03.246 - [backend/drm/drm.c:786] connector DP-1: Turning off
II 15-02-24 06:17:03.370 - [backend/drm/drm.c:782] connector DP-2: Modesetting with 5120x1440 @ 120.000 Hz
II 15-02-24 06:17:03.814 - [backend/drm/drm.c:1544] Scanning DRM connector 113 on /dev/dri/card1
II 15-02-24 06:17:03.822 - [backend/drm/drm.c:1623] 'DP-1' connected
II 15-02-24 06:17:03.822 - [backend/drm/drm.c:1432] Detected modes:
II 15-02-24 06:17:03.822 - [backend/drm/drm.c:1459]   3840x2160 @ 60.000 Hz (preferred)
II 15-02-24 06:17:03.822 - [backend/drm/drm.c:1459]   5120x1440 @ 60.000 Hz (preferred)