elementary / gala

Gala Window Manager for elementary OS and Pantheon
https://elementary.io
GNU General Public License v3.0

Thinkpad X280 and i915 driver: Screen freezes for seconds occasionally #2023

Closed. Letterus closed this issue 2 months ago

Letterus commented 3 months ago

What Happened?

Using the current OS 8 preview/daily in a Wayland session on a ThinkPad X280 with Intel graphics, I observe a complete freeze of the whole screen (including the mouse pointer) for several seconds. During this time the display hangs and does not update, but the last action seems to be processed in the background, and the result (e.g. scrolling down, opening a new window) is presented once the freeze ends. The freezes are not tied to any particular triggering action and occur from time to time. I believe I experienced the same behaviour on my X250 in an X11 session with OS 6.1/7.1 as well.

Any hints on how to debug this? I did not find any helpful messages in /var/log/syslog for the time this issue arose.

Steps to Reproduce

Just use the Pantheon desktop. Behaviour does not seem to relate to any specific action.

I am using the Wayland session with fractional scaling enabled and set to 125%.

Expected Behavior

The UI/screen should not freeze.

OS Version

8.x (Early Access)

Software Version

Latest release (I have run all updates)

Log Output

No response

Hardware Info

ThinkPad X280, Intel graphics.

leolost2605 commented 3 months ago

Just noting things down for future reference:

I don't currently observe, and never have observed, any of the described behavior on gala/wingpanel/dock main, so maybe a hardware issue?

However, during development I've seen similar symptoms arise when misusing the Pantheon Wayland protocol from the client side.

Letterus commented 3 months ago

@leolost2605 Thanks for commenting. Think I experienced the same with X11 and the X250 at least. Don't know if there's anything specific about the Lenovo X series architecture.

Do you think any relation to #2024 is possible, some kind of overflow?

Letterus commented 3 months ago

Further note: I am/was running the Nextcloud desktop client on both machines. Currently it seems the freezes don't occur when I quit the client, but occur more often when I edit and save or move a local file synced via Nextcloud. Does this sound reasonable to you?

Edit: After checking this twice this really seems related to the Nextcloud desktop client. But don't know where to start debugging yet.

Letterus commented 3 months ago

I'm not experiencing this issue anymore since the last updates. I don't know if this is by coincidence. But I'm closing it for now and going to reopen it in case it occurs again.

Letterus commented 3 months ago

Reopening this as occasional hangs keep occurring. I think it's not only the Nextcloud client but any task with heavy IO that leads to the screen becoming stuck for some seconds. I don't know the code, but it seems to me there is some piece connected with IO that's not working asynchronously.

leolost2605 commented 3 months ago

Hmm that could very well be. I think KDE had a similar problem about doing heavy caching. Are you running an HDD by chance?

Letterus commented 3 months ago

Nope, only SSD.

Is there a good way of debugging this? Where could I start digging into the code and maybe add some debug messages?

Letterus commented 3 months ago

Found the following log messages in /var/log/syslog close to the last hang:

2024-08-28T11:25:17.419787+02:00 XinkPad280 geoclue[1485]: Failed to query location: Query location SOUP error: Not Found
2024-08-28T11:26:21.179612+02:00 XinkPad280 kernel: workqueue: delayed_fput hogged CPU for >13333us 128 times, consider switching to WQ_UNBOUND
2024-08-28T11:27:56.455675+02:00 XinkPad280 geoclue[1485]: message repeated 4 times: [ Failed to query location: Query location SOUP error: Not Found]
2024-08-28T11:28:10.571720+02:00 XinkPad280 zeitgeist-datah[2119]: zeitgeist-datahub.vala:210: Error during inserting events: GDBus.Error:org.gnome.zeitgeist.EngineError.InvalidArgument: Incomplete event: interpretation, manifestation and actor are required

Don't know if any of these may cause the issue? geoclue or zeitgeist-datahub?

Letterus commented 3 months ago

During the next hang this appeared again:

2024-08-28T11:50:27.210917+02:00 XinkPad280 zeitgeist-datah[2141]: zeitgeist-datahub.vala:210: Error during inserting events: GDBus.Error:org.gnome.zeitgeist.EngineError.InvalidArgument: Incomplete event: interpretation, manifestation and actor are required

Letterus commented 2 months ago

Edit: It still freezes, even without zeitgeist-datahub running.

Letterus commented 2 months ago

I freshly installed and started GNOME Contacts, which had to load quite a few address books and lots of my contacts, and the whole screen froze again for several seconds. It seems to be related to IO, but it could also be synchronous waits, in Zeitgeist as well as in Evolution Data Server.

Letterus commented 2 months ago

I made it freeze again by using the Starfish app and opening a domain (that was somehow hanging and using lots of CPU cycles, which led to the "app is not answering do you want to kill it?" dialogue).

Log:

2024-08-30T09:21:18.814890+02:00 XinkPad280 gala[1785]: meta_window_set_stack_position_no_sync: assertion 'window->stack_position >= 0' failed
2024-08-30T09:21:43.049638+02:00 XinkPad280 xdg-desktop-por[2351]: g_application_get_resource_base_path: assertion 'G_IS_APPLICATION (application)' failed
2024-08-30T09:21:43.190290+02:00 XinkPad280 xdg-desktop-por[2351]: GtkDialog mapped without a transient parent. This is discouraged.
2024-08-30T09:21:43.199444+02:00 XinkPad280 gala[1785]: meta_window_set_stack_position_no_sync: assertion 'window->stack_position >= 0' failed
2024-08-30T09:21:43.199673+02:00 XinkPad280 gala[1785]: clutter_actor_set_child_below_sibling: assertion 'child->priv->parent == self' failed
2024-08-30T09:21:43.199761+02:00 XinkPad280 gala[1785]: clutter_actor_set_child_above_sibling: assertion 'child->priv->parent == self' failed
2024-08-30T09:21:43.199874+02:00 XinkPad280 gala[1785]: message repeated 2 times: [ clutter_actor_set_child_above_sibling: assertion 'child->priv->parent == self' failed]
2024-08-30T09:21:43.303313+02:00 XinkPad280 gala[1785]: WindowManager.vala:916: No transient found
2024-08-30T09:21:50.461054+02:00 XinkPad280 xdg-desktop-por[2351]: GtkDialog mapped without a transient parent. This is discouraged.
2024-08-30T09:21:50.470182+02:00 XinkPad280 gala[1785]: meta_window_set_stack_position_no_sync: assertion 'window->stack_position >= 0' failed
2024-08-30T09:21:50.470628+02:00 XinkPad280 gala[1785]: clutter_actor_set_child_below_sibling: assertion 'child->priv->parent == self' failed
2024-08-30T09:21:50.470817+02:00 XinkPad280 gala[1785]: clutter_actor_set_child_above_sibling: assertion 'child->priv->parent == self' failed
2024-08-30T09:21:50.471103+02:00 XinkPad280 gala[1785]: message repeated 2 times: [ clutter_actor_set_child_above_sibling: assertion 'child->priv->parent == self' failed]
2024-08-30T09:21:50.514123+02:00 XinkPad280 gala[1785]: WindowManager.vala:916: No transient found
2024-08-30T09:21:57.936407+02:00 XinkPad280 xdg-desktop-por[2351]: GtkDialog mapped without a transient parent. This is discouraged.
2024-08-30T09:21:57.955717+02:00 XinkPad280 gala[1785]: meta_window_set_stack_position_no_sync: assertion 'window->stack_position >= 0' failed
2024-08-30T09:21:57.956572+02:00 XinkPad280 gala[1785]: clutter_actor_set_child_below_sibling: assertion 'child->priv->parent == self' failed
2024-08-30T09:21:57.956833+02:00 XinkPad280 gala[1785]: clutter_actor_set_child_above_sibling: assertion 'child->priv->parent == self' failed
2024-08-30T09:21:57.957192+02:00 XinkPad280 gala[1785]: message repeated 2 times: [ clutter_actor_set_child_above_sibling: assertion 'child->priv->parent == self' failed]
2024-08-30T09:21:57.992184+02:00 XinkPad280 gala[1785]: WindowManager.vala:916: No transient found
2024-08-30T09:22:15.391481+02:00 XinkPad280 systemd[1572]: app-flatpak-hr.from.josipantolis.starfish-23144.scope: Consumed 11.872s CPU time.

Letterus commented 2 months ago

Two further freezes occurred while doing ordinary things like scrolling through mails and opening the browser. In the syslog I only found entries about rtkit-daemon, which I have now disabled to see if it is causing the issues. If it is, would that support the theory that the freezing is related to synchronously handled DBus calls/events?

Letterus commented 2 months ago

Still observing freezes. Maybe they are shorter now, and there are no log messages at those times anymore…

Letterus commented 2 months ago

Coming back to @leolost2605's first proposal: Hardware/driver issue.

From time to time dmesg reports:

[ 1094.058291] workqueue: delayed_fput hogged CPU for >13333us 4 times, consider switching to WQ_UNBOUND
[ 1417.338303] workqueue: delayed_fput hogged CPU for >13333us 8 times, consider switching to WQ_UNBOUND
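
To keep an eye on these warnings, the matching lines can be filtered out of the kernel log. This is only a small sketch: the function name hog_events is made up here, and the pattern is modelled purely on the two dmesg lines above, so other kernel versions may word the message differently.

```shell
#!/bin/sh
# hog_events: read log text on stdin and print only the workqueue
# "hogged CPU" warnings. The pattern is inferred from the lines quoted
# in this comment and is an assumption, not a stable kernel format.
hog_events() {
    grep -E 'workqueue: [^ ]+ hogged CPU for >[0-9]+us'
}

# Example use on the affected machine:
#   sudo dmesg | hog_events
#   hog_events < /var/log/syslog
```

Comparing the timestamps of the printed lines with the moments the screen froze may help judge whether the warnings and the freezes are related.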

Symptoms look like the i915 driver hanging issue documented here: https://bbs.archlinux.org/viewtopic.php?id=246841&p=2

Edit: Currently I'm trying to add i915.enable_psr=0 to the kernel parameters, but I had to do it manually during boot time. Changing it in /etc/default/grub and executing update-grub2 and update-initramfs -u -k all had no effect. I don't know yet why.

Edit 2, note: Check the effect with cat /proc/cmdline and sudo cat /sys/module/i915/parameters/enable_psr.
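
A minimal sketch of that check as a shell helper. The name has_kernel_param and its optional file argument are made up for illustration; only /proc/cmdline and the sysfs path come from the note above.

```shell
#!/bin/sh
# has_kernel_param PARAM [FILE]
# Succeeds if PARAM appears as a whole, space-separated word in FILE.
# FILE defaults to /proc/cmdline; the second argument exists only so the
# helper can be tried against a sample file (an assumption of this sketch).
has_kernel_param() {
    param=$1
    file=${2:-/proc/cmdline}
    # Put one parameter per line, then require an exact whole-line match.
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}

# Example use on the affected machine:
#   has_kernel_param i915.enable_psr=0 && echo "parameter is on the command line"
#   sudo cat /sys/module/i915/parameters/enable_psr   # value the loaded driver uses
```

Checking both places is useful because the command line only shows what the bootloader passed, while the sysfs file shows what the loaded module actually uses.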

Letterus commented 2 months ago

The latter is interesting: neither do the kernel options end up as boot parameters in GRUB, nor is the updated kernel (-41) booted. It's still the old one (-40). What's going on there? /boot/grub/grub.cfg looks updated and correct though.

Letterus commented 2 months ago

I have been working on the machine with the i915.enable_psr=0 kernel boot parameter enabled for some time now, with no freezes at all so far. So it looks as if this really is the driver issue mentioned above (which apparently does not show up with every DE).

I now need to figure out how to make the fix permanent, as configuring GRUB does not work, as pointed out above. Maybe that is a separate issue for another repo?

Letterus commented 2 months ago

Permanent fix works by creating the file /etc/modprobe.d/i915.conf

and entering:

options i915 enable_psr=0

Afterwards make sure to execute: sudo update-initramfs -u -k all

Then reboot.

Check the effect with sudo cat /sys/module/i915/parameters/enable_psr
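
The same steps as a shell sketch. The function name and its optional directory argument are hypothetical, added so the file-writing step can be tried in a scratch directory first; the file name, the option line, and the follow-up commands are the ones given above.

```shell
#!/bin/sh
# write_i915_psr_conf [DIR]
# Writes the option line from this comment to DIR/i915.conf.
# DIR defaults to /etc/modprobe.d, which needs root; the argument is an
# assumption of this sketch so the step can be tested elsewhere.
write_i915_psr_conf() {
    conf_dir=${1:-/etc/modprobe.d}
    printf 'options i915 enable_psr=0\n' > "$conf_dir/i915.conf"
}

# On the real system, as root:
#   write_i915_psr_conf
#   update-initramfs -u -k all
#   reboot
# After the reboot:
#   cat /sys/module/i915/parameters/enable_psr
```

Note that the function overwrites an existing i915.conf in the target directory, so any other i915 options already stored there would need to be merged by hand.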

Still don't know why grub parameters don't work and why it would load the older kernel.

Letterus commented 2 months ago

The resolution to the graphics hangs is described above.

Boot issues are resolved by the last updates as documented in https://github.com/elementary/switchboard-plug-about/issues/335.

Closing this issue as resolved.