NVIDIA / egl-wayland

The EGLStream-based Wayland external platform
MIT License
275 stars 44 forks source link

Newer versions cause GTK4 applications to stop updating their UI #54

Open myownfriend opened 2 years ago

myownfriend commented 2 years ago

I initially thought this was a Mutter issue until they suggested downgrading egl-wayland to 1.1.7 and that fixed it.

https://gitlab.gnome.org/GNOME/mutter/-/issues/2241#note_1458542

The issue is that Gnome Software and Celluloid, both GTK4 applications, will stop updating their UI. The applications will appear to freeze but it's just because the visuals stopped updating. All the buttons, scrolling, etc, all work fine behind that.

With Gnome Software it will happen when downloading updates. It will freeze but I can still click the list of updates and get a pop-ups for the changes to each application. Scrolling a down long list will show no visual representation of scrolling but click on items in the list will show you changes for applications higher up on the list. Closing the application by clicking the X in the top right corner also works.

Celluloid crashes just by jumping the play head around a file. It will happen after a few tries. After that, you can still jump around, play other files, etc. but there will be no visual change.

I'm on Fedora 36 and experienced the issue with egl-wayland-1.1.9-4 on my GTX 1070.

erik-kz commented 1 year ago

It looks like the code added by this commit https://mail.gnome.org/archives/commits-list/2016-May/msg04276.html doesn't interact well with the buffer release thread added in 1.1.8. Qt had a somewhat similar issue, which was eventually fixed by https://codereview.qt-project.org/c/qt/qtwayland/+/297975

The basic issue is that if multiple threads call wl_display_prepare_read, any calls to wl_display_read_events will block until all other threads have also called that function (or wl_display_cancel_read).

jadahl commented 1 year ago

There seems to be a missing wl_display_flush() before polling in wlEglHandleImageStreamEvents(). Does adding that help?

erik-kz commented 1 year ago

There seems to be a missing wl_display_flush() before polling in wlEglHandleImageStreamEvents()

Did you mean in buffer_release_thread? If so, yeah, I noticed that too, but unfortunately adding it doesn't seem to help.

What appears to be triggering the freeze is GTK closing a pop-up window (e.g. a tool tip). It happens pretty often, but not 100% of the time, so I'm thinking it's some kind of threading issue.

Here are the last few messages from WAYLAND_DEBUG before the freeze (afterwards, no further traffic happens over the Wayland socket). wl_surface@30 appears to be the main window, and wl_surface@49 is the pop-up.

I notice that the last frame callback for the main surface (51) is getting deleted, but never actually completes. That seems kinda suspicious. Also, that discarded event for the pop-up surface at the end is weird. Both of these can be seen every time the freeze occurs.

[1661928.327] -> wl_surface@30.frame(new id wl_callback@51) [1661928.520] wl_callback@44.done(32415328) [1661928.852] xdg_wm_base@26.ping(32415348) [1661928.877] -> xdg_wm_base@26.pong(32415348) [1661928.902] wl_pointer@10.motion(32415348, 84.94140625, 434.84765625) [1661928.952] wl_pointer@10.frame() [1661929.028] -> wl_surface@30.frame(new id wl_callback@44) [1661929.116] -> wl_surface@30.attach(wl_buffer@50, 0, 0) [1661929.152] -> wl_surface@30.damage(0, 0, 668, 470) [1661929.195] -> wl_surface@30.commit() [1661929.209] -> wl_display@1.sync(new id wl_callback@45) [1661929.532] wl_display@1.delete_id(45) [1661929.561] wl_buffer@47.release() [1661929.582] wl_callback@45.done(116) [1661934.854] wl_display@1.delete_id(51[1661934.866] wl_display@1.delete_id(44) ) [1661936.378] -> wl_buffer@53.destroy() [1661936.557] -> xdg_popup@46.destroy() [1661936.588] -> xdg_surface@42.destroy() [1661936.614] -> wl_surface@49.destroy() [1661955.943] discarded [unknown]@49.[event 1](0 fd, 12 byte) [1661955.968] wl_display@1.delete_id(53) [1661955.988] wl_display@1.delete_id(46) [1661955.992] wl_display@1.delete_id(42) [1661955.995] wl_display@1.delete_id(49)

erik-kz commented 1 year ago

After spending some more time trying to make sense of this, I have a slightly better idea what's going on, although still not entirely clear. GTK's frame scheduling / event handling logic is pretty complicated.

It seems that GTK is getting stuck in a state where events are paused, so it's not reading or dispatching any events from the Wayland socket (i.e. gdk_event_source_prepare / check are just returning immediately).

If I understand correctly, it ought to unpause itself on the next frame-clock cycle, which is triggered by a wl_surface::frame event. That event is getting delivered by the compositor, but it's just sitting in the wl_display's event queue, it never gets dispatched.

I'm not sure why this is triggered specifically by hiding a pop-up surface, or why it only happens with recent versions of egl-wayland. The main change that seems relevant is that, as of version 1.8, we have a separate thread that's continuously polling the Wayland socket for buffer release events, but this would have the side-effect of also reading in and queueing GTK's events. So maybe this is breaking some implicit assumption in how GTK's event loop works? I.E. it's now possible that there are events to dispatch even if they weren't read by GTK itself.

I've fiddled around to try and fix it, and the only thing that helped was calling _gdk_wayland_display_queue_events from gdk_event_source_check, even if events are paused, but I don't think that's correct as an actual solution.

erik-kz commented 1 year ago

To clarify, the mentioned hack was in GTK's code, not egl-wayland, and I have a feeling it would break other things. It was more of an experiment than a real fix.

However I do believe the real fix will ultimately need to be in GTK. We're not doing anything wrong in egl-wayland as far as I can tell. I suspect we're just violating some implicit assumption GTK is making related to reading events from the Wayland socket.

Do you have links to any of the other bugs you mentioned? Has anyone on the GTK side looked into this? As I said, I've tried to debug it to some extent, but admittedly don't fully understand all of the subtleties of GTK's event loop.

cubanismo commented 1 year ago

It looks like @chergert is closing all lockups related to egl-wayland based on perceived issue with its use of a damage thread, that aren't made clear anywhere I can see, and may or may not be related to this issue. This was also alluded to in #50, but no details were given there either. @chergert, could you elaborate on why you think the NVIDIA driver or egl-wayland are at fault here?

erik-kz commented 1 year ago

If I forcibly disable the damage thread I can still reproduce this, so I'm inclined to think it's not the problem here. GTK just ends up continually looping between gdk_event_source_prepare and gdk_event_source_check, with both of those returning false because events are paused for the display.

Also, if it were an issue with the damage thread that wouldn't explain why it only started happening with version 1.1.8 since we used a damage thread well before that.

wroyca commented 1 year ago

@erik-kz @cubanismo https://github.com/chergert is no longer active on GitHub, so it will not be possible to join him through tagging here. I took the initiative to inform him and here is what he said about it on https://gitlab.gnome.org/GNOME/gnome-builder/-/issues/1718:

That said, I believe NVIDIA did some additional troubleshooting and found that it is related to buffer size changes and how they had deferred work to the next damage/swap-buffer. Given that it's EGL streams, and the clients may not send more work until notified of the size change, there is nothing to pump their EGL wrapper to force another swap-buffer.

Some stuff was reverted, I don't know the state beyond that.

To contact him further, the best would probably be to join him on IRC / Matrix :+1:

erik-kz commented 1 year ago

Discussed this on Matrix, and Sebastian Keller informed me of a very similar-looking issue reported on GTK's GitLab page https://gitlab.gnome.org/GNOME/gtk/-/issues/4941

This closed MR https://gitlab.gnome.org/GNOME/gtk/-/merge_requests/4856 has a good explanation of the root-cause, although in the end they went with https://gitlab.gnome.org/GNOME/gtk/-/merge_requests/4858 as a fix.

Indeed, I am no longer able to reproduce the hang using a build of GTK from the main branch. Therefore, I believe this issue is the same as the one above. I suspect our egl-wayland library was never directly involved at all, the addition of the buffer release thread may have simply altered the timing of some events which made the problem reproduce more readily.

erik-kz commented 1 year ago

I should also note that the GTK fix has not yet landed in a release, so hopefully once it does someone else can confirm that it does definitely resolve this issue.

wroyca commented 1 year ago

Discussed this on Matrix, and Sebastian Keller informed me of a very similar-looking issue reported on GTK's GitLab page https://gitlab.gnome.org/GNOME/gtk/-/issues/4941

This closed MR https://gitlab.gnome.org/GNOME/gtk/-/merge_requests/4856 has a good explanation of the root-cause, although in the end they went with https://gitlab.gnome.org/GNOME/gtk/-/merge_requests/4858 as a fix.

Indeed, I am no longer able to reproduce the hang using a build of GTK from the main branch. Therefore, I believe this issue is the same as the one above. I suspect our egl-wayland library was never directly involved at all, the addition of the buffer release thread may have simply altered the timing of some events which made the problem reproduce more readily.

@erik-kz The problem is unfortunately still present and recurring for me using a GTK build from the main branch. Can anyone else confirm?

sbstnk commented 1 year ago

I think the bug chased from https://github.com/NVIDIA/egl-wayland/issues/54#issuecomment-1183798933 onwards was indeed the gtk bug that has been fixed, because all the logs and observations match exactly what would be expected from that bug, but I don't think that was the originally reported issue here:

erik-kz commented 1 year ago

@wroyca do you have a reliable way to reproduce the issue? I've tried moving the play head around in Celluloid while watching a video but it did not freeze (with a recent build of GTK).

wroyca commented 1 year ago

@wroyca do you have a reliable way to reproduce the issue? I've tried moving the play head around in Celluloid while watching a video but it did not freeze (with a recent build of GTK).

Gnome builder (local build, not flatpak) autocompletion popover will trigger the hang

https://gitlab.gnome.org/GNOME/gnome-builder/-/issues/1718

sbstnk commented 1 year ago

The builder issue might also be a different issue than what was originally reported here. Does it maybe trigger with egl-wayland 1.1.7 as well?

wroyca commented 1 year ago

The builder issue might also be a different issue than what was originally reported here. Does it maybe trigger with egl-wayland 1.1.7 as well?

I also thought of this and tested 1.1.7 beforehand, unfortunately it does not trigger with 1.1.7

erik-kz commented 1 year ago

Correct me if I'm wrong, but doesn't gnome-builder still use gtk3? Just curious because the original issue seemed to imply it was specific to gtk4.

chergert commented 1 year ago

Correct me if I'm wrong, but doesn't gnome-builder still use gtk3? Just curious because the original issue seemed to imply it was specific to gtk4.

Stable does, until tomorrow when GNOME 43 is released and Nightly will become Stable as 43.0.

https://nightly.gnome.org/repo/appstream/org.gnome.Builder.Devel.flatpakref will get you nightly via Flatpak.

wroyca commented 1 year ago

So I have cleaned up my Fedora 37 installation to Fedora 37 silverblue and I no longer have this problem. Toolbox containers should have GPU support by default, meaning there should be no need to manually specify GPU compatible containers. At first glance, there isn't much difference between my two installations, they have been relatively updated with minimal outside influence, so I can't say why exactly it works with the latter but not with the former. AFAIK both installations use relatively the same libraries, so there must be a small needle in a haystack that was just enough to break egl-wayland before my switch to silverblue.

wroyca commented 1 year ago

@erik-kz I tried again with the default installation of Fedora 37 and, sure enough, it works fine. Something probably slipped in accidentally, even if I had compiled both myself. Regardless, this issue can be closed now

EDIT: Nevermind. I can reproduce it on Fedora 37. Alas.