NVIDIA / egl-wayland

The EGLStream-based Wayland external platform
MIT License
275 stars 43 forks source link

Application crashes when using explicit sync #110

Open Molytho opened 1 month ago

Molytho commented 1 month ago

Reference: https://github.com/NVIDIA/egl-wayland/pull/104#issuecomment-2123897621

While I don't use KDE I have similar issues with sway although the only application I encountered such crashes where firefox and thunderbird.

The issue here is a race between sending requests to the compositor:

[3880880.548] -> wl_compositor@36.create_region(new id wl_region@83) [3880880.573] -> wl_region@83.add(0, 0, 1916, 1078) [3880880.584] -> wl_surface@61.set_opaque_region(wl_region@83) [3880880.591] -> wl_region@83.destroy() [3880880.603] -> wl_surface@61.attach(wl_buffer@71, 0, 0) [3880880.631] -> wl_surface@61.commit() [3880880.687] -> wp_linux_drm_syncobj_surface_v1@74.set_acquire_point(wp_linux_drm_syncobj_timeline_v1@53, 0, 8) [3880880.698] -> wp_linux_drm_syncobj_surface_v1@74.set_release_point(wp_linux_drm_syncobj_timeline_v1@82, 0, 2) [3880880.704] -> wl_surface@61.damage(0, 0, 1916, 1078) [3880880.712] -> wl_surface@61.commit() [3880880.732] -> wl_display@1.sync(new id wl_callback@79)

full log

This is a portion of the WAYLAND_DEBUG log when such a crash occurs (from thunderbird). What I imagine happens here is that one thread calls set_opaque_region on the surface while another attaches a new buffer. The commit after the set_opaque_region call happens to be in between attaching the buffer and setting the corresponding syncobj points. This is per definition a protocol violation of wp_linux_drm_syncobj_surface_v1.

So the real issue is that there is no way to send multiple request (like attach, set_acquire_point, set_release_point) atomically to the compositor.

MaLoLHD commented 1 month ago

I have also seen this issue on GNOME. I've also used WAYLAND_DEBUG to get the log for this, here are the last few lines:

[3561462.592] -> wl_compositor@32.create_region(new id wl_region@77) [3561462.607] -> wl_region@77.add(0, 0, 1920, 1048) [3561462.613] -> wl_surface@63.set_opaque_region(wl_region@77) [3561462.617] -> wl_region@77.destroy() [3561462.625] -> wl_surface@63.attach(wl_buffer@74, 0, 0) [3561462.638] -> wl_surface@63.commit() [3561462.646] -> wp_linux_drm_syncobj_surface_v1@68.set_acquire_point(wp_linux_drm_syncobj_timeline_v1@55, 0, 13) [3561462.652] -> wp_linux_drm_syncobj_surface_v1@68.set_release_point(wp_linux_drm_syncobj_timeline_v1@72, 0, 5) [3561462.658] -> wl_surface@63.damage(0, 0, 1920, 1048) [3561462.663] -> wl_surface@63.commit() [3561462.668] -> wl_display@1.sync(new id wl_callback@49) [3561462.792] wl_display@1.delete_id(77) [3561462.801] wl_display@1.error(wp_linux_drm_syncobj_surface_v1@68, 4, "No Acquire point provided") Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Wayland protocol error: wp_linux_drm_syncobj_surface_v1@68: error 4: No Acquire point provided (t=1.03728) [GFX1-]: Wayland protocol error: wp_linux_drm_syncobj_surface_v1@68: error 4: No Acquire point provided

Full log

System: GNOME 46.1 | Mutter 46.1 Fedora 40 | Kernel 6.8.10-300.fc40.x86_64 NVIDIA Driver 555.42.02 installed from NVIDIA's website (not from RPMFusion's packages) NVIDIA Geforce GTX 1060 6GB

Note that I had to add this line to /etc/modprobe.d/nvidia.conf to get the Wayland session to show up in GDM. Regardless, I had to do this for other driver versions as well, and I don't think it has influenced the bug:

options nvidia "NVreg_PreserveVideoMemoryAllocations=1"

Arcitec commented 1 month ago

What I imagine happens here is that one thread calls set_opaque_region on the surface while another attaches a new buffer. The commit after the set_opaque_region call happens to be in between attaching the buffer and setting the corresponding syncobj points. This is per definition a protocol violation of wp_linux_drm_syncobj_surface_v1.

If I understand the log correctly, does it mean that the applications such as Firefox are violating the Wayland explicit sync protocol by committing contents to an explicit sync surface before it's been allocated?

And if that's the case, this is something that needs fixes elsewhere (in GUI toolkits?), not in the NVIDIA driver.

Molytho commented 1 month ago

No, Firefox doesn't knowingly violate the explicit sync protocol. There are likely two threads doing wayland compositor calls in parallel (which is perfectly fine. The functions are thread safe) leading to an invalid sequence of individual calls. It's also not a bug in NVIDIA's code. I sadly don't thinks that it is easily fixable. It likely needs some changes in the wayland-client library and should probably be discussed in wayland's bug tracker.

TsunamiMommy commented 1 month ago

I specifically remember mentions of this exact issue being discussed on the MR for explicit sync. I think the conclusion of that discussion was that the firefox behavior would be marked as a protocol violation. So it'd be up to Mozilla and Thunderbird to fix it. The protocol is working as designed.

MaLoLHD commented 1 month ago

I have found that this issue also happens on KDE Plasma 6.0.5 with GTK4/libadwaita applications when a hamburger menu tries to close. This does not happen on GNOME. I have tested it with Curtail, Paper Clip, Foliate and the libadwaita demo.

Log from the adwaita demo:

[3772322.309] wl_pointer@27.button(2361, 1261758, 272, 1) [3772322.327] wl_pointer@27.frame() [3772322.392] -> wl_pointer@27.set_cursor(2356, wl_surface@31, 4, 4) [3772322.408] -> wl_surface@31.attach(wl_buffer@55, 0, 0) [3772322.422] -> wl_surface@31.set_buffer_scale(1) [3772322.434] -> wl_surface@31.damage(0, 0, 32, 32) [3772322.449] -> wl_surface@31.commit() [3772322.482] -> wl_pointer@27.set_cursor(2356, wl_surface@31, 4, 4) [3772322.503] -> wl_surface@31.attach(wl_buffer@55, 0, 0) [3772322.516] -> wl_surface@31.set_buffer_scale(1) [3772322.525] -> wl_surface@31.damage(0, 0, 32, 32) [3772322.538] -> wl_surface@31.commit() [3772322.559] -> xdg_popup@67.destroy() [3772322.583] -> xdg_surface@66.destroy() [3772322.595] -> wl_surface@62.attach(nil, 0, 0) [3772322.609] -> wl_surface@62.commit() [3772323.809] -> wl_pointer@27.set_cursor(2356, wl_surface@31, 4, 4) [3772323.836] -> wl_surface@31.attach(wl_buffer@55, 0, 0) [3772323.845] -> wl_surface@31.set_buffer_scale(1) [3772323.853] -> wl_surface@31.damage(0, 0, 32, 32) [3772323.874] -> wl_surface@31.commit() [3772328.831] -> wl_surface@36.frame(new id wl_callback@79) [3772328.851] -> wp_presentation@21.feedback(wl_surface@36, new id wp_presentation_feedback@77) [3772328.854] -> wl_surface@36.offset(0, 0) [3772329.031] -> wl_surface@36.attach(wl_buffer@63, 0, 0) [3772329.043] -> wp_linux_drm_syncobj_surface_v1@50.set_acquire_point(wp_linux_drm_syncobj_timeline_v1@51, 0, 24) [3772329.046] -> wp_linux_drm_syncobj_surface_v1@50.set_release_point(wp_linux_drm_syncobj_timeline_v1@56, 0, 8) [3772329.063] -> wl_surface@36.damage(0, 0, 922, 698) [3772329.066] -> wl_surface@36.commit() [3772329.068] -> wl_display@1.sync(new id wl_callback@72) [3772329.177] wl_display@1.delete_id(67) [3772329.183] wl_display@1.delete_id(66) [3772329.186] wl_display@1.error(wp_linux_drm_syncobj_surface_v1@68, 4, "explicit sync is used, but no acquire point is set") Gdk-Message: 08:54:55.679: Error flushing display: Protocol error

Molytho commented 1 month ago

I specifically remember mentions of this exact issue being discussed on the MR for explicit sync. I think the conclusion of that discussion was that the firefox behavior would be marked as a protocol violation. So it'd be up to Mozilla and Thunderbird to fix it. The protocol is working as designed.

Thanks for the note. Found it: https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/90#note_2243522

@MaLoLHD This portion of the log is not very useful. The object (wp_linux_drm_syncobj_surface_v1@68) that violates the protocol is never referenced and we don't know to which surface it is attached. Could you send the full log?

MaLoLHD commented 1 month ago

The object (wp_linux_drm_syncobj_surface_v1@68) that violates the protocol is never referenced and we don't know to which surface it is attached.

Here's the full log

Molytho commented 1 month ago

The object is attached to wl_surface@62 which got a null buffer attached. This is not a protocol violation so it's a bug in kde's implementation.

Arcitec commented 1 month ago

Thanks for the note. Found it: https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/90#note_2243522

That was a really important find. Three people from NVIDIA, the XWayland maintainer, and one of the KDE Explicit Sync developers, and others, are all talking about it there. I've read through every reply and can summarize it as follows:

  1. Wayland is thread-safe, but that brings a risk of doing stupid, protocol-breaking things if you don't synchronize your calls into the proper order manually.
  2. Firefox has two simultaneous rendering threads and is doing stupid, protocol-breaking things and is violating the Wayland protocol. They aren't waiting for the 1st thread's surface allocation before they start writing to it in the 2nd thread.
  3. It's Firefox's bug, not Wayland or NVIDIA or GNOME or KDE etc.
  4. There is no interest in modifying Wayland to allow or safeguard against such protocol violations.

I bet there's a Mozilla bug tracker thread about it somewhere too.

MaLoLHD commented 1 month ago

I bet there's a Mozilla bug tracker thread about it somewhere too.

It seems that it is being discussed here and here on Mozilla's bug tracker.

Zamundaaa commented 1 week ago

No, Firefox doesn't knowingly violate the explicit sync protocol. There are likely two threads doing wayland compositor calls in parallel (which is perfectly fine. The functions are thread safe) leading to an invalid sequence of individual calls. It's also not a bug in NVIDIA's code.

Wayland as the messaging protocol is thread safe, but access to the wl_surface from different threads is not. What Firefox is doing has always been broken, and always had the potential to cause crashes and bugs. With explicit sync it just gets way more chances for that to actually cause visible problems.