NVIDIA / egl-wayland

The EGLStream-based Wayland external platform
MIT License
275 stars 44 forks source link

Incomplete resizes with ≥1.1.10 #57

Open foucault opened 2 years ago

foucault commented 2 years ago

Greetings, I've recently updated to egl-wayland 1.1.10 and I noticed that several applications do not resize properly in a Wayland session. Specifically part of the window is disappearing while actively resizing and it is only fully shown when the resize operation is finished. The examples below are for gnome-weather (GTK 4) but that also happens with QtQuick/Qt5 based applications (wayland backend) and also other applications like alacritty. Please note that in the second example the window is not actually black, this is probably an artifact of the screencap. The contents are still visible although not rearranged until after resizing finished. Latest build from git also exhibits the issue and occasionally the application segfaults outright. Perhaps unsurprisingly the first commit that exhibits the issue is ddaa2720. The issue is not specific to mutter as it can be replicated under KWin+Wayland.

My current system configuration Distribution: Arch Linux NVIDIA Driver version: 515.48.07 Kernel version: 5.18.5 GPU: NVIDIA 3070

egl-wayland 1.1.9 - correct behaviour

egl-wayland 1.1.10

ionenwks commented 2 years ago

Hitting this as well while trying 515.57 + 1070 + gnome/mutter, and was rather bad on my end as it didn't really return to normal after finishing resize becoming unusable.

For now I'm using 1.1.10 with the aforementioned commit ddaa27209a8149c1960c70799c87151206401bbc reverted in Gentoo and everything seems(?) normal (not going back to 1.1.9 to keep wayland vulkan working with 515).

Arcitec commented 1 year ago

Is this fixed now? Or maybe I'm not using the EGL Wayland backend.

My system has egl-wayland-1.1.10-4.20220621git53b6a87.fc36 whose release notes specifically mention your bug report too:

https://bodhi.fedoraproject.org/updates/FEDORA-2022-e0d91b7c0b

I have no idea why they link to your bug report from that update's release notes.

I'm using Wayland with NVIDIA driver 510, and I have no issues while resizing windows.

It's possible that Fedora has removed the bad patch that you're talking about though... Or perhaps I am simply on the GBM backend instead. I dunno if egl-wayland handles GBM too or not.

foucault commented 1 year ago

I don't think so. Looks like they just reverted it on FC36 and FC37.

Arcitec commented 1 year ago

I don't think so. Looks like they just reverted it on FC36 and FC37.

Yep I was just about to post that! I found that file a moment ago. Fedora has reverted the deferred resizes patch.

ionenwks commented 1 year ago

For the record, in case wasn't clear from my previous comment, I did the same for Gentoo a week ago: https://github.com/gentoo/gentoo/blob/37e44543/gui-libs/egl-wayland/files/egl-wayland-1.1.10-revert-defer-resizes.patch

Imagine more distros may do the same and shouldn't use distro packages to test this.

erik-kz commented 1 year ago

This is why I was slightly hesitant about merging that commit, as mentioned here https://github.com/NVIDIA/egl-wayland/pull/53#issuecomment-1106617303

Technically I believe the new behavior is still in line with the spec, from https://www.khronos.org/registry/EGL/sdk/docs/man/html/eglSwapBuffers.xhtml

If the native window corresponding to surface has been resized prior to the swap, surface must be resized to match. surface will normally be resized by the EGL implementation at the time the native window is resized. If the implementation cannot do this transparently to the client, then eglSwapBuffers must detect the change and resize surface prior to copying its pixels to the native window. If surface shrinks as a result of resizing, some rendered pixels are lost. If surface grows, the newly allocated buffer contents are undefined. The resizing behavior described here only maintains consistency of EGL surfaces and native windows; clients are still responsible for detecting window size changes (using platform-specific means) and changing their viewport and scissor regions accordingly.

However, practically speaking, I agree this is a pretty bad regression. It's unfortunate, since the previous behavior was also incorrect, potentially causing hangs for multi-threaded applications. Like I said in the above comment, the proper fix will likely require a new EGL extension to let us resize an EGLStream, but until we're able to get that wired up on the driver side, the best course of action will probably be to revert this commit. I'll get that pushed and tag a new release.

@sulix - sadly, I imagine this will re-break SDL, so just want to make you aware.

Arcitec commented 1 year ago

@erik-kz Thanks for explaining that. Sounds like bad situations all around. Glad you're reverting it for now.

  1. Is the unreliable egl-wayland code (the multithreaded games crashing) used on DEs that use the new NVIDIA GBM Wayland code? Such as GNOME 42? Are DEs affected if they use GBM instead of EGLStreams?
  2. In what component would it be appropriate to fix this later? Re-merging the patch and fixing the DE compositor (such as making GNOME-Shell force repainting every frame)? Or as you say, an EGL extension and having the NVIDIA driver handle redrawing by itself?
erik-kz commented 1 year ago

This issue would be present with both GBM and EGLStream compositors, since both paths in egl-wayland use the same code for handling resizes.

On further reflection, I think this is a fundamental error in the logic of the patch. We're creating a new EGLSurface and then immediately calling eglSwapBuffers before anything had been rendered to that surface, so the first frame presented after a resize will have undefined contents. The section from the spec I included in my last comment would imply we should at least be preserving what was rendered to the old surface, but we're not even doing that.

I don't think it's reasonable to ask libraries to try and work around that, it's definitely our bug to fix.

I'm wondering if we might be able to modify wlEglResizeSurfaceIfRequired to copy the contents of the old surface over to the new one (using some intermediate buffer I guess). That might fix things, although I'm not totally sure. It would, at least, probably be easier to implement than the EGLStream resize extension mentioned above.

sulix commented 1 year ago

Yeah: the black-screen-when-resizing behaviour definitely is a result of this. For SDL, where most windows are not resizable, and the rest are resized rarely, it seemed an obvious tradeoff to avoid a deadlock. Particularly since SDL applications basically always render frames in a tight loop, so there's never a chance of getting "stuck" on a bad one after a resize.

Copying the old buffer to the new one seems the best overall approach to me.

The other workaround is to move the wlEglResizeSurfaceIfRequired call to after the SwapBuffers() (or, more precisely, after unlocking the surface->mutexLock. That makes my test app resize properly (albeit with a one-frame delay), but causes problems with, e.g., maximizing, at least on KWin. That being said, if I'm understanding the comments on [this recent issue with Mesa], this is actually the preferred behaviour, albeit with the caveat that things like glClear() should also reset the buffer size if there's a pending resize.

(One other experiment I did was to totally discard frames with the old size. This was actually pretty stable, but meant things basically would hang while resizing, with no visual feedback of the new size. The black screen seemed more usable to me.)

erik-kz commented 1 year ago

[this recent issue with Mesa]

Did you mean to include a link for this?

My only concern with the copy approach is that the application would still be rendering to the smaller surface before it calls SwapBuffers, so you might end up with a black strip on the edge of the window or something.

sulix commented 1 year ago

Whoops. The link was supposed to be: https://gitlab.freedesktop.org/mesa/mesa/-/issues/6547

As for copying, there's always going to be some problem if an application is rendering into something the wrong size. Assuming we wan't resize the backbuffer on the fly (and even then, if the frame was partially rendered, there'd be glitches on the edges), the choices are to clip the old buffer to the new one's size (potentially getting black strips on the edge), or to scale the image (which could result in some flickery blurriness, particularly around things like subpixel-sampled fonts).

Either of those seem "good-enough" to me for the case where rendering into the old-sized buffer has already begun. The real trick, I suspect, will be recognising that, e.g., glClear() should attach a new buffer with the correct size, as by definition the previous contents can't matter.

erik-kz commented 1 year ago

So just doing a copy from the old surface to the new one in wlEglResizeSurfaceIfRequired helps a little bit, but as expected it does often result in corruption around the edge of the window. It's particularly bad when maximizing / restoring.

We do have infrastructure in our core GL code to detect when a drawable has been resized and update our buffers as necessary, but the issue is that we don't really have any way to trigger this from egl-wayland. That's why I was thinking the "proper" solution to this would require some new EGL extension, basically to flag the EGLStream buffers as out-of-date.

However, absent that (since it would probably take a while), a reasonable compromise could be to have resize_callback check whether the surface is current to the calling thread. If it is, we can do the resize immediately like we were doing before. If not, only then will we defer the resize until the next eglSwapBuffers (and still copy from the old surface so we're not presenting a completely undefined frame).

sulix commented 1 year ago

So just doing a copy from the old surface to the new one in wlEglResizeSurfaceIfRequired helps a little bit, but as expected it does often result in corruption around the edge of the window. It's particularly bad when maximizing / restoring.

Could we clear the destination surface to black before doing the copy? It'd still look wrong, but probably less so than actual corruption.

We do have infrastructure in our core GL code to detect when a drawable has been resized and update our buffers as necessary, but the issue is that we don't really have any way to trigger this from egl-wayland. That's why I was thinking the "proper" solution to this would require some new EGL extension, basically to flag the EGLStream buffers as out-of-date.

Yeah, that's the tricky bit, I think: you'd need some way of either getting a callback when a buffer can be safely replaced (after glClear(), swapbuffers, etc.), or to move as much of the resize logic into the driver as possible, and have egl just request that the existing surface/buffer be resized, and the fact that this involves recreating things is totally hidden from view. Either way, as you say, it'd need a new extension.

However, absent that (since it would probably take a while), a reasonable compromise could be to have resize_callback check whether the surface is current to the calling thread. If it is, we can do the resize immediately like we were doing before. If not, only then will we defer the resize until the next eglSwapBuffers (and still copy from the old surface so we're not presenting a completely undefined frame).

I tried this, and it seems to work pretty well. It's definitely not a perfect solution, but it should fix this issue for everything except the multithreaded applications for which #53 was an issue in the first place, so it's a strict improvement on what we have: this regression is fixed in everything which worked before. I've sent in my implementation as PR #60. It'd still make sense to copy the contents of the old surface to the new one to work around this for, e.g., Impostor Factory (which allows arbitrary window resizing and resizes from a different thread), but it's definitely not nearly as urgent.