KhronosGroup / OpenGL-Registry

OpenGL, OpenGL ES, and OpenGL ES-SC API and Extension Registry
678 stars 274 forks source link

EGL_KHR_partial_update for FBOs #547

Open emersion opened 1 year ago

emersion commented 1 year ago

EGL_KHR_partial_update allows tilers to more efficiently re-use a surface buffer as a render target. It assumes EGLSurface is used, and doesn't support FBOs.

It would be nice to have an equivalent extension for FBOs.

stonesthrow commented 1 year ago

To be clear, you want to specify a rectangle(s) for renderbuffers or textures that are render targets, like a scissor or similar to QCOM's: https://registry.khronos.org/OpenGL/extensions/QCOM/QCOM_tiled_rendering.txt ? And a tiler would limit its fetch and write back to those rectangles? saving some bandwidth - being the benefit here?

emersion commented 1 year ago

Yeah, I think that's an accurate description, cc @fooishbar.

stonesthrow commented 1 year ago

@pdaniell-nv , Piers, lets put this on the GLEs/EGL agenda for discussion. thx

pdaniell-nv commented 1 year ago

We discussed this briefly and we're not sure there is a strong use case for FBOs. For EGL there can be more than one context accessing the surface, so the window-level extension makes sense. But for FBOs, which are within the scope of a GL context, it seems all the information needed is already available, and there are no external factors to keep track of.

emersion commented 1 year ago

I'm using FBOs in a Wayland compositor. The FBOs are bound to an EGLImageKHR which comes from a DMA-BUF, shared with Wayland client or the kernel.

stonesthrow commented 1 year ago

Not my expertise, but I would expect that the binning process would exclude any area that doesn't have a draw, so no tiles would be fetched or written for that area. So it should be automatic- no?

cubanismo commented 1 year ago

EGL_KHR_partial_update was for communicating a damage region, or scissor effectively, between an EGL surface and the thing where that surface was presented. E.g., between a wayland client and a wayland compositor, or a wayland compositor and a DRM-KMS display that supported self-refresh and/or partial shadow framebuffer updates of some sort. If it helped some tiler logic, I'm pretty sure that was by accident, and I'm curious how it helped. Regardless, that doesn't sound anything like a tiler spilling optimization thing described above, which as noted, should be local and shouldn't care whether an EGLImage wrapping a dma-buf or a simple texture is being used, and should be accomplishable with existing GL mechanisms.

Are you instead trying to avoid a tiler flush/writeback between the client rendering and the compositor texturing or something? That seems like it would be way more complicated than a partial update spec unless the hardware does something like that automatically internally at the HW or firmware level, in which case it seems like we'd need more info on exactly what the tiler needs to accomplish that beyond what it has now.

Now I'm just guessing at what's going on though. Can you be more specific about what it is that goes wrong when you switch from using an EGLSurface to an FBO in the wayland compositor code?

fooishbar commented 1 year ago

E.g., between a wayland client and a wayland compositor, or a wayland compositor and a DRM-KMS display that supported self-refresh and/or partial shadow framebuffer updates of some sort.

You're thinking of EGL_EXT_swap_buffers_with_damage.

If it helped some tiler logic, I'm pretty sure that was by accident, and I'm curious how it helped.

That was absolutely the intent behind and the cause for EGL_KHR_partial_update existing, yeah. It enables some optimisations to be done at the draw-call level - particularly avoiding unnecessary tile reload/flush which is always expensive. swap_buffers_with_damage goes all the way to the winsys, but partial_update pretty much stays local to the client.

stonesthrow commented 1 year ago

I was concerned from beginning that details for partial_update, swap_with_damage, buffer age, and window system particulars would confuse the specific use case here. Hence the need to clarify.

cubanismo commented 1 year ago

You're thinking of EGL_EXT_swap_buffers_with_damage.

You're right. Now I remember constantly confusing the two even when we were working on them as well. I'm still not clear what additional API is being requested though.

emersion commented 1 year ago

I'm still not clear what additional API is being requested though.

One can port code from EGLSurface to FBOs, however one looses support for EGL_KHR_partial_update (regressing perf on tilers) in the process. This is what happened in wlroots.

What I'm requesting is an equivalent of EGL_KHR_partial_update designed for FBOs instead of EGLSurface.

stonesthrow commented 1 year ago

Your render target has the contents of the previous draw, so you know it preserved, and you don't need to account for 'n' previous frames. But as I said, I think the binning pass of a tiler will limit tile fetch and write back to just the tiles that have a draw update. Do you have some indication that areas without update are being fetched? That's a hard one to detect I would think.

emersion commented 1 year ago

Your render target has the contents of the previous draw, so you know it preserved, and you don't need to account for 'n' previous frames.

No, my render target has the contents of the n-th previous draw. I have my own swapchain implementation with triple buffering.

I'm unsure how to check whether the tiler is doing some useless fetches or not.

stonesthrow commented 1 year ago

I see. You are cycling through n-textures like a swapchain. OK. Only if a profiler shows you the actual tiles that have a draw, would you be able to detect. I am not sure that is available. I would need to check with our guys about empty bin skipping and profiler information. Maybe other tile implementations could chime in, because I'm fairly sure about the binning process and that everyone does it.

emersion commented 1 year ago

Okay, that's helpful!

I'd definitely be happy to drop this issue if Mesa drivers already Do The Right Thing™. I'm not too knowledgeable about tilers, only happened to notice the discrepancy between EGLSurface and FBOs.

cubanismo commented 1 year ago

What I'm requesting is an equivalent of EGL_KHR_partial_update designed for FBOs instead of EGLSurface.

Right, but what do you propose as an API for that? I'm unclear what information conveyance is missing. As Jeff says, there are other APIs (I'm not clear why glScissor() is insufficient, but if it is, I would assume the QCOM ext is sufficient?) to convey the concept of a partial framebuffer update. The EGL partial update extension seems mostly useful to plumb some data in before OpenGL commands trigger a readback, but when you're by definition using only GL as a client API, it doesn't seem like there needs to be any winsys-level API involved, since it doesn't have any special surface-level knowledge to add.

I could still be wrong. I've always had trouble with the partial update concept, so I'm just trying to understand what is missing at the API level Vs. implementation level. Hopefully Jeff is right and all is already well. Are you seeing perf or power degradation after switching up the code in wlroots, or just assuming things are suboptimal?

emersion commented 1 year ago

Hm, my understanding was that indeed drivers need the information ahead-of-time, and that glScissors (or extensions to batch these) aren't enough. wlroots does things like this:

glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glScissor(0, 0, 42, 42);
// draw…
glScissor(100, 100, 42, 42);
// draw again…

The second scissor here comes "too late" AFAIU.

The API could look like this:

void glSetFramebufferDamageRegion(GLenum target, EGLint *rects, EGLint n_rects);

called right after glBindFramebuffer for instance.

Are you seeing perf or power degradation after switching up the code in wlroots, or just assuming things are suboptimal?

It's tricky to tell, because wlroots has switched to FBOs a long time ago. I will try to compare wlroots and Weston, but not sure the results will be meaningful.

cubanismo commented 1 year ago

The second scissor here comes "too late" AFAIU.

OK, I can see how a multi-region command might allow larger batches, but I would have assumed with two scissors, you just get two batches of rendering corresponding to each scissored set of draws. I'm curious whether the tilers actually do more with multiple regions, or whether they just union them into a rect spanning both.

Regardless, it does seem worth validating that there is something to gain here before adding API. Either by inspection of driver code/GPU specs by the relevant IHVs, or via perf measurements.

stonesthrow commented 1 year ago

" I'm curious whether the tilers actually do more with multiple regions, or whether they just union them into a rect spanning both." When implementing partial_update in the past, I just unioned the areas.

stonesthrow commented 1 year ago

Imagination Tech and Qualcomm also think that bins that have no draw will not be fetched or drawn, but that can be invalidated with a clear. If RenderDoc or something can indicate the bins - unknown, its a very internal optimization feature.

zzag commented 1 year ago

Regardless, it does seem worth validating that there is something to gain here before adding API. Either by inspection of driver code/GPU specs by the relevant IHVs, or via perf measurements.

What can one use to benchmark this kind of thing? In kwin, I believe that we merged partial update support while taking "it improves performance" for granted.

janharaldfredriksen-arm commented 1 year ago

Imagination Tech and Qualcomm also think that bins that have no draw will not be fetched or drawn, but that can be invalidated with a clear.

Same for Arm (except for some older GPUs). We could still get some benefit from partial updates, but probably a micro-optimization at this point.

What can one use to benchmark this kind of thing?

Easiest if you could get vendor specific performance counters, I expect. In extreme cases, say updating a single pixel in a large resolution frame, you may see performance impact if an implementation fetched the full frame vs only the modified bin/area.

emersion commented 1 year ago

@jadahl has pointed me to a Mutter merge request from Erico Nunes:

I checked the impact of this with the lima driver using the hardware performance counters for number of memory reads and writes done by the GPU. There is a drop in those numbers, sometimes significant, but it seems that with mutter only a few frames benefit from this. It is not clear to me yet why other frames are not affected. If there are any hints, please let me know.