Improve GPU readback performance for OSR

magreenblatt commented 8 years ago

Original report by Tammo Hinrichs (Bitbucket: kebbyfr).

Hi,

So. Context first. I'm in the process of migrating a pretty big realtime 3D suite from Awesomium to CEF for our in-engine and general web view needs, and so far it's looking pretty good. The only real problem is the positively abysmal frame rate for offscreen renders - a topic that was mentioned here time and time again but never solved satisfactorily.

So I did some research, written in story form because it's fun and to clarify how I got here.

Case in point: This nice thing here. Renders at a perfectly steady 60fps in Chrome and other browsers, and CPU/GPU are seriously bored while showing it (i7 5820, AMD FirePro W9100, Win10 64bit). Yet, when using CEF and offscreen rendering, I only get very stuttery and unstable 20 to 30fps out of it.

To eliminate all other error sources I've written a minimal test case that just opens a windowless browser with the page above and in the OnPaint callback dumps the dirty rectangle sizes and elapsed time between calls - see attachment. And lo and behold - after some random numbers while the page is loading it "stabilizes" at 30 to 70ms between calls (dirty rect omitted for brevity, but it's 522x446)

#!JS
239, 3917,  17, 275,  35, 121,  79,  17,  94,  55,  22, 194,  41,  65,  36, 257, 117,  19, 206, 233,  34, 159,  30, 182, 354,  33,  44,  54,  63,  45,  58,  55,  62,  49,  54,  46,  51,  49,  49,  77,  40,  51,  64,  69,  50,  55,  86,  71,  37,  54,  32,  48,  50,  50,  55,  45,  50,  50,  66,  64,  56,  47,
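For readers without the attachment, a minimal sketch of that kind of measurement. `PaintTimer` is a hypothetical stand-in, not the attached test case; it deliberately does not derive from CefRenderHandler so that it compiles without CEF, and the optional `now` parameter exists only to make it testable.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical helper: call OnPaint() from the real paint callback to log
// the dirty rectangle size and the time since the previous paint.
class PaintTimer {
 public:
  using Clock = std::chrono::steady_clock;

  // Returns the milliseconds elapsed since the previous call,
  // or -1 for the very first call.
  long long OnPaint(int dirty_width, int dirty_height,
                    Clock::time_point now = Clock::now()) {
    long long delta_ms = -1;
    if (has_last_) {
      delta_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                     now - last_).count();
      deltas_ms_.push_back(delta_ms);
      std::printf("%dx%d %lld ms\n", dirty_width, dirty_height, delta_ms);
    }
    last_ = now;
    has_last_ = true;
    return delta_ms;
  }

  // All deltas recorded so far, in call order.
  const std::vector<long long>& deltas_ms() const { return deltas_ms_; }

 private:
  Clock::time_point last_{};
  bool has_last_ = false;
  std::vector<long long> deltas_ms_;
};
```

In a real client the call would sit inside the render handler's paint callback, with the width and height taken from the dirty rectangle passed in.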

So yep, that's too slow and unstable as hell. But here's the kicker (that the test case doesn't show but if you look at the render output you can see it): The page has an FPS meter on it, and it shows a perfect 60, or in fact, whatever you set in browser_settings.windowless_frame_rate.

And if I squint hard enough, the millisecond deltas in that dump kind of cluster around multiples of 17ms. So could it be that actually, Chromium itself renders at a nice 60fps but CEF randomly only gives me every second or third or fourth frame?
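The squinting can be made a bit more concrete. This is only a sketch of the arithmetic, assuming a 60 Hz frame interval of 1000/60 ms; the function names are made up for illustration.

```cpp
#include <cmath>

// Number of whole frames (at the given rate) closest to a measured delta.
inline long NearestFrameCount(double delta_ms, double fps = 60.0) {
  const double frame_ms = 1000.0 / fps;
  return std::lround(delta_ms / frame_ms);
}

// Distance in milliseconds between the delta and that whole-frame multiple.
// Small residuals across the dump would support the "every Nth frame" guess.
inline double ResidualMs(double delta_ms, double fps = 60.0) {
  const double frame_ms = 1000.0 / fps;
  return std::fabs(delta_ms - NearestFrameCount(delta_ms, fps) * frame_ms);
}
```

For example, a delta of 50 ms maps to three frames at 60 Hz and a delta of 33 ms to two.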

So, into the source (I'm using the Win64 Spotify build of CEF 3.2840.1515.g1b7ab74 plus PDBs that allow me to debug), set a few random breakpoints and try to make sense of all of it. And after some hours of digging I found this in CefCopyFrameGenerator::GenerateCopyFrame():

#!c++
    // Don't attempt to generate a frame while one is currently in-progress.
    if (frame_in_progress_)
      return;
    frame_in_progress_ = true;

Ehrm. One breakpoint and 20 seconds later I had the proof: More than two thirds of frames that come out of OnSwapCompositorFrame get thrown away because another frame is "in progress".

Now to be very clear and to avoid the usual "but GPU readback is slow" replies - I've been writing 3D engines on PCs and consoles for 15 years now, and trust me, no sensible amount of GPU readback or IPC or whatever can possibly exceed the 16 milliseconds one has got per frame - especially not for a measly 522x446 rectangle.

So what could be the holdup? What in the world could be the reason that a frame takes more than 16ms to arrive back at the CPU? (The frame_in_progress_ logic itself seems to do fine, otherwise it would just stop rendering at one point.)

And then it hit me: Latency.

Now what GPU drivers do is, they try to keep the GPU busy. The easiest way to do this is to just queue up a ton of commands before the GPU even starts rendering, and the result of this is that the GPU is easily one, two, three frames behind the CPU, and the image arrives on the screen somewhat later. Fine for noninteractive stuff and not too action oriented games, not fine for stuff that needs short reaction times, but a fact of life for realtime 3D devs (and there are ways around it, more on that below).

So. If I may make an educated guess what happens: OnSwapCompositorFrame gets called when the compositor has finished its work on the CPU. At this point all commands are in the buffer but the frame isn't actually fully rendered yet - the GPU is still doing its thing. Now InternalGenerateCopyFrame() calls cc::CopyOutputRequest::CreateRequest() which adds a readback command to the command buffer and registers a callback, and everybody goes on with their lives.

Sixteen milliseconds later. The GPU is still not finished with compositing that frame because there really was so much other stuff to do and the driver was really chill about it anyway, but: In the CPU the compositor just finished queuing up the commands for the next frame already and calls... OnSwapCompositorFrame. Which calls GenerateCopyFrame(). Which passes by the code snippet above and is like "wait, there's still a frame in flight, let's exit". And BOOM, CEF just threw away a perfectly good frame of animation.

Some time later: that first frame finally arrives at the CPU and gets handled by OnPaint(), and the whole thing starts from the beginning. The end result is Chromium rendering at full frame rate but only a stuttery version of that arriving at the client. Exactly as observed. *lights pipe*
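The sequence above can be modelled in a few lines. This is a toy simulation of the guess, not CEF code: `SimulateSingleInFlight` and its parameters are hypothetical, and the latency passed in is an assumed value, not a measurement.

```cpp
#include <cstdint>

struct SimResult {
  int produced;   // frames the compositor swapped
  int accepted;   // frames for which a readback was requested
  int dropped;    // frames skipped because a readback was still pending
};

// Models a single "frame in progress" flag: a new frame is only accepted
// once the previous readback has completed.
inline SimResult SimulateSingleInFlight(int frames,
                                        int64_t frame_interval_us,
                                        int64_t readback_latency_us) {
  SimResult r{frames, 0, 0};
  bool in_progress = false;
  int64_t done_at_us = 0;
  for (int i = 0; i < frames; ++i) {
    const int64_t now_us = i * frame_interval_us;
    if (in_progress && now_us >= done_at_us)
      in_progress = false;  // the pending readback has arrived
    if (in_progress) {
      ++r.dropped;          // same early return as in GenerateCopyFrame()
      continue;
    }
    in_progress = true;
    done_at_us = now_us + readback_latency_us;
    ++r.accepted;
  }
  return r;
}
```

With 60 frames at a 16667 µs interval and an assumed 40 ms readback latency, this model accepts 20 frames and drops 40, i.e. two thirds are thrown away. With a 10 ms latency nothing is dropped.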

Now, luckily there's a few ways to address that issue. I'll just outline them here because you're probably way faster at fixing than I would be.

The easy way (actually not a bad way): At the end of InternalGenerateCopyFrame() add a GPU flush. No idea how that looks in the Chromium gl or gpu subsystems but there should be a means to force the GL driver to flush all pending commands to the GPU and make it render them right now. Of course this makes the GPU stall now and then and degrades overall perf by a few percent but it fixes the latency for OSR applications. And one could argue that responsiveness is way more important than a few percent of rendering perf in a web browser. :)
The hard way: Embrace the fact that there can be more than one frame that has a readback pending. This probably means wiring the dirty rectangles list through the whole callback chain and replacing the frame_in_progress_ flag with a queue, so as to at least prevent several OnPaint() callbacks from running at once. Possibly it also means double or triple buffering the CPU-side image and merging the dirty rectangles of several frames. This would be the most elegant solution, with the drawback that it doesn't actually fix the latency for the user, so please add the GPU flush anyway as a setting.
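To make the "hard way" a bit more tangible, here is a rough sketch of such a queue with dirty rectangle merging. All names (`Rect`, `ReadbackQueue`, `Enqueue`, `MarkComplete`, `TakeReady`) are hypothetical and none of this is existing CEF or Chromium code; threading and the actual pixel buffers are left out entirely.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>

struct Rect { int x, y, w, h; };

// Smallest rectangle that covers both inputs.
inline Rect Union(const Rect& a, const Rect& b) {
  const int x0 = std::min(a.x, b.x);
  const int y0 = std::min(a.y, b.y);
  const int x1 = std::max(a.x + a.w, b.x + b.w);
  const int y1 = std::max(a.y + a.h, b.y + b.h);
  return Rect{x0, y0, x1 - x0, y1 - y0};
}

class ReadbackQueue {
 public:
  // Called for every compositor frame instead of dropping it.
  uint64_t Enqueue(const Rect& dirty) {
    pending_.push_back(Pending{next_id_, dirty, false});
    return next_id_++;
  }

  // Called when the readback for the given frame has arrived.
  void MarkComplete(uint64_t id) {
    for (Pending& p : pending_) {
      if (p.id == id) { p.done = true; return; }
    }
  }

  // Pops all completed frames at the head of the queue, in order, and
  // merges their dirty rectangles so that a single paint can cover them.
  bool TakeReady(Rect* merged) {
    bool any = false;
    while (!pending_.empty() && pending_.front().done) {
      const Rect& d = pending_.front().dirty;
      *merged = any ? Union(*merged, d) : d;
      any = true;
      pending_.pop_front();
    }
    return any;
  }

  std::size_t in_flight() const { return pending_.size(); }

 private:
  struct Pending { uint64_t id; Rect dirty; bool done; };
  std::deque<Pending> pending_;
  uint64_t next_id_ = 0;
};
```

Delivering only from the head of the queue keeps paints in frame order, and calling TakeReady() from a single place is one way to avoid overlapping OnPaint() calls.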

If I'm right this should fix most of CEF's performance problems with OSR. So, to quote my favourite AI: Thank you for helping us help you help us all :)

Bonus question time! Half serious because it'd mean a lot of work for everyone but would be awesome: Why transfer the image to the CPU at all if the next thing I do is reupload it to the GPU anyway? How hard would it be to add let's say an API where you either get a shared surface handle as a callback or specify your own shared surface for CEF to render into? Restricting pixel formats etc. or forcing clients to double buffer to avoid stalls would be fine. This would be a pro level "know exactly what you're doing" API. (Shouldn't be a problem under Windows with Chromium using ANGLE and thus Direct3D under the hood anyway, no idea about Linux or Mac tho)

Also, currently CEF clocks offscreen rendering with its CefBeginFrameTimer class - any plans on exposing that functionality to the user? I'd really like to let Chromium render in lockstep with our app to get guaranteed silky smooth 60fps (even with a frame of delay or two). :)

magreenblatt commented 8 years ago

The advice currently is to disable GPU and use software rendering for best frame rate / performance tradeoff.

Are you aware of issue #1006? That is likely the best general solution to GPU performance issues.

I'll leave this issue open since what you're proposing is mostly performance enhancements to the existing GPU readback implementation. PRs to implement that as a short-term fix (with issue #1006 being the preferred long-term solution) would be welcome.

magreenblatt commented 7 years ago

Original comment by Tammo Hinrichs (Bitbucket: kebbyfr).

There's one point where I disagree: These are not "performance enhancements"; it's not that some inner loop takes a millisecond too much - what we're talking about here is an actual, honest bug. As soon as Chromium updates the screen fast enough to make the GL driver queue more than one frame of commands, the above-mentioned part of the code stumbles over its own feet. Flushing the GPU after issuing the readback command sounds like a drastic solution, but it should fix things, and it only applies to the GPU accelerated OSR path anyway, so chances of breaking something that worked before are pretty slim.

But ok, if you say PRs are welcome then I'll give it a shot. Wasn't aware of #1006, and it pretty much sounds like what we'd want in the long term, too. Plus what I wrote about the frame clock above.

Btw, any workaround like disabling GPU accel won't work for us because that'd mean goodbye to, among others, WebGL which our customers expressly requested :(

magreenblatt commented 7 years ago

Original comment by Tammo Hinrichs (Bitbucket: kebbyfr).

Hey,

So I'm actually making progress of sorts, but of course the rabbit hole is deeper than anticipated.

One question about the readback in CefCopyFrameGenerator: Is there any specific reason you send a texture readback request and then have a callback chain read that texture back instead of requesting a bitmap directly? That extra step doesn't seem to make any sense and actually makes things worse (latency is applied twice, GPU mem usage, etc), and just using CopyOutputRequest::CreateBitmapRequest() instead (rendering all the texture related code unused) helps a lot with the frame rate. But of course this has a "too good to be true" vibe to it, so - anything I missed?

magreenblatt commented 7 years ago

Original comment by Tammo Hinrichs (Bitbucket: kebbyfr).

So I have a possible PR ready at https://bitbucket.org/kebbyfr/cef - only the "Create pull request" button is missing from the left-hand menu. How can I proceed?

magreenblatt commented 7 years ago

It's hard to say if CreateBitmapRequest is a viable alternative without testing it. From what I can tell it only impacts the GLRenderer::GetFramebufferPixelsAsync method (request->force_bitmap_result() will return true), which seems OK.

Did you fork the CEF repo using Bitbucket's website? If so, you should see a "Create pull request" option when you hover over the "..." button at the top left corner of the Bitbucket web interface.

magreenblatt commented 7 years ago

Original comment by Tammo Hinrichs (Bitbucket: kebbyfr).

... aaaaand sorry for being stupid. That button is in my repo, not yours. Too early in the morning I'd say :)

magreenblatt commented 7 years ago

PR link: https://bitbucket.org/chromiumembedded/cef/pull-requests/96/osr-fix-gpu-cpu-readback-performance/diff

magreenblatt commented 7 years ago

Original comment by Adrian Lis (Bitbucket: Adrian L).

I am wondering, are there any plans to merge this PR as the fix for the GPU accelerated rendering? I am aware that there are plans to fix it in a different manner in the future, but it seems solid for now? I am interested in this change since I had to give up the GPU accelerated composition due to subpar performance when using OSR.

magreenblatt commented 7 years ago

Original comment by vivien anglesio (Bitbucket: vanglesio).

Hi!

Any news about merging this?

magreenblatt commented 7 years ago

Original comment by Ben Hamrick (Bitbucket: Ben Hamrick).

This is very important to me as we are trying to run OpenGL applications in CEF. Is this going to be merged anytime soon?

magreenblatt commented 6 years ago

Done in master revision 4c795f5 (bb) and 3239 branch revision 8f2fa99 (bb).

magreenblatt commented 6 years ago

Original comment by vivien anglesio (Bitbucket: vanglesio).

Youpiiiii !!! Thanks 👏👏👏🎉🎉🎉

magreenblatt commented 6 years ago

Original comment by Marcin (Bitbucket: Marcin, GitHub: Marcin).

I downloaded the latest build CEF 3.3239.1723.g071d1c1 / Chromium 63.0.3239.132.

CefClient runs the provided example https://www.shadertoy.com/view/Msf3R8 at 60fps without offscreen rendering enabled, but still only at 30fps when offscreen rendering is enabled. So I am puzzled: has this change been applied in the latest build, or has it not improved the rendering as expected?

I used the following switches for offscreen: --off-screen-rendering-enabled --enable-gpu --url=https://www.shadertoy.com/view/Msf3R8 and the following for normal window mode: --enable-gpu --url=https://www.shadertoy.com/view/Msf3R8

I wonder whether this change has fixed anything. Do you think the 30 vs. 60 fps difference is simply due to the lack of shared surfaces (issue #1006)?

Marcin

magreenblatt commented 6 years ago

Original comment by vivien anglesio (Bitbucket: vanglesio).

Yes, same thing for me. No improvement on OSR with the latest build of CEF.

magreenblatt commented 6 years ago

Original comment by Tammo Hinrichs (Bitbucket: kebbyfr).

As the one who opened this issue and who submitted the PR: Yes, it's far from ideal and you'll need a pretty beefy machine to reach the 60fps. There are still just too many DMAs, allocs, memcpys and handoffs between threads and processes involved to get really good performance out of it (plus whatever your app needs to do to get the result back onto the screen). Nevertheless, on fast computers it did make a difference here; I must add tho that we spent a lot of time optimizing the pipeline for resource creation and update in our product. No idea how good cefclient is in this regard.

Also, the underlying Chromium still renders at its own internal frame rate which most probably doesn't match that of your renderer. That's another source of stutters and possibly reduced frame rate.

The only real solution would indeed be a new API based on shared surfaces. Best case would IMO be something like:

the client calls into CEF with a surface to render into and, very importantly, a timestamp for the frame
these calls get queued so the client can send one call per frame even when the call hasn't been handled yet
Chromium renders the frames from the queue, using the timestamps as time reference (if they make sense)
whenever a frame has finished rendering and the content was blitted into the destination surface, a callback into the client is called
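Purely to illustrate the shape of such an API, a sketch of the queue and callback flow described in the list. Everything here is hypothetical: `SurfaceHandle`, `FrameRequest` and `FrameRequestQueue` do not exist in CEF, the surface is just an opaque integer, and the actual rendering and blit are replaced by a comment.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Opaque stand-in for a platform shared surface handle (assumption).
using SurfaceHandle = std::uintptr_t;

struct FrameRequest {
  SurfaceHandle surface;  // destination surface supplied by the client
  double timestamp_s;     // client-side time reference for this frame
};

class FrameRequestQueue {
 public:
  using DoneCallback = std::function<void(const FrameRequest&)>;

  explicit FrameRequestQueue(DoneCallback on_done)
      : on_done_(std::move(on_done)) {}

  // Client side: may be called once per frame, even while earlier
  // requests are still waiting to be rendered.
  void Submit(SurfaceHandle surface, double timestamp_s) {
    queue_.push_back(FrameRequest{surface, timestamp_s});
  }

  // Renderer side stand-in: handles the oldest request and notifies
  // the client. Returns false if there was nothing to do.
  bool RenderNext() {
    if (queue_.empty())
      return false;
    FrameRequest req = queue_.front();
    queue_.pop_front();
    // Rendering at req.timestamp_s and blitting into req.surface
    // would happen here.
    on_done_(req);
    return true;
  }

  std::size_t pending() const { return queue_.size(); }

 private:
  DoneCallback on_done_;
  std::deque<FrameRequest> queue_;
};
```

The sketch says nothing about which process each half would live in; it only shows the submit, queue, render and notify order.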

What's of utmost importance there is that this should be done with the minimum amount of IPC possible. No Client->Render->GPU->Render->Client round trips please.

This puts some burden on the client as it needs to handle surfaces with multi buffering (as each surface will be "in flight" for several frames) but that's an acceptable thing to ask for I think, and that solution would keep latency to a minimum.

Now the big problem: This is a pile of metric tons of work. Even getting shared surfaces to work reliably and natively on all three platforms is something that will inevitably show up in your hair color (will CEF only support GL? What about Vulkan? And can we just give it a DXGI surface handle if we're running on Windows with ANGLE?), and don't get me started about practically rewriting Chromium's V-Sync handling to make the timing lock work. Also most of the code for this would need to reside in the Render and GPU processes, so that's a lot of modifications to Chromium itself, yay.

So basically that's why I only tackled that low hanging fruit, and I'm sorry if it still isn't as good as you would wish. Anyone up for the task who hasn't ten other things to do? :)

magreenblatt commented 6 years ago

Original comment by vivien anglesio (Bitbucket: vanglesio).

@kebbyfr, yes, you're right, and thanks for your work. I would love to help, but I don't have the skills needed for that... Maybe you could check issue #1006; it seems there is some work in progress on rendering with shared surfaces. Maybe you could help? Thanks again.

magreenblatt commented 6 years ago

Original comment by Marcin (Bitbucket: Marcin, GitHub: Marcin).

Thanks Tammo for the explanation and the work you did. I probably expected too much from that patch and was afraid it had been applied incorrectly, given how little effect it had on cefsimple.

magreenblatt commented 8 years ago

Original changes by Tammo Hinrichs (Bitbucket: kebbyfr).

edited description

magreenblatt commented 8 years ago

changed title from "Offscreen rendering performance, for real now" to "Improve GPU readback performance for OSR"

magreenblatt commented 7 years ago

set component to "Framework-HasFix"
changed kind from "bug" to "enhancement"

magreenblatt commented 6 years ago

changed state from "new" to "resolved"

Improve GPU readback performance for OSR #2046