Use realtime GPU priority to avoid stalls during high GPU usage

psyke83 commented 3 years ago

Change derived from OBS project commit: https://github.com/obsproject/obs-studio/commit/ec769ef008b748f7dfba211daec9eb203ea4bea0
See related discussion: https://obsproject.com/forum/threads/obs-studio-24-0-3-gpu-priority-fix-testing.111669/
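
For readers who don't want to open the commit, here is a minimal sketch of the general technique, not the patch in this PR. It assumes the approach is to raise the process's GPU scheduling class through `D3DKMTSetProcessSchedulingPriorityClass` (exported by gdi32.dll) and then raise the device's thread priority through `IDXGIDevice::SetGPUThreadPriority`. The names `raise_gpu_priority`, `SchedulingPriorityClass` and `kMaxGpuThreadPriority` are made up for illustration. The Windows part is guarded so the file still compiles elsewhere.

```cpp
#include <cassert>

#ifdef _WIN32
#include <windows.h>
#include <d3d11.h>
#include <dxgi.h>
#endif

// Local mirror of D3DKMT_SCHEDULINGPRIORITYCLASS (d3dkmthk.h), declared here
// so the sketch does not depend on the WDK header. Order matters: realtime is 5.
enum SchedulingPriorityClass { kIdle, kBelowNormal, kNormal, kAboveNormal, kHigh, kRealtime };

// IDXGIDevice::SetGPUThreadPriority accepts -7..7, with 0 meaning normal.
constexpr int kMaxGpuThreadPriority = 7;

#ifdef _WIN32
// Hypothetical helper: returns true only if both priority changes succeed.
bool raise_gpu_priority(ID3D11Device *device) {
    assert(device != nullptr);

    // Resolve the export at runtime instead of linking against it.
    HMODULE gdi32 = GetModuleHandleW(L"gdi32.dll");
    if (!gdi32) return false;
    using SetClassFn = LONG(WINAPI *)(HANDLE, int);  // returns an NTSTATUS
    auto set_class = reinterpret_cast<SetClassFn>(
        GetProcAddress(gdi32, "D3DKMTSetProcessSchedulingPriorityClass"));

    // A negative NTSTATUS means failure (for example, insufficient privileges).
    if (!set_class || set_class(GetCurrentProcess(), kRealtime) < 0) return false;

    IDXGIDevice *dxgi = nullptr;
    if (FAILED(device->QueryInterface(__uuidof(IDXGIDevice),
                                      reinterpret_cast<void **>(&dxgi))))
        return false;
    const HRESULT hr = dxgi->SetGPUThreadPriority(kMaxGpuThreadPriority);
    dxgi->Release();
    return SUCCEEDED(hr);
}
#endif
```

A caller would invoke `raise_gpu_priority(device)` once, right after creating the D3D11 device used for capture, and treat a `false` result as non-fatal.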

GNUDimarik commented 3 years ago

Did you test it with AMD drivers?

psyke83 commented 3 years ago

Yes. My test system is a Ryzen 2700 overclocked to 3.9 GHz all-core and an RX 570 4GB - with the latest 20.12.1 drivers - on Windows 10 20H2. I use software H264 encoding with 8 threads running with real-time priority.

The slowdown can be reproduced by running a game at ultra settings and comparing the in-game FPS to the Moonlight/encoder statistics. Without this change, if GPU load is near maximum, the encoder will throttle to 30fps or less even if the actual in-game FPS stays close to 60fps at all times. I've done extensive testing, and the problem is not caused by CPU starvation for encoding.

Example: In Resident Evil 3, if you set high settings and increase the render resolution to 200%, the title menu will run at ~58fps ingame, but encoder framerate will stay throttled to ~24fps. At more sensible settings, I will notice a sudden throttle of the encoder framerate from 60fps down to 30 during intensive scenes.

I realized this wasn't a software bottleneck because if you run a game at maximum settings windowed, it will throttle as expected, but if you alt-tab to another window, the encoder framerate will jump back up to ~60fps - so it pointed towards some kind of Windows scheduler or priority issue.

With this change applied, the encoder no longer drops frames, and stays within 1-2fps of the actual game FPS regardless of in-game quality settings. I prefer not to use the AMD HW encoder (as it has more input delay than CPU encoding), but it also has the same positive effect; unfortunately, the AMD HEVC encoding doesn't work with my card for some reason.

I've noticed the same phenomenon with Parsec (and reducing in-game quality settings is the only recourse to alleviate stutter).

GNUDimarik commented 3 years ago

Great! I implemented IDXGIOutput5 capture. Interesting ... I need to do a little work before making a pull request. I think our changes should be fine together. It should work faster, I think. I don't have a deep understanding of the Windows interfaces yet, as I have only worked on this task. I found a way to decrease GPU usage by releasing the surface, and it worked fine apart from some small delays, which is bad ... I'll test your changes with my commit today or tomorrow and merge the pull request if everything is fine. Thanks a lot for contributing!

GNUDimarik commented 3 years ago

Please rebase your changes onto the develop branch and change the target branch to develop instead of master. Thanks!

psyke83 commented 3 years ago

I'm happy to rebase, but there's no develop branch that I can see?

P.S. I noticed your IDXGIOutput5 capture branch. I had already tested the new interface as a means of alleviating stutter, but it didn't help... but in my tests I hadn't made any changes to surface releasing. I did test your branch quickly, but the enforced 20ms sleep at the end of each capture seems to result in the encoder always throttling to 30fps for a 60fps stream (with or without my realtime patch).
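
The 30fps figure matches simple arithmetic, sketched below. The model is an assumption: a new desktop frame becomes available once per display refresh, and the capture loop picks up the first frame that appears after its work and sleep are done. `effective_fps` is a hypothetical helper, not code from either branch.

```cpp
#include <cassert>
#include <cmath>

// Effective capture rate when each loop iteration is busy for `busy_ms`
// (capture work plus any fixed sleep) and frames arrive once per refresh.
double effective_fps(double refresh_hz, double busy_ms) {
    assert(refresh_hz > 0.0 && busy_ms >= 0.0);
    const double interval_ms = 1000.0 / refresh_hz;  // 16.67 ms at 60 Hz
    // Whole refresh intervals consumed per captured frame (at least one).
    const double intervals = std::fmax(1.0, std::ceil(busy_ms / interval_ms));
    return refresh_hz / intervals;
}
```

At 60 Hz the frame interval is about 16.67 ms, so a 20 ms sleep always spills into a second interval: `effective_fps(60.0, 20.0)` gives 30, while any busy time below 16.67 ms keeps the full 60.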

GNUDimarik commented 3 years ago

20ms is too low. To allow the GPU to encode, it would need to be 100-200ms. So it's not possible, but that's "feng shui" I think. At least an AMF contributor told us this:

"Thanks for the logs. They explain a lot. The common problem with CPU and with GPU encoder is that the application keeps acquired texture/surface from DD capture for a long time and do not release it. This blocks GPU other operations and DWM present and therefore the next capture. Please check AMF DVR sample and example of Desktop Duplication API capture usage: https://github.com/GPUOpen-LibrariesAndSDKs/AMF/blob/master/amf/public/src/components/DisplayCapture/DDAPISource.cpp

See the attached screenshots with explanations: CPU: "

So I decided to release the surface. I think the delay needs to be removed, and with your changes it should be fine. According to comments in the source code, loki planned to do that.
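
The ordering the AMF contributor describes can be sketched independently of D3D. The sketch assumes a hypothetical `Duplication` type with `acquire()` and `release()`; with the real Desktop Duplication API these would wrap `IDXGIOutputDuplication::AcquireNextFrame` and `IDXGIOutputDuplication::ReleaseFrame`. `capture_and_encode` and its callbacks are illustrative names, not functions from this project.

```cpp
#include <cassert>
#include <utility>

// Holds the shared desktop surface only for the duration of a fast copy,
// then releases it before the slow encode step runs.
template <typename Duplication, typename CopyFn, typename EncodeFn>
bool capture_and_encode(Duplication &dup, CopyFn copy_to_private, EncodeFn encode) {
    auto *frame = dup.acquire();  // nullptr stands in for a timeout / no new frame
    if (!frame) return false;

    auto private_copy = copy_to_private(*frame);  // e.g. a GPU-side texture copy
    dup.release();                                // surface returned to DWM here

    encode(std::move(private_copy));  // the encoder only sees the private copy
    return true;
}
```

The point of this shape is that the encoder can take as long as it needs without blocking the next DWM present, because the shared surface is already back with the system.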

psyke83 commented 3 years ago

I've seen the closed issue on that repo, and couldn't reproduce the same issue with my RX 570 card.

For example, if I run Moonlight and on the remote machine keep HWInfo64 open with the GPU section visible, then move the mouse, I'm not really seeing any lag or 100% GPU usage. The "GPU utilization" sensor stays mostly at 0%, but does jump up to 25% or so every once in a while, whereas the "D3D utilization" sensor registers ~2-5% usage while the mouse is moving over the Moonlight connection.

I do notice that the Windows moonlight-qt client gives me problems if I run it in borderless windowed mode; there will be extreme lag with the mouse when viewing the desktop, as the encoder tries to render at >100fps (even though the connected host monitor is only 60Hz). Aside from desktop lag, I notice that in games, at times the mouse seems to desync and fails to keep up with my actual movements, which almost feels like there's severe network congestion (but there really isn't any). If I make sure to run the client in the real fullscreen mode, that never happens. The fullscreen mode is bugged on my laptop, however, as it seems to bypass my system gamma settings (where the image is washed out). I was able to fix that by checking the "Disable fullscreen optimizations" executable flag.

Not sure if any of that helps you, or my problems are reproducible with a real NVIDIA card and the official GFE host, but that might be something you can check as well.

GNUDimarik commented 3 years ago

Btw, I still haven't shared the fix for Moonlight to support AMF HEVC. I will open a pull request. AMF puts the IDR info at a 7-byte offset. The fix is well tested, so I just need to open the pull request.
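
The fix itself is not shown in this thread. As a generic illustration of why a fixed offset can matter, the sketch below scans an Annex B HEVC buffer for start codes and reads each `nal_unit_type`, instead of assuming the first NAL unit is the one of interest. The function names are hypothetical. The type values (19 and 20 for IDR slices, 35 for an access unit delimiter) come from the HEVC specification. A 4-byte start code followed by a 3-byte access unit delimiter happens to take 7 bytes, but whether that is what AMF emits is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the nal_unit_type of every NAL unit found in an Annex B byte stream.
std::vector<int> hevc_nal_types(const std::uint8_t *data, std::size_t size) {
    assert(data != nullptr || size == 0);
    std::vector<int> types;
    for (std::size_t i = 0; i + 3 < size; ++i) {
        // Match 00 00 01; a 4-byte 00 00 00 01 start code ends with the same bytes.
        if (data[i] == 0 && data[i + 1] == 0 && data[i + 2] == 1) {
            // nal_unit_type is bits 1..6 of the first header byte.
            types.push_back((data[i + 3] >> 1) & 0x3F);
            i += 2;  // continue scanning from the NAL header
        }
    }
    return types;
}

// True if any NAL unit in the buffer is an IDR slice (IDR_W_RADL or IDR_N_LP).
bool hevc_contains_idr(const std::uint8_t *data, std::size_t size) {
    for (int type : hevc_nal_types(data, size))
        if (type == 19 || type == 20) return true;
    return false;
}
```

With a buffer that starts with an access unit delimiter, `hevc_contains_idr` still finds an IDR slice later in the stream, which a check that only looks at offset 0 would miss.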

GNUDimarik commented 3 years ago

@psyke83 please change your target to the develop branch of @m4rkoup's fork. It seems we had a misunderstanding. I'm working on fixing the issue where the GPU is busy when we use the HW encoder on AMD chipsets ... Just FYI, in case I confused you above.

psyke83 commented 3 years ago

Right... I was just making clear that the symptoms of the bug report you mentioned don't match the behaviour I see on my Polaris (RX 570) card - i.e., I don't see 100% GPU load or lag from mouse activity alone. With that in mind, the realtime scheduling patch is not aiming to fix the exact problem described in that report. Rather, the encoder lag I described only happens when the GPU genuinely is close to 100% usage from an intensive game that's running, etc. That bug report mentions that you're testing with an RX 5700 XT, so it might be harder to notice the issue that I described (and that my patch is supposed to solve) when running games at 60fps on a card that's a fair bit more powerful than mine.

Anyway, I've opened a new PR (as I can't change target in this PR to an external repo) - see: https://github.com/m4rkoup/openstream-server/pull/2

Thanks.

LS3solutions / openstream-server

Use realtime GPU priority to avoid stalls during high GPU usage #4