GPUOpen-LibrariesAndSDKs / AMF

The Advanced Media Framework (AMF) SDK provides developers with optimal access to AMD devices for multimedia processing

[Bug]: Performance issue with 7900 XTX and FFmpeg based streaming server (Sunshine) #384

Closed. HakanFly closed this issue 9 months ago.

HakanFly commented 1 year ago

Hello,

Describe the bug
I don't know if my issue is related to this known issue from the driver release notes: "Video stuttering or performance drop may be observed during gameplay plus video playback with some extended display configurations on Radeon™ RX 7000 series GPUs."

I get a massive performance drop in-game as soon as I start a stream session. I first tried all possible configurations, then tried to document myself and understand what's going on.

I think it's related to FFmpeg with D3D11VA display capture, but it's impossible to use another method without changing the software, and in any case that is beyond my skills.

I had no problem with my (not very) old 6950 XT. AMD Link (built-in display capture?) and Steam Remote Play (VAAPI NV12 + AMF H264) have no problem either.

I was a long-time Nvidia customer and GameStream user, and it was the great open-source Moonlight/Sunshine combo that let me switch to AMD painlessly. I think it's a shame to see this regression with the latest GPU architecture, but I'm trying not to lose hope and to stay patient.

To Reproduce
Steps to reproduce the behavior:

  1. Install and use Sunshine
  2. Use Moonlight on the client side (Windows, iOS, tvOS or Linux)
  3. Resolution and bitrate don't matter
  4. Available FFmpeg tuning settings in Sunshine don't matter
  5. HEVC and H264 show the same issue
  6. Target framerate is 60 FPS; at 30 FPS, the impact is roughly halved.

FFmpeg settings tried with the AMF encoder:

  * Quality: speed / balanced / quality
  * Rate control: cqp / vbr_latency / vbr_peak / cbr
  * Usage: ultralowlatency / lowlatency / webcam / transcoding
  * Preanalysis: true / false
  * VBAQ: true / false

FFmpeg hardcoded settings (HEVC):

  * filler_data: true (same with enforce_hrd: true)
  * gops_per_idr: 1
  * header_insertion_mode: "idr"
  * qmax: 51
  * qmin: 0
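
For reference, a minimal sketch of how settings like these are typically set through the libavcodec API (illustrative only; the option names are the ones listed above and their availability depends on the FFmpeg build):

```cpp
// Illustrative sketch: configuring FFmpeg's hevc_amf encoder with the options
// listed above. AMF-specific names are set on priv_data; qmin/qmax are generic
// AVCodecContext fields. Error handling is omitted for brevity.
extern "C" {
#include <libavcodec/avcodec.h>
#include <libavutil/opt.h>
}

AVCodecContext *configure_hevc_amf(int width, int height, int fps) {
    const AVCodec *codec = avcodec_find_encoder_by_name("hevc_amf");
    AVCodecContext *ctx  = avcodec_alloc_context3(codec);

    ctx->width     = width;
    ctx->height    = height;
    ctx->time_base = AVRational{1, fps};
    ctx->qmin      = 0;
    ctx->qmax      = 51;

    av_opt_set(ctx->priv_data, "usage", "ultralowlatency", 0);
    av_opt_set(ctx->priv_data, "quality", "speed", 0);
    av_opt_set(ctx->priv_data, "rc", "cbr", 0);
    av_opt_set_int(ctx->priv_data, "filler_data", 1, 0);
    av_opt_set_int(ctx->priv_data, "enforce_hrd", 1, 0);
    av_opt_set_int(ctx->priv_data, "gops_per_idr", 1, 0);
    av_opt_set(ctx->priv_data, "header_insertion_mode", "idr", 0);

    return ctx;  // caller still sets pix_fmt / hw_frames_ctx and calls avcodec_open2()
}
```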

Setup (please complete the following information):

Thank you for any help or clarification you can give me

MikhailAMD commented 1 year ago

Can you record a short GPUVIEW log on the server (5-10 sec), ZIP it and share it for the RX 7900 XTX, and preferably for the 6950 XT as well? Can you try to set "-lowlatency true" for FFmpeg? What are your capture resolution and stream resolution? If they are different, what is used for scaling?

HakanFly commented 1 year ago

Can you record a short GPUVIEW log on the server (5-10 sec), ZIP it and share it for the RX 7900 XTX, and preferably for the 6950 XT as well?

My GPUView log (RX 7900 XTX): https://gofile.io/d/CDGQVu It was taken during a CP 2077 benchmark. I started the stream a few seconds after starting the log. I can't do the same with the 6950 XT.

Can you try to set "-lowlatency true" for FFmpeg?

Sorry, I don't know how. It may already be the case. I can't work out from the source code how to pass parameters to FFmpeg.

What are your capture resolution and stream resolution?

4K for both. But I have the same problem at lower resolutions.

MikhailAMD commented 1 year ago

Thanks for the log. The encoder runs OK, no issue there. I see Sunshine GPU operations that interfere with the game, but not drastically: render time increases from 16.6 ms to 18.3 ms. How bad is the performance drop in numbers? How do you measure it? One thing you can try: disable the overlay in Radeon Software -> Settings -> Preferences -> In-Game Overlay. Can you try Parsec?

HakanFly commented 1 year ago

How bad is the performance drop in numbers?

I usually limit my FPS to 60. In the game menu, I can see a 20% increase in GPU usage once I start a stream. With the FPS limit removed (streaming at 4K 60 FPS):

  * Spider-Man Remastered: ~50 FPS drop
  * The Last of Us Part 1: ~20 FPS drop
  * Hogwarts Legacy: ~20 FPS drop

How do you measure it?

The Adrenalin stats overlay or RTSS. And I can hear it: my GPU goes crazy every time I start a streaming session.

One thing you can try: disable the overlay in Radeon Software -> Settings -> Preferences -> In-Game Overlay.

I already do this and use RTSS when necessary.

Can you try Parsec?

I will give it a try. But I know AMD Link and Steam Remote Play don't have any impact on performance.

MikhailAMD commented 1 year ago

I saw traces of the In-Game Overlay or RSR in the GPUVIEW log (see the high-priority 3D queue). I do see effects of the Desktop Duplication API that Sunshine uses for capture, but again, they are not drastic. Can you provide not only the FPS drop values but also the FPS without Sunshine? Sunshine uses the AMF integration in FFmpeg; I don't see any problem with or influence from the encoder (see the Video Codec queue). If you want, you can try the AMF DVR sample. It can switch between DD and AMD capture.

HakanFly commented 1 year ago

Parsec has an impact on my in-game performance, but less than Sunshine does. However, it is the least usable of them all (worse than Steam Remote Play), for example with Hogwarts Legacy or Spider-Man.

  * CP 2077: ~73 FPS avg (4K native) / ~80 FPS avg (4K FSR 2.1 Quality) / ~61 FPS avg (4K native + Sunshine)
  * Spider-Man Remastered: 193 FPS (4K native) / 206 FPS (4K FSR 2.1 Quality) / 147 FPS (4K native + Sunshine) / 180 FPS (4K native + Parsec) / 193 FPS (4K native + AMD Link)
  * The Last of Us Part 1 (very random performance): ~84 FPS (4K FSR 2.2 Quality) / ~64 FPS (4K FSR 2.2 Quality + Sunshine)
  * Hogwarts Legacy (modded): 166 FPS (4K FSR 2.0 Quality) / 138 FPS (4K FSR 2.0 Quality + Sunshine) / 156 FPS (Parsec) / 166 FPS (AMD Link)

MikhailAMD commented 1 year ago

OK, thanks. In short, the losses in %:

  * CP 2077: Sunshine 16%
  * Spider-Man Remastered: Sunshine 24% / Parsec 7% / AMD Link 0%
  * The Last of Us Part 1: Sunshine 24%
  * Hogwarts Legacy: Sunshine 17% / Parsec 6% / AMD Link 0%

There are two features involved here:

  1. Color conversion using GFX shaders: Sunshine uses it, while Parsec and AMD Link don't; they submit RGBA surfaces directly to the AMF encoder.
  2. Capture: Sunshine and Parsec use the Desktop Duplication API, which adds some copy operations on GFX, while AMD Link uses the AMD DirectCapture component from AMF.

As you can see, these performance drops are due to differences in application implementations, not bugs. Based on GPUVIEW, the encoder is not a bottleneck. I think if you run the same tests on the RX 6950 XT you will see a similar situation.

HakanFly commented 1 year ago

Could the GPUView log be biased because I made the capture while my game was capped at 60 FPS, so there was no framerate drop during the log?

I'm going to make another one with the game's FPS completely uncapped.

MikhailAMD commented 1 year ago

Could be. You should record the failing case.

HakanFly commented 1 year ago

GPUView log with Spider-Man Remastered 4K native: https://gofile.io/d/21rN4t (I realize I forgot to turn off RTSS)

MikhailAMD commented 1 year ago

Yes, the effect is more severe, but the roots are the same: color conversion and capture interfere with the game. See the GPUVIEW capture with comments (annotated screenshot attached).

HakanFly commented 1 year ago

Thank you very much for all these explanations, the help is really appreciated. So there is little chance of a performance improvement, except perhaps by migrating to AMD DirectCapture. ^^'

MikhailAMD commented 1 year ago

Submitting RGBA surfaces/textures directly to the AMF encoder would help a lot. It is supported in the FFmpeg AMF integration but not used by Sunshine.
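
For what it's worth, a rough sketch of what that direct path could look like through FFmpeg's hwframe API: a D3D11 frame pool whose sw_format is BGRA instead of NV12, so captured RGBA textures go to h264_amf/hevc_amf without a shader-based conversion pass. Whether a given FFmpeg/AMF build accepts BGRA D3D11 frames here is an assumption to verify, and the helper name is made up.

```cpp
// Hedged sketch (not Sunshine's actual code): create a D3D11 hwframe context with
// a BGRA sw_format and attach it to the encoder, so RGBA textures can be submitted
// directly instead of shader-converted NV12. Error handling is minimal.
extern "C" {
#include <libavcodec/avcodec.h>
#include <libavutil/hwcontext.h>
}

static AVBufferRef *attach_bgra_d3d11_frames(AVCodecContext *enc, int w, int h) {
    AVBufferRef *device = nullptr;
    if (av_hwdevice_ctx_create(&device, AV_HWDEVICE_TYPE_D3D11VA, nullptr, nullptr, 0) < 0)
        return nullptr;

    AVBufferRef *frames = av_hwframe_ctx_alloc(device);
    auto *fctx = reinterpret_cast<AVHWFramesContext *>(frames->data);
    fctx->format            = AV_PIX_FMT_D3D11;  // GPU textures, no readback
    fctx->sw_format         = AV_PIX_FMT_BGRA;   // packed RGBA instead of NV12
    fctx->width             = w;
    fctx->height            = h;
    fctx->initial_pool_size = 4;
    if (av_hwframe_ctx_init(frames) < 0) {
        av_buffer_unref(&frames);
        av_buffer_unref(&device);
        return nullptr;
    }

    enc->pix_fmt       = AV_PIX_FMT_D3D11;
    enc->hw_frames_ctx = av_buffer_ref(frames);
    av_buffer_unref(&device);  // the frames context keeps its own device reference
    return frames;
}
```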

cgutman commented 1 year ago

Thanks for the info, Mikhail. I am one of the developers of Sunshine and I've been working on investigating this performance gap between AMD and NVIDIA GPUs in Sunshine.

For some background on Sunshine's capture and encoding design on Windows, we use two separate threads and ID3D11Device objects for encoding and capture, which lets us parallelize encoding and capture. There is an image pool of RGBA ID3D11Texture2D objects (with dimensions matching the screen) that are shared between the D3D11 devices and synchronized using the IDXGIKeyedMutex interface.

Each time a frame is due to be captured and encoded (depending on client's frame rate setting), the capture thread receives the surface from AcquireNextFrame(), acquires the keyed mutex for the shared texture we're going to capture into, copies the DWM surface into that texture, renders the cursor on top (if it's visible), and finally releases the keyed mutex. The code for that is here: https://github.com/LizardByte/Sunshine/blob/d70d084f9fbb4e0150977a89d94937418a3ccf9c/src/platform/windows/display_vram.cpp#L796-L982

On the encoder thread, we wake up when a newly updated shared texture has been pushed into the queue for encoding. When that happens, convert() is called which will acquire the keyed mutex, draw both Y and UV planes to their respective render targets (which point to a NV12 texture inside the AVFrame that we will submit to FFmpeg), and finally release the keyed mutex. After convert() returns, we pass the AVFrame into avcodec_send_frame() for encoding. The code for that is here: https://github.com/LizardByte/Sunshine/blob/d70d084f9fbb4e0150977a89d94937418a3ccf9c/src/platform/windows/display_vram.cpp#L313-L350 (the Flush() in there is unnecessary but removing it doesn't have any performance impact in my testing).
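
For readers following along, here is a condensed, illustrative sketch of the shared-texture handoff described above (not the actual Sunshine code; helper names are invented and error checks are omitted):

```cpp
// One RGBA texture from the pool, created with a keyed mutex on the capture device
// and protected by AcquireSync/ReleaseSync around the copy on each side.
#include <windows.h>
#include <d3d11.h>
#include <dxgi1_2.h>

// Capture device: create a shared pool texture and return the handle that the
// encode-side device opens with ID3D11Device::OpenSharedResource().
HANDLE create_shared_rgba_texture(ID3D11Device *capture_dev, UINT w, UINT h,
                                  ID3D11Texture2D **tex_out) {
    D3D11_TEXTURE2D_DESC desc = {};
    desc.Width            = w;
    desc.Height           = h;
    desc.MipLevels        = 1;
    desc.ArraySize        = 1;
    desc.Format           = DXGI_FORMAT_B8G8R8A8_UNORM;
    desc.SampleDesc.Count = 1;
    desc.Usage            = D3D11_USAGE_DEFAULT;
    desc.BindFlags        = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_RENDER_TARGET;
    desc.MiscFlags        = D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX;
    capture_dev->CreateTexture2D(&desc, nullptr, tex_out);

    IDXGIResource *res = nullptr;
    (*tex_out)->QueryInterface(__uuidof(IDXGIResource), (void **)&res);
    HANDLE shared = nullptr;
    res->GetSharedHandle(&shared);
    res->Release();
    return shared;
}

// Capture thread: copy the duplicated desktop frame into the shared texture while
// holding the keyed mutex, then release it for the encode thread's convert() pass.
void capture_copy(ID3D11DeviceContext *capture_ctx,
                  ID3D11Texture2D *shared_tex, ID3D11Texture2D *dwm_frame) {
    IDXGIKeyedMutex *mutex = nullptr;
    shared_tex->QueryInterface(__uuidof(IDXGIKeyedMutex), (void **)&mutex);
    mutex->AcquireSync(0, INFINITE);   // wait until the other side has released it
    capture_ctx->CopyResource(shared_tex, dwm_frame);
    // ... render the cursor on top here if it is visible ...
    mutex->ReleaseSync(0);
    mutex->Release();
}
```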

As far as I can tell, we're not doing anything totally crazy here that would explain a > 25% performance loss in-game on RDNA 3. When using an NVIDIA GeForce GTX 1080, I see only about a 5% performance loss in my tests. Can you tell if we're unknowingly doing something that has a significant performance impact on AMD hardware in particular?

In my own investigation, I spent a couple of hours today testing various things on my RX 7900 XT trying to narrow down the cause of the performance deficit. I mainly focused on convert(), since it's relatively easy to comment parts of that code out and see what happens.

For my baseline in Horizon Zero Dawn at 4K Ultra settings, I got 93 FPS in-game when Sunshine was not running. The game frame rate dropped to 68 FPS when I started streaming with Sunshine without any code changes. When I commented out both Draw() calls, my frame rate increased to 74 FPS, so clearly this codepath is having some adverse performance impact. Interestingly though, if I only comment out the second Draw() call, the performance drop to 68 FPS still occurs. Furthermore, even modifying the convert_Y_ps pixel shader to always return 0.0f still resulted in a drop to 68 FPS. It seems that doing any Draw() calls at all in convert() is sufficient to cause a large performance drop, even if the shader is doing nothing.

To summarize my perf testing:

  * Baseline - 93 FPS
  * Sunshine (unmodified) - 68 FPS
  * Sunshine with one draw call in convert() - 68 FPS
  * Sunshine with one draw call in convert() and no-op pixel shader - 68 FPS
  * Sunshine with no draw calls in convert() - 74 FPS

This behavior seems very strange to me. The overhead of that single Draw() call in convert() on the RX 7900 XT is greater than the entire overhead of all of Sunshine's capture and encoding on the NVIDIA GeForce GTX 1080 in my other test machine.

Submitting RGBA surfaces/textures directly to the AMF encoder would help a lot. It is supported in the FFmpeg AMF integration but not used by Sunshine.

I can try passing RGBA textures directly to FFmpeg, but it doesn't seem like it would make a huge difference based on my testing today. We still have to make at least one Draw() call in convert() to perform the aspect-ratio scaling from the desktop size to the size of the video frame (or a CopyResource() at the very least to copy into the AVFrame). If that Draw() call still results in a large perf overhead, it doesn't seem like it would improve anything (unless AMF/FFmpeg is doing a bunch of work that would be unnecessary on RGBA frames, but that seems unlikely).
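
For reference, a hedged sketch of that "CopyResource at the very least" path, assuming FFmpeg's documented D3D11 hwframe layout (for AV_PIX_FMT_D3D11 frames, data[0] holds the ID3D11Texture2D* and data[1] the texture-array slice); the function name is illustrative:

```cpp
// Pull an empty GPU frame from the encoder's hwframe pool and copy the captured
// texture into it on the GPU, instead of running the NV12 conversion shaders.
extern "C" {
#include <libavcodec/avcodec.h>
#include <libavutil/hwcontext.h>
}
#include <d3d11.h>

int copy_into_hwframe(AVCodecContext *enc, ID3D11DeviceContext *d3d_ctx,
                      ID3D11Texture2D *src, AVFrame *frame) {
    int err = av_hwframe_get_buffer(enc->hw_frames_ctx, frame, 0);
    if (err < 0)
        return err;

    auto *dst   = reinterpret_cast<ID3D11Texture2D *>(frame->data[0]);
    UINT  slice = static_cast<UINT>(reinterpret_cast<intptr_t>(frame->data[1]));

    // Whole-surface copy; when the desktop and stream dimensions differ, a real
    // implementation would still need one Draw() (or a scaling pass elsewhere).
    d3d_ctx->CopySubresourceRegion(dst, slice, 0, 0, 0, src, 0, nullptr);
    return 0;  // the frame can now go to avcodec_send_frame()
}
```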

Thanks for your time investigating this. I'd appreciate any insights you might have on our approach here and any possible reasons why it's performing much worse on RDNA 3 hardware.

MikhailAMD commented 1 year ago

@cgutman: nice to meet you. A few random notes:

  1. Submitting RGBA to the encoder directly only helps when there are no GPU-based scaling, cropping or aspect-ratio operations. If 3D shaders are used, they will interfere with the game regardless. In all AMD streaming or recording solutions, we try to do these operations on the client (except scaling down).
  2. In the GPUVIEW log above, the time of Sunshine's GPU operations is clearly visible, as well as the DD API copy operations. They do interfere with the game, dropping the framerate to the values you mentioned. The interference is on the GPU GFX queue; there is not much you can do on the CPU to improve it.
  3. Using two threads and two D3D device contexts is a common technique, but IMHO it could be achieved with one thread and one device context, as most of the operations are on the GPU.
  4. You may want to check the AMF DVR sample, which can switch between AMD DirectCapture and the DD API. In the app you can enable AMF scaling if the capture and recording resolutions are different.
  5. If you provide GPUVIEW logs for AMD and Nvidia, I can check why they work differently.
  6. One way to get less game interference is to move all shaders to the compute HW queue, using OpenCL or D3D12 Compute. But with them, interop to and from D3D11 should be carefully implemented. BTW: AMF has this when one calls surface->Convert() (see the sketch below).
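
A rough sketch of the interop route mentioned in point 6, based on the public AMF headers (AMFContext::CreateSurfaceFromDX11Native and AMFData::Convert); treat it as an outline to check against the SDK samples, not a drop-in implementation:

```cpp
// Wrap an existing D3D11 texture in an AMF surface and migrate it to OpenCL
// memory so further processing can run on the compute queue rather than GFX.
// Assumes context->InitDX11(device) and context->InitOpenCL(nullptr) were called.
#include <d3d11.h>
#include "public/common/AMFFactory.h"
#include "public/include/core/Context.h"

amf::AMFSurfacePtr wrap_and_move_to_compute(amf::AMFContextPtr context,
                                            ID3D11Texture2D *captured_rgba) {
    amf::AMFSurfacePtr surface;
    // Wrap the existing D3D11 texture without copying it.
    AMF_RESULT res = context->CreateSurfaceFromDX11Native(captured_rgba, &surface, nullptr);
    if (res != AMF_OK)
        return nullptr;

    // Move the surface to the OpenCL (compute) memory type.
    res = surface->Convert(amf::AMF_MEMORY_OPENCL);
    if (res != AMF_OK)
        return nullptr;

    return surface;
}
```
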
cgutman commented 1 year ago

Thanks for your advice.

  1. Submitting RGBA to the encoder directly only helps when there are no GPU-based scaling, cropping or aspect-ratio operations. If 3D shaders are used, they will interfere with the game regardless. In all AMD streaming or recording solutions, we try to do these operations on the client (except scaling down).
  2. In the GPUVIEW log above, the time of Sunshine's GPU operations is clearly visible, as well as the DD API copy operations. They do interfere with the game, dropping the framerate to the values you mentioned. The interference is on the GPU GFX queue; there is not much you can do on the CPU to improve it.

Does the hardware/driver/OS allow execution of GPU workloads on the 3D queue from multiple processes concurrently? The GPUView data seems to show that only one process (excluding System) can execute 3D queue jobs at a time.

That seems to be the core issue here. Even though Sunshine's shaders are very simple, it seems like dispatching any 3D work from our process is enough to completely block the game's 3D work from making progress for a significant period of time (it seems like there's some fixed context-switching overhead or something).

Using two threads and two D3D device contexts is a common technique, but IMHO it could be achieved with one thread and one device context, as most of the operations are on the GPU.

We actually used to have a single thread, but separating them increased performance because it allowed capture and color conversion to happen in parallel with encoding.

You may want to check the AMF DVR sample, which can switch between AMD DirectCapture and the DD API. In the app you can enable AMF scaling if the capture and recording resolutions are different.

I looked at the sample and the documentation, and I think supporting DirectCapture should be possible for us. I have a couple of initial questions:

One way to get less game interference is to move all shaders to the compute HW queue, using OpenCL or D3D12 Compute. But with them, interop to and from D3D11 should be carefully implemented. BTW: AMF has this when one calls surface->Convert().

Does DX11 DirectCompute go to the Compute HW queue too? I think that would be a smaller change than OpenCL or D3D12, since we're already using D3D11 today.

Could our compute HW queue work be executed concurrently with 3D queue work from the game?

MikhailAMD commented 1 year ago
cgutman commented 1 year ago

Thanks for the advice regarding AMF. Per your earlier suggestion, I have measured overhead and collected some GPUView traces on several test GPUs for you to take a look at.

Sunshine appears to have roughly double the performance overhead when running on an RX 7900 XT compared to an RTX 4080 or Arc A770. Hopefully you can see something in these traces that explains the cause of the additional overhead on AMD hardware compared to other GPU vendors.

Game and Settings: A Plague Tale: Requiem, 2560x1440, Vsync Off, Ultra (No RT)

Intel Arc A770:

  * Driver - 31.0.101.4255
  * Baseline - 65 FPS
  * Sunshine - 59 FPS (90%)
  * GPUView - https://drive.google.com/file/d/14VbFcj-6iEwfW1UwMuKAniYP7pTZOela/view?usp=sharing

NVIDIA GeForce RTX 4080:

  * Driver - 531.61
  * Baseline - 147 FPS
  * Sunshine - 131 FPS (89%)
  * GPUView - https://drive.google.com/file/d/1-yHkhMgy144OqdRc7o7AE5WViPCF6utU/view?usp=sharing

AMD Radeon RX 7900 XT:

  * Driver - 23.4.1
  * Baseline - 136 FPS
  * Sunshine - 110 FPS (80%)
  * GPUView - https://drive.google.com/file/d/1CGMWtPCzCfIqru2ScFalKzbVQUj9PIeb/view?usp=sharing

MikhailAMD commented 1 year ago

Hi, here are a few random thoughts and observations:

psyke83 commented 1 year ago

Hello,

Since nobody has yet provided logs for an RDNA2 card, I'm sharing two captures of Cyberpunk 2077's built-in benchmark. Sunshine is using the default FFmpeg settings for AMF (seen here: https://github.com/LizardByte/Sunshine/blob/21eb4eb6ddb76e4918ec76a77d8d6d73a587ec0d/src/config.cpp#L351-L361), capturing at 4K 60 FPS in both cases. The game is running at a mixed Medium-High preset with FSR 2.1 Ultra Performance (as this card is not really suitable for 4K, but I wanted to share logs at the same resolution). Additionally, the host PC runs at 4K via VSR, as the connected monitor is only 1080p capable.

System: Windows 11 22H2, Ryzen 7 2700X, 24GB RAM, MSI RX 6600 MECH 2X 8G

AMD Radeon RX 6600 @ 4K:

  * Driver - 23.4.2
  * Baseline - 82 FPS
  * Sunshine - 72 FPS (87%)
  * GPUView - https://drive.google.com/file/d/1SlwSZ7finlN9WfOJYrEr3rl5R6DjsrnA/view?usp=share_link

AMD Radeon RX 6600 @ 1440p:

  * Driver - 23.4.2
  * Baseline - 117 FPS
  * Sunshine - 109 FPS (93%)
  * GPUView - https://drive.google.com/file/d/1aGmWu40yak1XOS2xs9qYyKzTeZiZPho6/view?usp=share_link

I normally run games at 1440p, and the typical performance reduction at this resolution seems to be about 6-7% across multiple games. Running at 4K almost doubles the performance drop, but I'm not sure if VSR's scaling overhead influences this result.

MikhailAMD commented 1 year ago

In the 1440p log, the game does FSR and presents at 18 ms (55 fps), and probably renders at 2x that speed - 110 fps. The game's jobs in the GFX queue have gaps between frames, which explains the smaller impact from the Sunshine-submitted GFX jobs - likely the color conversion. In the 4K log, the game also does FSR and presents at 18 ms (55 fps); I'm not sure of the actual render rate. But the GFX queue is almost full, so adding color conversion jobs from Sunshine has a bigger impact on the game.

cgutman commented 1 year ago

I performed some additional performance testing with the new 31.0.22000.11008 (23.20.00.11) beta driver that I received through Windows Update (Insider Preview channel). This driver supports Hardware Accelerated GPU Scheduling, so I tested with HAGS on and off. I noted significant reductions in Sunshine overhead with HAGS enabled.

Game, Settings, and GPU: AMD Radeon RX 7900 XT, 31.0.22000.11008 (23.20.00.11) driver, Horizon Zero Dawn, 3840x2160, Vsync Off, Ultimate Quality

HAGS Off:

  * Baseline - 93 FPS - https://drive.google.com/file/d/1lxONblTLY9RGCdI6g4j3Xm78cRnW0odv/view?usp=sharing
  * Sunshine - 76 FPS (81.7% of baseline) - https://drive.google.com/file/d/1dHDt70zY_ejhjfLscLTBoZJ205VQ07Pg/view?usp=sharing

HAGS On:

  * Baseline - 93 FPS - https://drive.google.com/file/d/1D5ZuIXDSurZE5Z4fufl0mMrBTlWnrMIO/view?usp=sharing
  * Sunshine - 88 FPS (94.6% of baseline) - https://drive.google.com/file/d/1tEcbIQPEC0WIBqKKavc1_cS-i5BFWbVh/view?usp=sharing

With the new beta driver, HAGS off overhead looks exactly the same as my previous tests back in April. However, with HAGS on, performance with Sunshine is significantly improved.

Is there a timeline for when Hardware Accelerated GPU Scheduling is planned to be available in the production drivers (or at least a driver available for download outside of Windows Update)?

MikhailAMD commented 1 year ago

Hard to tell why. The HAGS log doesn't show some details of the GPU jobs. It could be that the DWM jobs have higher priority and are better synchronized with the gaps between game frames, or that there is less CPU overhead on job submissions. It also looks like the DWM jobs are shorter. Unfortunately, I cannot comment on the readiness of HAGS.

Smoukus commented 1 year ago

Any progress on this front? It still isn't running optimally on the 6000 series and above.

HakanFly commented 1 year ago

@Smoukus Enabling HAGS gives excellent results with UWP preview drivers. And it's a workaround that suits me just fine until the drivers and/or open source projects are quietly updated.

Smoukus commented 1 year ago

@Smoukus Enabling HAGS gives excellent results with UWP preview drivers. And it's a workaround that suits me just fine until the drivers and/or open source projects are quietly updated.

Sadly I don't use UWP drivers, and therefore don't have the option to enable/disable HAGS.

cgutman commented 9 months ago

Hardware Accelerated GPU Scheduling is now officially supported in the AMD WHQL drivers as of today's 23.12.1 release, so I think we can close out this issue report.

Thanks to everyone at AMD who worked to get HAGS support across the finish line!

HakanFly commented 9 months ago

I agree with @cgutman. Thank you all for your contributions and clarifications!