GPUOpen-LibrariesAndSDKs / AMF

The Advanced Media Framework (AMF) SDK provides developers with optimal access to AMD devices for multimedia processing

Behavior of converter & decoder components while using Windows DXGI's AcquireNextFrame in a multithreaded environment #226

Closed smourier closed 3 years ago

smourier commented 4 years ago

Hi,

I'm using the AMFVideoConverter & AMFVideoEncoderVCE_AVC components with this pseudocode:

thread 1:

while (capturing)
{
   AcquireNextFrame(1000, ...) // DXGI
   if (ok) push new frame (GPU copy)
   if (timeout) continue (keep last frame)
}

thread 2:

while (pop frame) // keep only the last
{
  CreateSurfaceFromDX11Native(frame)
  convert(rgb => nv12, change size) // submit input + query output
  encode(nv12 => h264)              // submit input + query output loop
}

All traces, asserts, and debug levels are enabled in every component (AMF, DXGI, D3D11, etc.), and none of them tells me something's wrong.

What I observe is that the convert+encode pass always finishes, but it can take seconds to complete when DXGI times out (which is normal and just means the screen has not changed).

So the converter & encoder in thread 2 seem to be tied to thread 1, and I really don't understand why. In thread 1, I just create new D3D11 GPU 2D textures, copy from the acquired frame, and push them into a list.

I have read an interesting comment in DDAPISource.cpp:

// MM AcquireNextFrame() blocks DX11 calls including calls to query encoder. Wait ourselves here

That could explain what I see, but I don't understand the statement. Is there some global/implicit lock when AcquireNextFrame is called? Where? At the driver level? Is it AMD-specific? What is the "query encoder"?

How could I avoid this? Is there a way?

PS: if I use a small timeout (e.g. 0 to 15 ms) and just run a regular acquire/convert/encode loop without any threads, it technically works fine, but CPU+GPU usage seems too high, and I'd like to use multiple threads.

Basically, I'd like to reproduce the mechanism of Microsoft's sample https://github.com/microsoft/Windows-classic-samples/tree/master/Samples/DXGIDesktopDuplication (multithreaded, with a 500 ms timeout) but encode to H264 instead of displaying.

MikhailAMD commented 4 years ago

Yes, I put this comment there. The problem is that the encoder and converter use D3D11 to submit jobs to the GPU. If AcquireNextFrame is called with a non-zero timeout value, it blocks D3D11 access for all other threads, including the converter's and encoder's SubmitInput and QueryOutput calls. The only way to avoid this is to use a zero timeout and do the pacing/timing yourself with a sleep call, as the AMF sample code does. This is OS behavior; there is not much we can do.

smourier commented 4 years ago

Hi Mikhail, thanks for this response.

Why don't we see this behavior with other kinds of DXGI/D3D11 operations, like copying the resource to present it on a swap chain, or even transferring it to the CPU?

If you test the Microsoft DXGIDesktopDuplication sample, it uses multiple threads and has no problem using a 500 ms timeout on AcquireNextFrame (and it uses minimal CPU/GPU).

Actually, if you test that sample and change 500 to 0, you'll see CPU usage increase (on my machine from around 1% to 20%; GPU stays around 5%), which is understandable.

When you say "AcquireNextFrame blocks access to D3D11 for other threads", I really don't understand precisely what you mean. Can you shed some more light on this? If it's in the OS, it seems like a huge design flaw.

roman380 commented 4 years ago

@smourier have a look at OpenSharedResource in Microsoft's sample. They copy the Desktop Duplication (DD) texture into a texture on another device through DXGI resource/texture sharing between multiple devices, so all further processing takes place on another, non-DD D3D11 device. This way DD's blocking has no impact. You would probably want to implement the same thing in your app (one device for DD and another for all AMF processing).

smourier commented 4 years ago

Hi Roman,

When I run this sample, I see only one ID3D11Device used (in the normal course of operations). They do use multiple textures and make copies, but AFAIK they're all on the same device?

Actually, in my code I also copy textures; I don't give the AMF converter/encoder the same texture that AcquireNextFrame returns, I copy it too. But that doesn't change anything.

Or maybe I don't understand what you mean?

roman380 commented 4 years ago

The code is structured in a way that is not very obvious, but I think they do use a mix of two devices (in ThreadManager.cpp and OutputManager.cpp). I think the lock interfering with your processing comes not from the use of the same textures, but rather from the device lock the DD API holds while it waits for a new frame. So I think the "better" performance of Microsoft's sample is directly related to their use of a separate presentation device.

smourier commented 4 years ago

You're right, the code is not obvious :-) However, I debugged it and, unless I'm mistaken, I see only one ID3D11Device used. I actually got the multithreaded approach working fine on Intel hardware, if I remember correctly; at least I did not see the multi-second lock on a single frame encode there. When I get some time, I may try to adapt the Microsoft sample to AMF.

MikhailAMD commented 4 years ago

I checked the sample, though I have not debugged it. My understanding is that the sample runs multiple threads. Each thread creates its own D3D device, captures a frame, and copies it, all inside that thread's loop. The copy goes to a shared texture, shared between all threads with mutex synchronization. There is also an output thread with its own D3D device and output swap chain; this thread draws the shared surface to the screen. IMHO, in the sample each D3D device is used from a single thread, unlike what @smourier or the AMF sample does. As an alternative to the solution in the AMF sample, you can follow the idea of the MSFT sample: two devices and a shared texture.

In conclusion: I investigated this problem for a while, from the AMF and driver points of view. This is D3D MSFT runtime behavior, which sits between the AMF runtime and the driver; so far I didn't find anything meaningful there. It looks like, since D3D11 is single-threaded by design, some synchronization inside the runtime was overused (or done intentionally).