GPUOpen-LibrariesAndSDKs / AMF

The Advanced Media Framework (AMF) SDK provides developers with optimal access to AMD devices for multimedia processing

Abnormal Encoding Latency Increase and Screen Flickering #92

Closed davidlinbird closed 5 years ago

davidlinbird commented 7 years ago

I built streaming software that uses AMF VCE for the encoding part. Initially I used Low Latency mode with the quality preset and got perfect screen quality and 16 ms average encode latency (I am sure about this). However, when I tested the program again a week after the initial test, the result changed significantly. Now the screen flickers, especially on small letters and detailed images, and the encode latency has increased to 25-30 ms. Both tests used the same settings, the same program, and the same hardware. Right now I cannot reproduce the result I got from the initial test. The screen quality of Low Latency mode and Ultra Low Latency mode is barely watchable, so I have to use Transcoding mode, which gives perfect screen quality at about 29-40 ms latency.

**My questions:**

  1. Are there ways to get rid of the flickering in Low Latency mode and Ultra Low Latency mode?
  2. What could cause a sudden increase in latency with the exact same settings, program, and hardware?
  3. I cannot reproduce the initial result. What do you suggest for reproducing my initial test result?

I am using an XFX RX 460 2GB and the resolution is 1920x1080. Now I get 22 ms with Low Latency mode and 29 ms with Transcoding mode. These are the lowest latencies I can get now.

I only changed a few settings; all the others remain at their defaults. Here are the settings I changed:

    res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_USAGE, AMF_VIDEO_ENCODER_USAGE_TRANSCODING);
    res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_TARGET_BITRATE, bitRateIn); // 25 mbps
    res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_FRAMESIZE, ::AMFConstructSize(scrnWidth, scrnHeight)); // 1920x1080
    res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_FRAMERATE, ::AMFConstructRate(frameRateIn, 1)); // 60 fps
    res = a_Encoder->Init(formatIn, scrnWidth, scrnHeight); // AMF_SURFACE_BGRA

Here is the settings header file: VideoEncoderVCE.txt

MikhailAMD commented 7 years ago

I would have to ask several questions and provide some general thoughts to help you:

  1. Flickering cannot be caused by the encoding parameters or by the encoder itself, but it can appear when changed parameters alter the overall timing and there is a pipeline issue.
  2. Flickering can mean many things and usually appears when some GPU synchronization is missing. Please provide more information about how you render and allocate/free input frames.
  3. Please provide a code snippet with the AMF context initialization (related to DX9, DX11, OpenCL, OpenGL, etc.).
  4. You initialized the encoder with BGRA. This means you are using the built-in color space converter from BGRA to NV12. It runs on the GPU and can interfere with other processes like a game.
  5. TRANSCODING usage gives maximum flexibility from a parameter point of view, but for the lowest latency you should set the QUALITY preset to SPEED.
  6. A short GPUVIEW ETL log file (3-5 sec) may help.

Let's start with these items.
davidlinbird commented 7 years ago
  1. Below is the code that shows how I initialize AMF.

    
    res = g_AMFFactory.Init();
    ::amf_increase_timer_precision();
    res = g_AMFFactory.GetFactory()->CreateContext(&a_Context);
    res = a_Context->InitDX11(m_Device); // can be DX11 device
    // component: encoder
    res = g_AMFFactory.GetFactory()->CreateComponent(a_Context, pCodec, &a_Encoder);
    
    res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_USAGE, AMF_VIDEO_ENCODER_USAGE_LOW_LATENCY);
    res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_B_PIC_PATTERN, 0);
    res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_QUALITY_PRESET, AMF_VIDEO_ENCODER_QUALITY_PRESET_SPEED);
    
    res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_TARGET_BITRATE, bitRateIn);//25m
    res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_FRAMESIZE, ::AMFConstructSize(scrnWidth, scrnHeight));
    res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_FRAMERATE, ::AMFConstructRate(frameRateIn, 1)); // 60 fps
    
    res = a_Encoder->Init(formatIn, scrnWidth, scrnHeight); // RGBA, 1920x1080
  2. Below is the code that shows how I render and allocate input frames.

    
    D3D11_TEXTURE2D_DESC FrameDesc;
    // Note: m_SharedSurf comes from the DXGI desktop duplication component.
    m_SharedSurf->GetDesc(&FrameDesc);
    
    amf::AMFSurface* amdInputSurface = nullptr;
    auto amdres = a_Context->AllocSurface(amf::AMF_MEMORY_DX11, amf::AMF_SURFACE_BGRA, FrameDesc.Width, FrameDesc.Height, &amdInputSurface);
    amdInputSurface->SetProperty(START_TIME_PROPERTY, amf_high_precision_clock());
    auto amdSurf = reinterpret_cast<ID3D11Texture2D*>(amdInputSurface->GetPlane(amf::AMF_PLANE_PACKED)->GetNative());
    m_DeviceContext->CopyResource(amdSurf, m_SharedSurf);
    amdres = a_Encoder->SubmitInput(amdInputSurface);
  3. There is no other game/app using the GPU that could interfere with the color space converter at the same time.

MikhailAMD commented 7 years ago

OK:

davidlinbird commented 7 years ago

m_SharedSurf comes from DXGI desktop duplicate component.

MikhailAMD commented 7 years ago

OK, here is a potential problem: this texture is filled in by a copy inside the DD API, and you start a texture copy on the device shared with AMF, but you do not wait until the copy is complete. So I suggest inserting context->Flush() after CopyResource(), and possibly using D3D11_QUERY_EVENT in a loop to ensure that the copy is complete before you call IDXGIOutputDuplication::ReleaseFrame().

Another thing to check: how do you create the DX11 device? Ensure that you do not use this flag: D3D11_CREATE_DEVICE_SINGLETHREADED.
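The Flush-plus-query-event pattern Mikhail describes could look roughly like this (a minimal sketch for Windows/D3D11 only; `WaitForGpuCopy` is a hypothetical helper name, and a real app would want error handling and a back-off or timeout instead of spinning forever):

```cpp
#include <windows.h>
#include <d3d11.h>

// Block on the CPU until the GPU has executed everything submitted so far
// (in particular the CopyResource() issued just before this call), so it
// is safe to call IDXGIOutputDuplication::ReleaseFrame() afterwards.
void WaitForGpuCopy(ID3D11Device* device, ID3D11DeviceContext* context)
{
    D3D11_QUERY_DESC desc = {};
    desc.Query = D3D11_QUERY_EVENT;

    ID3D11Query* query = nullptr;
    if (FAILED(device->CreateQuery(&desc, &query)))
        return;

    context->End(query);   // marker placed after the copy in the command stream
    context->Flush();      // push the marshalled commands to the GPU now

    // GetData() returns S_FALSE until the GPU reaches the marker.
    while (context->GetData(query, nullptr, 0, 0) == S_FALSE)
        Sleep(0);          // yield the CPU while waiting

    query->Release();
}
```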

davidlinbird commented 7 years ago
  1. What is m_SharedSurf? It's a pointer to ID3D11Texture2D. It stores the constructed mirror image of the desktop being captured.

  2. We do use shared handles, to share access to the surface between multiple threads. But we don't have multiple video cards involved. Neither do we use multiprocessing — just multithreading.

  3. We have two threads involved in this program — a duplication thread and a result thread. I. On the duplication thread, we poll "move" and "dirty" changes from IDXGIOutputDuplication. Then we render those changes (on the same thread) to m_SharedSurf. (This is how the desktop duplication API works — not by returning a picture but by returning only the changes.)

    II. And then we use a mutex to notify the "result" thread to take back the result. (So if the result thread is blocked for too long then we may miss some frames)

    III. On result thread, while holding the mutex, we:

           a. First allocate a surface "amdInputSurface" using AMD API.
           b. Then call "->GetPlane(amf::AMF_PLANE_PACKED)->GetNative()" to get an ID3D11Texture2D interface to that surface (called amdSurf).
           c. Preprocess the received m_SharedSurf
           d. use "ID3D11DeviceContext::CopyResource" to copy it to amdSurf.
           e. Do "ID3D11DeviceContext::Flush"
           f. Call (amd encoder component) -> SubmitInput(amdInputSurface)   // !!! we don't know whether this is synchronous or asynchronous
           g. Call amdInputSurface->Release()
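The hand-off in step II (mutex-notified, with frames dropped when the result thread is too slow) can be sketched as a single-slot "mailbox" in portable C++. This is an illustrative sketch, not the author's actual code; `FrameMailbox` and its method names are hypothetical:

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>

// Hypothetical single-slot mailbox: the duplication thread always overwrites
// the slot with the newest frame; if the result thread has not taken the
// previous one yet, the stale frame is simply dropped (matching step II).
template <typename Frame>
class FrameMailbox {
public:
    void Publish(Frame f) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (m_slot.has_value())
                ++m_dropped;          // consumer was too slow; drop stale frame
            m_slot = std::move(f);
        }
        m_cv.notify_one();            // wake the result thread
    }

    Frame WaitAndTake() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return m_slot.has_value(); });
        Frame f = std::move(*m_slot);
        m_slot.reset();
        return f;
    }

    int Dropped() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_dropped;
    }

private:
    mutable std::mutex m_mutex;
    std::condition_variable m_cv;
    std::optional<Frame> m_slot;
    int m_dropped = 0;
};
```

The result thread would call `WaitAndTake()`, then perform steps a-g on the frame it receives.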
MikhailAMD commented 7 years ago

The only reason to use shared handles is if you have two D3D11 device objects running on different threads and share a D3D11 texture between them. Please confirm. Also, please explain why you need two threads. It sounds unusual.

davidlinbird commented 7 years ago

We used the example code from Microsoft for DXGI desktop duplication and have no idea why Microsoft chose such a design. Frankly, there is a lot of poorly designed code in their examples.

davidlinbird commented 7 years ago

After getting your advice, I noticed that there might be some bugs in my streaming software causing these problems. So I developed a small AMF program that avoids those bugs. It uses the same code as my streaming program to drive the AMF encoder, but instead of taking input from DXGI, it always encodes the same pre-recorded video.

Here are some interesting things I found out. Improvements:

  1. There is no flickering on the screen, whether I use Low Latency or Transcoding. The screen looks great in both.
  2. I achieved an average latency of 13 ms in speed mode and 21 ms in quality mode.

Problems:

  1. The maximum latency does not occur when encoding the very first frame but usually occurs after the first 200 frames.
  2. Encoding NV12 input (19 ms) is slower than encoding RGBA input.
  3. The latency of Transcoding and the latency of Low Latency are the same: in both cases 13 ms in speed mode and 21 ms in quality mode.

I have changed my encoder settings based on your advice. Here are the settings I currently use.

formatIn=amf::AMF_SURFACE_BGRA
frameRateIn=60
bitRateIn=3500000
scrnWidth=1920
scrnHeight=1080

        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_PROFILE, AMF_VIDEO_ENCODER_PROFILE_HIGH);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_PROFILE_LEVEL, 52);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_FULL_RANGE_COLOR, false);    
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_FRAMERATE, ::AMFConstructRate(frameRateIn, 1));
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_RATE_CONTROL_METHOD, AMF_VIDEO_ENCODER_RATE_CONTROL_METHOD_CBR);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_RATE_CONTROL_PREANALYSIS_ENABLE, AMF_VIDEO_ENCODER_PREENCODE_DISABLED);

        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_RATE_CONTROL_SKIP_FRAME_ENABLE, false);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_MIN_QP, 18);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_MAX_QP, 51);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_TARGET_BITRATE, bitRateIn);

        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_PEAK_BITRATE, bitRateIn);

        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_DE_BLOCKING_FILTER, true);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_FILLER_DATA_ENABLE, false);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_ENFORCE_HRD, false);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_ENABLE_VBAQ, false);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_IDR_PERIOD, 120);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_VBV_BUFFER_SIZE, bitRateIn);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_INITIAL_VBV_BUFFER_FULLNESS, 64);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_MOTION_HALF_PIXEL, true);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_MOTION_QUARTERPIXEL, true);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_MAX_NUM_REFRAMES, 4);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_QUALITY_PRESET, AMF_VIDEO_ENCODER_QUALITY_PRESET_SPEED);

//      res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_TARGET_BITRATE, bitRateIn);
        res = a_Encoder->SetProperty(AMF_VIDEO_ENCODER_FRAMESIZE, ::AMFConstructSize(scrnWidth, scrnHeight));

        res = a_Encoder->Init(formatIn, scrnWidth, scrnHeight);
MikhailAMD commented 7 years ago

OK, we have some progress. Before we switch to the encoder, please note that if you have two D3D11 device objects and use a shared handle to share a texture, it is the application's responsibility to synchronize access to the texture. To make life worse, every call to the D3D11 device context is not executed immediately but marshalled to an internal thread — one per device object. At the same time, the GPU HW queue is serial. If access is not synchronized, there is no guarantee which device will submit its job first, regardless of your thread mutex. So the only way to synchronize is to wait on the CPU for GPU completion using a D3D11 query.

Now, the encoder: it is hard to believe that RGBA submission is faster than NV12, because RGBA requires color conversion to NV12. Besides parameters, it is important to properly measure latency and frame rate. If an application submits too fast, it can achieve the maximum frame rate, but latency can be large due to the internal HW queue. AMF has a SimpleEncoder sample that accurately measures both parameters. If you want to share some experiments with me, you could just modify the sample by setting the encoder parameters to your needs, check the results, and send the modified CPP to me.

The sample encodes at full speed, transcode style. If your goal is minimal latency, you should implement a one-in-one-out model. For this you would need an event signalled from the polling thread when a frame is ready, and the submission thread should wait for this signal before submission. This does not give the highest FPS, but it gives the lowest latency.

Lastly, a short ETL from GPUVIEW will give a lot of timing information.
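The one-in-one-out model described above can be sketched with a small gate object in portable C++. This is only an illustration of the pacing idea, not AMF sample code; `OneInOneOutGate` and its method names are hypothetical:

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical gate enforcing one-in-one-out: the submission thread acquires
// the gate before SubmitInput(), and the polling thread releases it after
// QueryOutput() returns a frame, so at most one frame is ever inside the
// encoder. Latency stays minimal at the cost of peak throughput.
class OneInOneOutGate {
public:
    void AcquireBeforeSubmit() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return !m_inFlight; });
        m_inFlight = true;
    }

    void ReleaseAfterOutput() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_inFlight = false;
        }
        m_cv.notify_one();            // wake the waiting submission thread
    }

    bool InFlight() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_inFlight;
    }

private:
    mutable std::mutex m_mutex;
    std::condition_variable m_cv;
    bool m_inFlight = false;
};
```

The submission thread would call `AcquireBeforeSubmit()` just before `SubmitInput()`, and the polling thread would call `ReleaseAfterOutput()` once `QueryOutput()` delivers the encoded frame.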

davidlinbird commented 7 years ago

Here are the ETLs. CaptureState&Kernel.zip NoCaptureState.etl.zip Merged.etl.zip

davidlinbird commented 7 years ago

Does the one-in-one-out model significantly lower the FPS? My goal is to reach 1080p at 40 to 60 frames per second.

MikhailAMD commented 7 years ago

I am only interested in Merged.etl. Yes, you can get 60 FPS at 1080p with one-in-one-out. From the ETL: you run the app at 30 fps, but there is plenty of headroom. Please note that encode tasks can run in parallel with the next GFX task. Encode tasks take just 8-10 ms, and rendering plus color conversion takes about 2.8 ms. Merged.zip Check the JPG in the zip with comments.

davidlinbird commented 7 years ago

I have implemented the encoder in the one-in-one-out model. Now the encoding latency is around 13 ms, including the color converter. However, there is a new problem. When we run a GPU benchmark test and the AMF encoder at the same time, the encoding latency increases significantly, to 20-40 ms. I remember you mentioned that the color converter may be impacted by other processes like games, and I think this is the problem. What do you suggest for this problem, besides avoiding RGBA submission? Or do you know of any method to capture the screen in NV12 format?

MikhailAMD commented 7 years ago

OK, a few thoughts:

  1. There is no capture to an NV12 texture, since the display works in RGB. So color conversion is unavoidable.
  2. In your setup the color converter runs on the GFX queue — the same as the benchmark app or games. The GFX queue is a shared resource, and jobs from all processes are serialized, so interference is unavoidable. The color converter job is put in the queue and has to wait to be executed.
  3. AMD hardware has several independent HW queues: GFX, Compute, VCE (encoder), and UVD (decoder). The GFX and Compute queues share compute units but can run in parallel; the decoder and encoder can run truly in parallel with any queue. Synchronization between queues still happens, though, and may affect things.
  4. Based on this, you should see in GPUVIEW that encode jobs run in parallel with GFX, and you should reach 60 FPS, though latency will depend on the GFX queue load.
  5. A GPU benchmark test is an intentionally extreme case; most games will give better results.
  6. As you saw, I left the Compute queue aside. There are certain cases where using it may improve performance. The problem is the synchronization between the GFX and Compute queues, which will depend on the GFX queue load.
  7. You can try to use the Compute queue. To switch your app, after InitDX11(device) you should call InitOpenCL(NULL). There is no guarantee, though.
  8. Please try to minimize the load on the GFX queue: you can wrap the D3D11 texture that you get from the DD API into an AMF surface and use it directly, rather than doing a copy via CopyResource. Check CreateSurfaceFromDX11Native().
  9. In the future, when Vulkan has multimedia functions and AMF supports Vulkan, applications and AMF will have more control over the use of all these queues, and performance could be improved. Stay tuned.
  10. If you could, please share information about your app with me. I keep records of AMF integrations for internal reasons and as a base for resource allocation requests. You can do it in private if you have to: find my email in the GDC 2017 presenter list.
davidlinbird commented 7 years ago

I tested VCE encoding and DEM encoding with an R9 380 in 2016, and I remember the encoding latency didn't increase but remained the same while running a benchmark test or games simultaneously. Did this issue start with VCE 3.0?

MikhailAMD commented 7 years ago

DEM was a HW feature that did not use GFX and behaved differently, but it has been retired.

MikhailAMD commented 7 years ago

Any update?

davidlinbird commented 7 years ago

Thank you, Mikhail, for all the help and follow-up on our project. I am very surprised that you still remember me and follow my project. I have given up on solving the GFX issue, since it only happens during benchmark tests, which is a rare case. I am very interested in the encoding/decoding performance of the new Vega card, especially H265 performance. Can you share any information about the encoding performance of the Vega card?

Best David


Xaymar commented 7 years ago

Vega is about twice as powerful as Polaris in H264 and about 2-2.25x as powerful as Polaris in H265/HEVC. Actual numbers will vary depending on installed HW and usage.

MikhailAMD commented 7 years ago

Take a look into white paper on Vega. It has Encoder/Decoder section: http://radeon.com/_downloads/vega-whitepaper-11.6.17.pdf

MikhailAMD commented 5 years ago

Closed as stale issue