Intel-Media-SDK / MediaSDK

The Intel® Media SDK
MIT License

MFXVideoENCODE_Query parameter change will cause GPU hang later #1864

Closed guoyejun closed 4 years ago

guoyejun commented 4 years ago

Hi,

First, the behavior of MFXVideoENCODE_Query does not match the function name. It makes sense that a query does not change anything. But it looks like MFXVideoENCODE_Query does change something within msdk.

As discussed in https://github.com/Intel-Media-SDK/MediaSDK/issues/1817, multi-frame encode (MFE) and ROI encode conflict and cause a GPU hang. But it turns out that the GPU also hangs later once the mfxVideoParam.ExtParam passed to MFXVideoENCODE_Query contains both MFE and ROI buffers; see the code at https://github.com/guoyejun/ffmpeg/commit/534712d3376c9557734b877f1c46a84743aaffbf. This patch just adds the ROI structure to the parameter (there is no need to enable ROI at encode time; in other words, mfxEncodeCtrl.NumExtParam is still zero) and reproduces a GPU hang.

I think that a change to the query parameters should not cause a subsequent GPU hang.

guoyejun commented 4 years ago

@DenWolf any plan when this issue can be fixed? thanks.

DenWolf commented 4 years ago

Hi @guoyejun ,

About the first part of your question:

But it looks like MFXVideoENCODE_Query does change something within msdk.

According to the Media SDK documentation, MFXVideoENCODE_Query can change some parameters if they are incompatible with other parameters; you can find it here: https://github.com/Intel-Media-SDK/MediaSDK/blob/master/doc/mediasdk-man.md#MFXVideoENCODE_Query

About observed GPU hang - could you please provide command line for reproducing the issue (it will be great if it's reproducible by Media SDK samples) ? And what platform do you use? This information will be very helpful for analyzing the issue first and reproduce/debug it further.

Also, I have a question about the priority - what's the impact of this issue and how urgent it is? I ask because we have public holidays soon (Jan 1-8), and we also need to set the correct priority for this task. Of course I will try to look at it in the last working days of this year based on your information, but if the issue turns out to be hard to analyze or resolve quickly and needs deep development investigation, I think we can start a full investigation no earlier than in 2 weeks (after the holidays).

Any information from you will be very helpful :)

Best regards, Denis

guoyejun commented 4 years ago

Thanks @DenWolf for your quick reply.

Hi @guoyejun ,

About the first part of your question:

But it looks like MFXVideoENCODE_Query does change something within msdk.

According to the Media SDK documentation, MFXVideoENCODE_Query can change some parameters if they are incompatible with other parameters; you can find it here: https://github.com/Intel-Media-SDK/MediaSDK/blob/master/doc/mediasdk-man.md#MFXVideoENCODE_Query

I see. I meant that the name 'query' does not suggest that the function changes anything. It would be more logical if the function were named something like query_and_fix. I understand it's hard to change a public API; I just mention it in case you plan to make a change.

About observed GPU hang - could you please provide command line for reproducing the issue (it will be great if it's reproducible by Media SDK samples) ? And what platform do you use?

I'm using Skylake + Ubuntu 16.04.

I just quickly looked at the msdk samples, but did not find where to set mfxVideoParam.ExtParam in pipeline_encode.cpp.

The key change to reproduce the issue is to add both MFE and ROI into mfxVideoParam.ExtParam. The key code looks like:

            mfxExtEncoderROI extroi;
            extroi.Header.BufferId = MFX_EXTBUFF_ENCODER_ROI;
            extroi.Header.BufferSz = sizeof(extroi);
            extroi.ROIMode = MFX_ROI_MODE_QP_DELTA;
            extroi.NumROI = 256;
            // due to the requirement of msdk, we must set non_empty rect for query
            for (int i = 0; i < sizeof(extroi.ROI)/sizeof(extroi.ROI[0]); ++i) {
                extroi.ROI[i].Right = 16;
                extroi.ROI[i].Bottom = 16;
            }

            mfxExtMultiFrameParam   extmfp;
            extmfp.Header.BufferId     = MFX_EXTBUFF_MULTI_FRAME_PARAM;
            extmfp.Header.BufferSz     = sizeof(extmfp);
            extmfp.MFMode = MFX_MF_AUTO;

Hope it helps. I also give the detailed steps to reproduce the issue with ffmpeg below.

This information will be very helpful for analyzing the issue first and reproduce/debug it further.

Also, I have a question about the priority - what's the impact of this issue and how urgent it is?

I'm enabling ROI encoding in ffmpeg for the qsv path. Without the fix, I just did a workaround in ffmpeg, but it was not accepted by the community. So ROI encoding cannot be enabled for the qsv path in ffmpeg if this is not fixed.

To reproduce the issue with ffmpeg:

            git clone https://github.com/guoyejun/ffmpeg.git
            cd ffmpeg
            git checkout -b qissue qsvqueryissue
            mkdir build
            cd build
            ../configure --enable-libmfx
            make
            ./ffmpeg -s 352x288 -i input352x288.yuv -c:v h264_qsv -y qsv.h264

We'll see a GPU hang.

If we just revert https://github.com/guoyejun/ffmpeg/commit/0f510765852fc40a838d1ad462522cd79fcd0899, there is no GPU hang.
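
For readers unfamiliar with how extended buffers reach the encoder, here is a minimal sketch of attaching two buffers through an ExtParam-style pointer array, as the snippet earlier in this comment does. It does not use the Media SDK headers: `ExtBufferHeader`, `RoiBuffer`, `MultiFrameBuffer`, `VideoParam`, `EncoderConfig`, `attach_roi_and_mfe`, and the two ID constants are all hypothetical stand-ins that only mirror the `Header.BufferId` / `Header.BufferSz` / `ExtParam` / `NumExtParam` layout. Value-initializing the buffers with `{}` is an assumption about good practice, not a statement about what the linked patch does.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; the real types live in the Media SDK headers.
struct ExtBufferHeader {
    std::uint32_t BufferId;
    std::uint32_t BufferSz;
};

constexpr std::uint32_t kRoiId = 1;  // stand-in for MFX_EXTBUFF_ENCODER_ROI
constexpr std::uint32_t kMfeId = 2;  // stand-in for MFX_EXTBUFF_MULTI_FRAME_PARAM

struct RoiBuffer {
    ExtBufferHeader Header;
    std::uint16_t NumROI;
};

struct MultiFrameBuffer {
    ExtBufferHeader Header;
    std::uint16_t MFMode;
};

struct VideoParam {
    ExtBufferHeader** ExtParam;
    std::uint16_t NumExtParam;
};

// Owns the buffers and the pointer array so that both outlive the
// VideoParam view that points into them.
struct EncoderConfig {
    RoiBuffer roi{};                      // zero-initialized (assumption)
    MultiFrameBuffer mfe{};               // zero-initialized (assumption)
    std::vector<ExtBufferHeader*> ext;
    VideoParam par{};
};

// Fills in both headers and publishes them through ExtParam/NumExtParam.
void attach_roi_and_mfe(EncoderConfig& cfg) {
    cfg.roi.Header.BufferId = kRoiId;
    cfg.roi.Header.BufferSz = static_cast<std::uint32_t>(sizeof(RoiBuffer));
    cfg.mfe.Header.BufferId = kMfeId;
    cfg.mfe.Header.BufferSz = static_cast<std::uint32_t>(sizeof(MultiFrameBuffer));

    cfg.ext = {&cfg.roi.Header, &cfg.mfe.Header};
    cfg.par.ExtParam = cfg.ext.data();
    cfg.par.NumExtParam = static_cast<std::uint16_t>(cfg.ext.size());
}
```

In this model, a query-style call would receive `cfg.par` and see both buffers, even though nothing is attached per frame at encode time.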

guoyejun commented 4 years ago

Hi, could you reproduce this issue? Thanks.

dmitryermilov commented 4 years ago

@DenWolf , @pgribov , please reply!

DenWolf commented 4 years ago

Hi @guoyejun , @dmitryermilov , we've returned to this task after the holidays; @pgribov is now working on it and trying to reproduce it. As far as I know, @pgribov ran into a problem building the custom ffmpeg (the repro tool). If he is still unable to resolve the repro issues by himself, we will contact @guoyejun with our questions.

pgribov commented 4 years ago

Hi @guoyejun , sorry for the late response. The issue was reproduced, and we are exploring ways to solve this problem.

guoyejun commented 4 years ago

thanks, and do we have an estimate when this will be solved? thanks.

guoyejun commented 4 years ago

any estimate when this will be solved? thanks

DenWolf commented 4 years ago

Hi @guoyejun ,

Sorry for the late response - our team was extremely busy last month. @pgribov and I had several tasks in parallel, which slowed down the investigation of this issue.

But we have an interesting update: we found that the issue is not on the Media SDK side; it is on the driver side (https://github.com/intel/media-driver).

While preparing a simple reproducer, we found that the hang occurs only in BRC mode (it is not reproduced with CQP) with MFE+ROI enabled and, most importantly, that it is related to SingleTaskPhase mode (https://github.com/intel/media-driver/blob/master/media_driver/agnostic/common/codec/hal/codechal_encoder_base.h#L1676): when singleTaskPhase is enabled (the default), the issue is observed; when we disabled singleTaskPhase, the issue was gone.

After looking at the driver source code and collecting dumps, we suspect incorrect programming (apparently of the kernels in the ENC stage for the BRC case) that is related to the combination SingleTaskPhase+MFE+ROI+BRC (e.g. CBR).

dmitryermilov commented 4 years ago

@DenWolf , is it a bug or a "per-design" limitation? If it's a "per-design" limitation, can we disable ROI (in case of MFE) via the corresponding libva caps?

DenWolf commented 4 years ago

@dmitryermilov - we need more time for further investigation to understand the root cause of this issue and whether it is a bug or expected behavior. For now it looks like a bug to me, but I totally agree with you that if we find that this is an expected limitation, this case should definitely be disabled in CAPS.

guoyejun commented 4 years ago

Thanks.

From an MSDK user's perspective, it is weird that a query parameter could cause a GPU hang.

Is it possible for MSDK to first make a tiny change to solve this issue? For example, just ignore ROI internally when both ROI and MFE are enabled in MFXVideoENCODE_Query.
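
The "ignore ROI when MFE is also present" idea can be sketched as a small filter over the list of extended buffers. This is only an illustration under stated assumptions, not MSDK code: `ExtBufferHeader`, `kRoiId`, `kMfeId`, `has_buffer`, and `drop_roi_if_mfe` are hypothetical names, and the stand-in header type only mirrors the `BufferId` / `BufferSz` pair used by the real buffers.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the common extended-buffer header.
struct ExtBufferHeader {
    std::uint32_t BufferId;
    std::uint32_t BufferSz;
};

constexpr std::uint32_t kRoiId = 1;  // stand-in for MFX_EXTBUFF_ENCODER_ROI
constexpr std::uint32_t kMfeId = 2;  // stand-in for MFX_EXTBUFF_MULTI_FRAME_PARAM

// True if any non-null buffer in the list carries the given ID.
bool has_buffer(const std::vector<ExtBufferHeader*>& ext, std::uint32_t id) {
    return std::any_of(ext.begin(), ext.end(),
                       [id](const ExtBufferHeader* b) {
                           return b != nullptr && b->BufferId == id;
                       });
}

// Returns the list to pass on. ROI buffers are dropped only when an MFE
// buffer is attached as well; otherwise the list is returned unchanged.
std::vector<ExtBufferHeader*> drop_roi_if_mfe(std::vector<ExtBufferHeader*> ext) {
    if (has_buffer(ext, kRoiId) && has_buffer(ext, kMfeId)) {
        ext.erase(std::remove_if(ext.begin(), ext.end(),
                                 [](const ExtBufferHeader* b) {
                                     return b != nullptr && b->BufferId == kRoiId;
                                 }),
                  ext.end());
    }
    return ext;
}
```

The filter takes its argument by value so the caller's original list stays intact, which keeps the change easy to revert once the underlying driver behavior is understood.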