Intel-Media-SDK / MediaSDK

[h264] GPU hang with my ROI encoding code on skylake+ubuntu16.04 #1817

Closed: guoyejun closed this issue 4 years ago

guoyejun commented 4 years ago

Hi,

I'm trying to enable ROI encoding for the ffmpeg -> msdk (qsv) path, but I hit a GPU hang. I hope to get some help, thanks.

my system: Skylake + Ubuntu 16.04
my msdk version: e6caad2d1e00380f8cab045de62567a0a4a53a53
my iHD version: 644fc6d2bb9e6271a2b91578b1bc2f63275d184a

First, I tried msdk's sample and verified that it works:
./sample_multi_transcode -i::h265 str352x288.h265 -hw -n 1 -o::h264 ./str.roi.h264 -roi_file roi

Then I added ROI encoding code to msdk's sample (sample_encode), and it works too. See the command below (out352x288.yuv contains just 1 frame):
$ ./sample_encode h264 -i out352x288.yuv -hw -o out.h264 -w 352 -h 288
I also uploaded my change at https://github.com/guoyejun/MediaSDK/commit/20644f25e5204ae9b9760219bf12f101e20ac52d in case you are interested.

Then I made a similar change in ffmpeg to enable ROI encoding. See my change at https://github.com/guoyejun/ffmpeg/commit/4edb1516096e9b2880144d55afde2e1b2ad36f1d (the values are hard-coded for easier debugging).

The h265 encoder works correctly with this command:
./ffmpeg -s 352x288 -i out352x288.yuv -c:v hevc_qsv -y qsv.roi.h265

but the h264 encoder causes a GPU hang with this command:
./ffmpeg -s 352x288 -i out352x288.yuv -c:v h264_qsv -y qsv.roi.h264

Since msdk's ROI encoding API is not tied to a specific encoder, and my code works for h265, I would say my code is basically right. And since the GPU hang occurs with ffmpeg + my code but not with sample_encode + my code, I would guess that msdk has a limitation (or special requirement) for ROI encoding, but I did not find any hint in msdk's documentation.

I've tried enabling LIBVA_TRACE for the two cases (ffmpeg + my code vs. sample_encode + my code) and found nothing unusual when comparing the logs. I even added some hacks above libva to make the LIBVA_TRACE logs exactly the same (except for the timestamps), and still got the same issue.

I also attached the GPU hang log from /sys/kernel/debug/dri/0/i915_error_state, see i915_error_state.txt

Are there any dump/parse tools at the msdk level or the media driver level? Or is there anything else I can try? thanks.

dmitryermilov commented 4 years ago

Hi @guoyejun ,

Well, it's really pretty weird. It seems you have already covered most of the options. Let's try one more: please compare the MSDK API level logs of the sample and ffmpeg cases. For that, please use the MediaSDK tracer. Unfortunately it's not open-sourced yet, but you can use the tracer from the MediaServerStudio MediaSDK release. I've attached the latest one: tracer.zip

guoyejun commented 4 years ago

thanks @dmitryermilov. Could you let me know where to download "Intel®Media Server Studio 2018 R2 for Linux Servers", which is mentioned in the pdf file in tracer.zip? I searched for it and always ended up at https://github.com/Intel-Media-SDK/MediaSDK/releases, which I think is not the right place for the "MediaServerStudio MediaSDK release".

I did try the trace tool with my current open-source msdk, and it crashes with:
[ERROR], sts=MFX_ERR_UNSUPPORTED(-3), Init, m_mfxSession.InitEx failed at /work/media/MediaSDK/samples/sample_encode/src/pipeline_encode.cpp ...

btw, I forgot to mention yesterday: for the ffmpeg + my code case, there is no GPU hang if I just comment out one line ("configBuffers.push_back(m_roiBufferId);") in the msdk code, in file /_studio/mfx_lib/shared/src/mfx_h264_encode_vaapi.cpp, function MfxHwH264Encode::VAAPIEncoder::Execute, near line 2860:

if (task.m_numRoi)
{
    MFX_CHECK_WITH_ASSERT(MFX_ERR_NONE == SetROI(task, m_arrayVAEncROI, m_vaDisplay, m_vaContextEncode, m_roiBufferId),  MFX_ERR_DEVICE_FAILED);

    // no GPU hang if the following line is commented.
    configBuffers.push_back(m_roiBufferId);
}

dmitryermilov commented 4 years ago

You're welcome.

> where to download "Intel®Media Server Studio 2018 R2 for Linux Servers"

It was available publicly but now it's removed from Intel sites. You don't need it:)

I just double-checked the package I had shared. It works on my side.

I assume you followed the out-of-date installation steps from readme-tracer.pdf. Please follow these steps instead:

sudo mv /opt/intel/mediasdk/lib64/libmfxhw64.so.1.30 /opt/intel/mediasdk/lib64/libmfxhw64.so.1.30-real
sudo ln -s -f /path/to/libmfx-tracer.so /opt/intel/mediasdk/lib64/libmfxhw64.so.1.30
./mfx-tracer-config --default
./mfx-tracer-config core.type file
./mfx-tracer-config core.log ~/mfxtracer.log
./mfx-tracer-config core.lib /opt/intel/mediasdk/lib64/libmfxhw64.so.1.30-real

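or, instead of dumping to a file, you can output to the console by setting core.type accordingly (presumably ./mfx-tracer-config core.type console; the command as originally posted repeated "core.type file").
Please don't forget to change "/opt/intel/mediasdk" to the right path if it's different on your side.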

After that, when you run ffmpeg or an msdk sample, you'll get a log file such as ~/mfxtracer_6988.log

guoyejun commented 4 years ago

yes, it works with your steps.

I did a quick comparison of the logs and have not found the cause yet; could you find any clue? thanks. See the msdk logs of ffmpeg + my code and of msdk sample + my code:
ffmpeg.mfxtracer_6372.log
msdksample.mfxtracer_6030.log

I also enabled LIBVA_TRACE to get these logs for your reference:
ffmpeg.103640.thd-0x000018e9.txt
ffmpeg.103640.thd-0x000018e4.txt
msdksample.102449.thd-0x0000178f.txt
msdksample.102449.thd-0x0000178e.txt

btw, the LIBVA_TRACE log of ffmpeg + my code shows num_buffers is 18, while that of msdk sample + my code shows num_buffers is 20. To make the comparison a bit easier, I've made a change in ffmpeg to make num_buffers 20, see https://github.com/guoyejun/ffmpeg/commit/2b69dfa5d21bd5ec3915e7e1ebd1ca385dcb078e

dmitryermilov commented 4 years ago

It seems the mfxtracer logs are not complete. E.g. I don't see EncodeFrameAsync calls there.

Please run the following and re-collect the logs:
./mfx-tracer-config core.level full

guoyejun commented 4 years ago

thanks for the info, please see the attached logs:
1211ffmpeg.mfxtracer_27817.log
1211msdksample.mfxtracer_15189.log

dmitryermilov commented 4 years ago

Thank you @guoyejun . I looked at the logs and honestly didn't see anything obvious that could cause a GPU hang. Although there are lots of differences, I can't find anything really suspicious. Could you please try to align the MFXVideoENCODE_Init parameters in the ffmpeg qsv code with sample_encode step by step, and check when the issue goes away (at some point it should!)?
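
One way to do this alignment systematically is to dump both parameter sets as name/value pairs, diff them, and then move the differing fields over one at a time. The sketch below is only an illustration of that idea: `ParamSet`, `ParamDiff` and `DiffParams` are hypothetical helpers, not part of MSDK or ffmpeg, and how the values are extracted (for example from the mfxtracer logs) is left out.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical flat view of the encoder init parameters: field name -> value.
using ParamSet = std::map<std::string, std::string>;

// One field whose value differs between the two runs.
struct ParamDiff {
    std::string name;
    std::string sample_value;  // value seen with sample_encode
    std::string ffmpeg_value;  // value seen with ffmpeg
};

// Returns every field that is missing on one side or has a different value.
// Fields present in `sample` come first (sorted by name), followed by fields
// that only exist in `ffmpeg`.
std::vector<ParamDiff> DiffParams(const ParamSet& sample, const ParamSet& ffmpeg) {
    std::vector<ParamDiff> diffs;
    for (const auto& kv : sample) {
        const auto it = ffmpeg.find(kv.first);
        if (it == ffmpeg.end()) {
            diffs.push_back({kv.first, kv.second, "<missing>"});
        } else if (it->second != kv.second) {
            diffs.push_back({kv.first, kv.second, it->second});
        }
    }
    for (const auto& kv : ffmpeg) {
        if (sample.find(kv.first) == sample.end()) {
            diffs.push_back({kv.first, "<missing>", kv.second});
        }
    }
    return diffs;
}
```

Each reported field is a candidate to change in the ffmpeg qsv code before re-running the encode; changing one field per run makes it possible to tell which one the hang depends on.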

guoyejun commented 4 years ago

thanks, I found the reason: multi-frame encode (MFE) mode is enabled by default on the ffmpeg side.

btw, how can I query the maximum supported number of ROIs? thanks.

the document at https://github.com/Intel-Media-SDK/MediaSDK/blob/master/doc/mediasdk-man.md#mfxVideoParam says:
> Number of ROI descriptions in array. The Query function mode 2 returns maximum supported value (set it to 256 and Query will update it to maximum supported value).

I do not understand what 'mode 2' means. Is there any sample code? thanks.

I tried debugging into MFXVideoENCODE_Query, but did not find where the value is queried. The most likely place is:
mfxRes = handler == codecId2Handlers.end() ? MFX_ERR_UNSUPPORTED : (handler->second.primary.query)(session, in, out);

but I'm unable to step into it in gdb (I have built msdk in debug mode).

dmitryermilov commented 4 years ago

> thanks, I found the reason: multi-frame encode (MFE) mode is enabled by default on the ffmpeg side.

Great! We need to check whether the driver returns valid caps for MFE.

> I do not understand what 'mode 2' means. Is there any sample code? thanks.

It's in the description of MFXVideoENCODE_Query:

> This function works in either of four modes: ... If the in parameter is non-zero, the function checks the validity of the fields in the input structure. Then the function returns the corrected values in the output structure. If there is insufficient information to determine the validity or correction is impossible, the function zeroes the fields. This feature can verify whether the SDK implementation supports certain profiles, levels or bitrates.

From the code perspective:
https://github.com/Intel-Media-SDK/MediaSDK/blob/1f8456f10bdb204b0ea3067df7f5a3e8a1de407f/_studio/mfx_lib/encode_hw/h264/src/mfx_h264_encode_hw.cpp#L367
https://github.com/Intel-Media-SDK/MediaSDK/blob/9fd26ab972e9f7a2bd74e242827a26bc78330872/_studio/mfx_lib/shared/src/mfx_h264_enc_common_hw.cpp#L2305
https://github.com/Intel-Media-SDK/MediaSDK/blob/9fd26ab972e9f7a2bd74e242827a26bc78330872/_studio/mfx_lib/shared/src/mfx_h264_enc_common_hw.cpp#L4365
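
As a rough illustration of the mode 2 pattern (fill the input, call Query, read the corrected value from the output), here is a self-contained sketch. It deliberately does not use the real SDK headers: `RoiQueryParams`, `QueryFn`, `QueryMaxNumRoi` and `FakeQuery` are stand-ins invented for this example, and the limit of 8 is made up. In real code the count would be the NumROI field of an mfxExtEncoderROI buffer attached to the mfxVideoParam structures passed to MFXVideoENCODE_Query, and the callable would wrap that call.

```cpp
#include <algorithm>
#include <functional>

// Stand-in for the single field of interest (NumROI in the real SDK).
struct RoiQueryParams {
    unsigned num_roi = 0;
};

// Stand-in for a mode-2 Query call: reads `in`, writes corrected values to
// `out`. As in the SDK, a negative status is an error and a positive status
// is a warning.
using QueryFn = std::function<int(const RoiQueryParams& in, RoiQueryParams& out)>;

// Requests the documented upper bound (256) and returns whatever the query
// corrected it to. Returns 0 if the query reports an error.
unsigned QueryMaxNumRoi(const QueryFn& query) {
    RoiQueryParams in;
    in.num_roi = 256;
    RoiQueryParams out;
    if (query(in, out) < 0) {
        return 0;
    }
    return out.num_roi;
}

// Fake encoder for demonstration only: clamps the count to a made-up limit
// and returns a positive (warning-like) status when it had to correct it.
int FakeQuery(const RoiQueryParams& in, RoiQueryParams& out) {
    const unsigned kHypotheticalMax = 8;
    out.num_roi = std::min(in.num_roi, kHypotheticalMax);
    return in.num_roi > kHypotheticalMax ? 1 : 0;
}
```

Calling `QueryMaxNumRoi(FakeQuery)` yields the fake limit. The point of the shape is that a warning status from Query still leaves a usable corrected value in the output structure, so only errors are treated as failure.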

guoyejun commented 4 years ago

>> thanks, I found the reason: multi-frame encode (MFE) mode is enabled by default on the ffmpeg side.

> Great! We need to check whether the driver returns valid caps for MFE.

yes, ffmpeg only enables it by default when MFE is really supported. We need to disable it if we want to support ROI encoding.
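
Until MFE and ROI work together, a caller can guard against the combination explicitly. Below is a minimal sketch of that decision, assuming a hypothetical `EncodeRequest` description and a local `MfeMode` enum (neither is an ffmpeg or MSDK type):

```cpp
// Local stand-in for the multi-frame mode setting; not an SDK enum.
enum class MfeMode { Disabled, Enabled };

// Hypothetical summary of what the caller wants to encode.
struct EncodeRequest {
    bool mfe_supported = false;  // whether the platform reports MFE support
    int num_roi = 0;             // number of ROI regions requested
};

// MFE is only kept when it is supported and no ROI is requested, since the
// combination of both triggered the GPU hang described in this thread.
MfeMode ChooseMfeMode(const EncodeRequest& req) {
    if (req.num_roi > 0) {
        return MfeMode::Disabled;
    }
    return req.mfe_supported ? MfeMode::Enabled : MfeMode::Disabled;
}
```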

>> I do not understand what 'mode 2' means. Is there any sample code? thanks.

> It's in the description of MFXVideoENCODE_Query:

>> This function works in either of four modes: ... If the in parameter is non-zero, the function checks the validity of the fields in the input structure. Then the function returns the corrected values in the output structure. If there is insufficient information to determine the validity or correction is impossible, the function zeroes the fields. This feature can verify whether the SDK implementation supports certain profiles, levels or bitrates.

thanks, I'll keep trying. btw, what does 'mode 2' mean?

dmitryermilov commented 4 years ago

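> btw, what does 'mode 2' mean?

It's just the enumeration of the modes listed there: 1, 2, 3, ... Mode 2 is the second one, i.e. the case quoted above where the in parameter is non-zero.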

guoyejun commented 4 years ago

I found an issue with querying the maximum supported number of ROIs and created a PR with the fix, see https://github.com/Intel-Media-SDK/MediaSDK/pull/1856.

guoyejun commented 4 years ago

hi

I just want to confirm: do only the h264 and h265 hw encoders support ROI encoding, while the mjpeg, vp9 and mpeg2 encoders do not? thanks.

guoyejun commented 4 years ago

another question: do you have a plan for when the GPU hang that occurs when MFE and ROI are both enabled will be fixed? thanks.

dmitryermilov commented 4 years ago

Hi @guoyejun

> I just want to confirm: do only the h264 and h265 hw encoders support ROI encoding, while the mjpeg, vp9 and mpeg2 encoders do not? thanks.

That's right.

> another question: do you have a plan for when the GPU hang that occurs when MFE and ROI are both enabled will be fixed? thanks.

@DenWolf , I assume it's a question to you:)