OpenVisualCloud / SVT-HEVC

SVT HEVC encoder. Scalable Video Technology (SVT) is a software-based video coding technology that is highly optimized for Intel® Xeon® processors. Using the open source SVT-HEVC encoder, it is possible to spread video encoding processing across multiple Intel® Xeon® processors to achieve a real advantage of processing efficiency.
Other
519 stars 169 forks source link

Fixed the encoding crashing issue for some legacy CPUs. #362

Closed Austin-Hu closed 5 years ago

Austin-Hu commented 5 years ago

By checking whether the processor supports XGETBV instruction or not, beforeing invoking it to check ZMM/YMM/XMM states are enabled in XCR0 register.

Signed-off-by: Austin Hu austin.hu@intel.com

1480c1 commented 5 years ago

When running lldb -- ./SvtHevcEncApp -i ../akiyo_cif.y4m

Encoding          Process 1509 stopped
* thread #3, name = 'SvtHevcEncApp', stop reason = signal SIGILL: illegal instruction operand
    frame #0: 0x00007ffff79d7175 libSvtHevcEnc.so.1`IntraModeVerticalChroma_AVX2_INTRIN [inlined] _mm_loadu_si128(__P=0x0000555561aa8531) at emmintrin.h:703:10
   700  extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
   701  _mm_loadu_si128 (__m128i_u const *__P)
   702  {
-> 703    return *__P;
   704  }
   705 
   706  extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
Austin-Hu commented 5 years ago

@1480c1 , according to IntrinsicGuide, _mm_loadu_si128 should be an SSE2 intrinsic which should be supported on your X3430 platform.

So could you help check:

  1. whether mm_loadu_si128 is also invoked or not, before IntraModeVerticalChroma_AVX2_INTRIN() is triggered?
  2. just dereference the pointer of "refSamples + topOffset" directly before mm_loadu_si128 in IntraModeVerticalChroma_AVX2_INTRIN()?

Thanks!

1480c1 commented 5 years ago
  1. whether mm_loadu_si128 is also invoked or not, before IntraModeVerticalChroma_AVX2_INTRIN() is triggered?

It seems yes. It ran in IntraModeHorizontalLuma_SSE2_INTRIN and CombinedAveraging32xMSAD_SSE2_INTRIN

2. just dereference the pointer of "refSamples + topOffset" directly before mm_loadu_si128 in IntraModeVerticalChroma_AVX2_INTRIN()?

Can you elaborate on what do you mean by this?

Austin-Hu commented 5 years ago
  1. whether mm_loadu_si128 is also invoked or not, before IntraModeVerticalChroma_AVX2_INTRIN() is triggered?

It seems yes. It ran in IntraModeHorizontalLuma_SSE2_INTRIN and CombinedAveraging32xMSAD_SSE2_INTRIN

It's weird, because the intrinsic can execute in those functions you mentioned successfully before the crashing point...

Because we don't have such CPU to debug in hand, or we'd try to find one in some day. And could you help debug this issue on your X3430 platform, if you have more bandwidth, @1480c1 ? For example, try to load refSamples (without topOffset added) with _mm_loadu_si128, or try to load with other intrinsic (such as _mm_loadu_si64) to see any difference. Thanks!

  1. just dereference the pointer of "refSamples + topOffset" directly before mm_loadu_si128 in IntraModeVerticalChroma_AVX2_INTRIN()?

Can you elaborate on what do you mean by this?

I wanted you to just print the value of *(refSamples + topOffset) directly, but ignore it right now due to your verification for the 1st question.

1480c1 commented 5 years ago

Because we don't have such CPU to debug in hand, or we'd try to find one in some day. And could you help debug this issue on your X3430 platform, if you have more bandwidth, @1480c1 ? For example, try to load refSamples (without topOffset added) with _mm_loadu_si128, or try to load with other intrinsic (such as _mm_loadu_si64) to see any difference. Thanks!

Sorry for not seeing this, since it's an edit, it didn't notify anything.

I can try, however, I'm not proficient in understanding simd instructions atm.

Would it be preferable to have a way for you to ssh into it?

Austin-Hu commented 5 years ago

Because we don't have such CPU to debug in hand, or we'd try to find one in some day. And could you help debug this issue on your X3430 platform, if you have more bandwidth, @1480c1 ? For example, try to load refSamples (without topOffset added) with _mm_loadu_si128, or try to load with other intrinsic (such as _mm_loadu_si64) to see any difference. Thanks!

Sorry for not seeing this, since it's an edit, it didn't notify anything.

I can try, however, I'm not proficient in understanding simd instructions atm.

Would it be preferable to have a way for you to ssh into it?

@1480c1 , sorry, I should have added my comments by appending a new one, rather than editting.

I'm not proficient but I'd like to try to after completing my current tasks. Could you share me your SSH info, and are you using Linux or Windows? Thanks!

1480c1 commented 5 years ago

I can setup an ssh account on the machine, it's running Ubuntu 19.04. currently about to have an exam, so will not be able to do it immediately

Austin-Hu commented 5 years ago

I can setup an ssh account on the machine, it's running Ubuntu 19.04. currently about to have an exam, so will not be able to do it immediately

OK, let me try to log in it.

1480c1 commented 5 years ago

I sent you an email to the one in the commit with the information

Austin-Hu commented 5 years ago

Hi @1480c1 ,

As for the latter "_mm_loadu_si128" crashing issue, I think its root cause is that, in the implementation of some (or most) kernel threads, using AVX and SSE intrinsics were mixed together, such as function IntraModeVerticalChroma_AVX2_INTRIN(). So with your X3430 CPU without AVX intrinsic, the IntraVerticalChroma_funcPtrArray function pointer array would point to the IntraModeVerticalChroma_AVX2_INTRIN() code with PREAVX2_MASK, and would go through the section using AVX intrinsic (the 256bit version, or the 128bit one compiled with -mavx2 gcc option) which it shouldn't have gone.

The way to address this issue, is to separate the SSE/AVX intrinsic usings into different files, compiled with corresponding gcc x86 options. And the most ideal way I was considering is to use the "CPU dispatch" mechanism mentioned in Section 13.5 of "Optimizing Software in C++". But it would not be feasible for the SVT encoder project, because the function pointer of most kernel threads changes during runtime. For example, using either SSE or AVX in IntraModeVerticalChroma_AVX2_INTRIN() depends on the Chroma PU size, rather than being set during encoder initialization stage.

So I submitted an experimental commit, to prove that the file using SSE and being compiled with -mavx2 would cause the crashing issue in your CPU withough AVX supported. And with the commit applied, the "Illegal instruction" issue wouldn't happen in IntraModeVerticalChroma_AVX2_INTRIN(), but it occurred again in another following function IntraModeDCChroma_AVX2_INTRIN().

In order to fully resolve this issue, we need too much similar seperation (physical effort actually) for all the implementations of kernel threads which mixed different intrinsics, not only for IntraModeVerticalChroma_AVX2_INTRIN(). So let's set it as low priority and refine those codes in the future to support legacy CPUs.

Thanks!

Austin-Hu commented 5 years ago

BTW, I suggest to review and merge the 1st commit for both SVT-HEVC and SVT-VP9, as it's a reasonable fix. @1480c1 @tianjunwork @lijing0010

Austin-Hu commented 5 years ago

To merge the PR #376 which is a correct fix with single commit for this issue.

As for the experimental commit, which can be referenced instead here, set it as low priority.