OpenVisualCloud / SVT-HEVC

SVT HEVC encoder. Scalable Video Technology (SVT) is a software-based video coding technology that is highly optimized for Intel® Xeon® processors. Using the open source SVT-HEVC encoder, it is possible to spread video encoding processing across multiple Intel® Xeon® processors to achieve a real advantage of processing efficiency.
Other
519 stars 169 forks source link

Fixed the encoding random hang issue with few threads. #475

Closed Austin-Hu closed 4 years ago

Austin-Hu commented 4 years ago

This is an old issue since v1.2.0. Overall it's due to the incorrect life cycle management of the PPCS, PA RefPic, PCS objects in different kernels (through ResourceCoordination to EncDec, and PictureManger & RateControl with feedbacks from EncDec and Packetization separately).

When the random hang happens, the former process is:

  1. PPCS A which is referenced by other picture(s) B (& C), has been encoded and packetized completely. RateControl releases it after getting the feedback back from Packetization;
  2. The PPCS A object is enqueued into the head of (the empty queue of) PPCS FIFO (for higher cache hit);
  3. ResourceCoordination receives a new picture D, dequeues the PPCS object (aligned with A) from head of PPCS FIFO, and makes picture D go through the latter encoding stages;
  4. In PictureDecision, the contents (including isUsedAsReferenceFlag) of the PPCS object are updated according to parameters of picture D;
  5. Meanwhile in EncDec, PPCS A is still on-going with its latter stuff (such as SAO, padding, etc.) after encoding and feedbacking to EntropyCoding. But as the isUsedAsReferenceFlag of PPCS object belonged to A was changed, A couldn't be fed back to PictureManager.
  6. In the inputPictureQueue of PictureManger, picture B (& C) are permanently waiting for A to be available;
  7. Encoding for latter pictures is blocked (objects from FIFO queues exhaust) and hangs.

So the solution is: (Note: we need to manage the life cycle of PPCS, PA RefPic and PCS RefPic objects separately, because they're operated in pictureDecisionPaReferenceQueue of PictureDecision and referencePictureQueue of PictureManager, besides of other kernels.)

  1. In ResourceCoordination, make sure the live count of PPCS and its PA RefPic objects are 2, because they will be asynchronously released in both EncDec and RateControl kernels;
  2. In PictureDecision, increase the dependentPicturesCount of the PA RefPic object, when a PPCS is indeed referenced by other picture(s). And don't release PPCS and its PA RefPic objects at the end when trying to clean up the entries of the pictureDecisionPaReferenceQueue;
  3. In InitialRateControl, don't release PPCS and its PA RefPic objects, as they'd be used in the following kernels;
  4. In PictureManager, increase the refCount of the PCS RefPic object, when the PCS indeed references other picture(s);
  5. In EncDec, release PPCS and its PA RefPic objects which the encoded picture is referencing to, and the encoded picture's PCS RefPic object. Also release the encoded picture's PPCS and PA RefPic objects;
  6. When PictureManager receives the feedback of Packetization, release PPCS and PA RefPic objects of the encoded picture as well.

With the changes, the live count for the PPCS and PA Ref objects in each kernel stage are (numbers in brackets means the current picture is referenced by other picture(s)):

Obj | RC | PD | EncDec | RateControl PPCS: | 2 | (+1) | 1 (- 1) | 0 PA Ref: | 2 | (+1) | 1 (- 1) | 0

compared with original implementation:

Obj | RC | IRC | PD | RateControl PPCS: | 3 | 2 | 1 (+ 1 - 1) | 0 PA Ref: | 2 | 1 | 0 (+ 1 - 1) |

Signed-off-by: Austin Hu austin.hu@intel.com

Fixes #474 .

Austin-Hu commented 4 years ago

Great to see the comment in there..and good job finding this fix.

This comment pertains to a lot of the codebase, but it would be great to simplify some of the pointer dereferencing.. e.g. pictureControlSetPtr->ParentPcsPtr->refPaPcsArray[REF_LIST_0] is used in numerous places - might be simpler to read if a local variable was set and then used (e.g.) refList0 = pictureControlSetPtr->ParentPcsPtr->refPaPcsArray[REF_LIST_0]; ...

Thanks, @intelmark . I just followed the style in the kernel implementations, otherwise there'd be too many variables for such kind of dereferencing...

Austin-Hu commented 4 years ago

Completed several testing with the PR:

Full CI (No regression):

---------------------------------------------------------
Test Begin...
Total Number of Tests: 33750
Total Passed: 33689
Percentage Passed: 99.8192592593%
Time Elapsed: 3 day(s), 18 hour(s), 47 minute(s), 23 second(s)
---------------------------------------------------------

Full CI Failed Cases:

  1. asm_test: 24 (old R2R issue, not regression)
  2. buffered_test: 21
  3. defield_test: 10 (old R2R issue, not regression)
  4. multi_channel_test: 6


Fast CI (with "-lp 1" option) (No regression):

---------------------------------------------------------
Test Begin...
Total Number of Tests: 2433
Total Passed: 2431
Percentage Passed: 99.9177969585%
Time Elapsed: 13 hour(s), 3 minute(s), 33 second(s)
---------------------------------------------------------

Fast CI (with "-lp 1" option) Failed Cases:

  1. multi_channel_test: 2


Fast CI (with "-lp 1 -n 1000" options) (No regression):

(Executed its testing in 2 machines with ~6 days)

Fast CI (with "-lp 1 -n 1000" options) Failed Cases, which also fail without the PR applied:

  1. asm_test: 2
  2. defield_test: 4
  3. multi_channel_test: 2


Performance: some performance drop for small resolution (352x288):

Input Command Line Performance Argument Without PR #475 With PR #475 (at commit)
8K P420 8-bit ./SvtHevcEncApp -i 8k2D_musician_7680x3840_8bit_30Hz_P420.yuv -w 7680 -h 3840 -b 0.265 -n 10000 Average Speed (FPS) 38.96 38.91
Total Encoding Time (ms) 256673 256980
Total Execution Time (ms) 264407 264908
352x288 P420 8-bit ./SvtHevcEncApp -i akiyo_cif.y4m -b 0.265 -n 100000 Average Speed (FPS) 539.88 457.51
Total Encoding Time (ms) 185228 218576
Total Execution Time (ms) 185409 218760


Stress Testing:

  1. Issue #474 verification: encoding didn't hang with > 72 hours (> 210,000 times) repeat running.
  2. Continuous frames input: encoding didn't hang with > 500,000 input frames.
Austin-Hu commented 4 years ago

As for some performance drop "side effect" with small resolution (such as 352x288), it's because only the PPCS and PA Ref objects which are not referenced by other picture(s) any more can be dequeued from their corresponding FIFOs, rather than that the ones which are still referenced by other picture(s) could be dequeued without this PR applied. So by enlarging the length of those FIFOs, the encoding performance can be increased to ~600 FPS:

diff --git a/Source/Lib/Codec/EbEncHandle.c b/Source/Lib/Codec/EbEncHandle.c
index 3ca232a..853f9d3 100644
--- a/Source/Lib/Codec/EbEncHandle.c
+++ b/Source/Lib/Codec/EbEncHandle.c
@@ -1960,10 +1960,10 @@ void LoadDefaultBufferConfigurationSettings(
     sequenceControlSetPtr->tileGroupRowCountArray[5] = tileGroupRowCount;

     //#====================== Data Structures and Picture Buffers ======================
-    sequenceControlSetPtr->pictureControlSetPoolInitCount       = inputPic;
+    sequenceControlSetPtr->pictureControlSetPoolInitCount       = inputPic * 2;
     sequenceControlSetPtr->pictureControlSetPoolInitCountChild  = MAX(4, coreCount / 6);
     sequenceControlSetPtr->referencePictureBufferInitCount      = inputPic;//MAX((EB_U32)(sequenceControlSetPtr->inputOutputBufferFifoInitCount >> 1), (EB_U32)((1 << sequenceControlSetPtr->staticConfig.hierarchicalLevels) + 2));
-    sequenceControlSetPtr->paReferencePictureBufferInitCount    = inputPic;//MAX((EB_U32)(sequenceControlSetPtr->inputOutputBufferFifoInitCount >> 1), (EB_U32)((1 << sequenceControlSetPtr->staticConfig.hierarchicalLevels) + 2));
+    sequenceControlSetPtr->paReferencePictureBufferInitCount    = inputPic * 2;//MAX((EB_U32)(sequenceControlSetPtr->inputOutputBufferFifoInitCount >> 1), (EB_U32)((1 << sequenceControlSetPtr->staticConfig.hierarchicalLevels) + 2));
     sequenceControlSetPtr->reconBufferFifoInitCount             = sequenceControlSetPtr->referencePictureBufferInitCount;

     //#====================== Inter process Fifos ======================
Austin-Hu commented 4 years ago

As for some performance drop "side effect" with small resolution (such as 352x288), it's because only the PPCS and PA Ref objects which are not referenced by other picture(s) any more can be dequeued from their corresponding FIFOs, rather than that the ones which are still referenced by other picture(s) could be dequeued without this PR applied. So by enlarging the length of those FIFOs, the encoding performance can be increased to ~600 FPS:

diff --git a/Source/Lib/Codec/EbEncHandle.c b/Source/Lib/Codec/EbEncHandle.c
index 3ca232a..853f9d3 100644
--- a/Source/Lib/Codec/EbEncHandle.c
+++ b/Source/Lib/Codec/EbEncHandle.c
@@ -1960,10 +1960,10 @@ void LoadDefaultBufferConfigurationSettings(
     sequenceControlSetPtr->tileGroupRowCountArray[5] = tileGroupRowCount;

     //#====================== Data Structures and Picture Buffers ======================
-    sequenceControlSetPtr->pictureControlSetPoolInitCount       = inputPic;
+    sequenceControlSetPtr->pictureControlSetPoolInitCount       = inputPic * 2;
     sequenceControlSetPtr->pictureControlSetPoolInitCountChild  = MAX(4, coreCount / 6);
     sequenceControlSetPtr->referencePictureBufferInitCount      = inputPic;//MAX((EB_U32)(sequenceControlSetPtr->inputOutputBufferFifoInitCount >> 1), (EB_U32)((1 << sequenceControlSetPtr->staticConfig.hierarchicalLevels) + 2));
-    sequenceControlSetPtr->paReferencePictureBufferInitCount    = inputPic;//MAX((EB_U32)(sequenceControlSetPtr->inputOutputBufferFifoInitCount >> 1), (EB_U32)((1 << sequenceControlSetPtr->staticConfig.hierarchicalLevels) + 2));
+    sequenceControlSetPtr->paReferencePictureBufferInitCount    = inputPic * 2;//MAX((EB_U32)(sequenceControlSetPtr->inputOutputBufferFifoInitCount >> 1), (EB_U32)((1 << sequenceControlSetPtr->staticConfig.hierarchicalLevels) + 2));
     sequenceControlSetPtr->reconBufferFifoInitCount             = sequenceControlSetPtr->referencePictureBufferInitCount;

     //#====================== Inter process Fifos ======================

Addressed the "side effect" by commit.

Austin-Hu commented 4 years ago

Would it make more sense to add a null pointer check to EbReleaseObject?

It's supposed that those PPCS and PA Ref object's corresponding wrappers exist in the EncDec and RateControl processes. Thanks!

tianjunwork commented 4 years ago

./SvtHevcEncApp -i ../../../../yuv/bbb_1920x1080_420p.yuv -w 1920 -h 1080 -encMode 0 -intra-period 100 -scd 0 -rc 0 -q 22 -n 1000 -b 0.265 random hang. I wonder if it can be fixed by this PR too.

Austin-Hu commented 4 years ago

./SvtHevcEncApp -i ../../../../yuv/bbb_1920x1080_420p.yuv -w 1920 -h 1080 -encMode 0 -intra-period 100 -scd 0 -rc 0 -q 22 -n 1000 -b 0.265 random hang. I wonder if it can be fixed by this PR too.

Thanks for capturing this!

At first, this is another (fixed, not random) hang issue which hasn't been addressed by this PR, or isn't regression caused by this PR.

And here is the hang pattern due to encMode 0 and intra period:

Anyway, it's another hang issue we need to look at.

tianjunwork commented 4 years ago

Hi @Austin-Hu , the performance data in above table is without the fixes of fifo size, right? With the fix of PPCS and PA fifo size, cif fps goes up to 600?

Austin-Hu commented 4 years ago

Hi @Austin-Hu , the performance data in above table is without the fixes of fifo size, right? With the fix of PPCS and PA fifo size, cif fps goes up to 600?

Yes, I got the dropped performance data before figuring out this commit. And with that commit applied as well, there isn't performance gap with the whole PR applied or not.

tianjunwork commented 4 years ago
Memory: 1080p cif
master allocated memory 2.29G 899.14 MB
This PR allocated memory 2.83G 940MB

Speed: image Total execution time is not affected much.

image Both cif and 1080P fps gains at faster encmode.