OpenVisualCloud / SVT-HEVC

SVT HEVC encoder. Scalable Video Technology (SVT) is a software-based video coding technology that is highly optimized for Intel® Xeon® processors. Using the open source SVT-HEVC encoder, it is possible to spread video encoding processing across multiple Intel® Xeon® processors to achieve a real advantage of processing efficiency.

Replaced memcpy with referencing input frame buffers. #525

Closed Austin-Hu closed 4 years ago

Austin-Hu commented 4 years ago

This PR replaces the memcpy in EbH265EncSendPicture by directly assigning the YUV buffer pointers of the EB_H265_ENC_INPUT object to the Input FIFO object. The application must keep those buffers safe (locked) while the encoder is using them.

The buffers belonging to an Input FIFO object don't need to be padded, because they're just used as source sample data in PictureAnalysis, MotionEstimation and EncDec (sharpness enabled) kernels.
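The difference between copying and referencing, and the reason the application has to keep its buffers locked, can be illustrated with a minimal sketch. This is Python rather than the encoder's C code, and `send_picture_copy`, `send_picture_reference` and the `fifo_slot` dictionary are hypothetical stand-ins, not the SVT-HEVC API.

```python
def send_picture_copy(fifo_slot: dict, frame: bytearray) -> None:
    """Copying path: the FIFO slot gets its own private copy of the samples."""
    fifo_slot["luma"] = bytes(frame)  # full copy, comparable to memcpy


def send_picture_reference(fifo_slot: dict, frame: bytearray) -> None:
    """Referencing path: the FIFO slot only points at the caller's buffer."""
    fifo_slot["luma"] = memoryview(frame)  # no sample data is copied


frame = bytearray(b"\x10" * 16)  # stand-in for one plane of a YUV frame
copied, referenced = {}, {}
send_picture_copy(copied, frame)
send_picture_reference(referenced, frame)

# The application overwrites its buffer before the "encoder" is done with it.
frame[0] = 0xFF

copy_sees_change = copied["luma"][0] == 0xFF
reference_sees_change = referenced["luma"][0] == 0xFF
```

With the copying path the early overwrite is harmless. With the referencing path the consumer observes the modified sample, which is why the referenced buffers must stay untouched until the corresponding frame has been encoded.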

Note: the de-blocking filter requires the width and height of input frames to be aligned to 8 samples, so the solution applies to almost all typical resolutions.
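As a quick check of the alignment note, the sketch below tests the resolutions that appear in this thread against the 8-sample requirement. The helper name `is_aligned` is made up for illustration, and 1366x768 is added only as an example of a resolution that would not qualify.

```python
ALIGNMENT = 8  # samples, as required by the de-blocking filter per the note above


def is_aligned(width: int, height: int, alignment: int = ALIGNMENT) -> bool:
    """Return True when both dimensions are multiples of `alignment`."""
    return width % alignment == 0 and height % alignment == 0


# Resolutions used in the measurements of this thread.
resolutions = [
    (720, 480),
    (1920, 1080),
    (3840, 2160),
    (7680, 3840),
    (7680, 4320),
    (8192, 4096),
]
unaligned = [r for r in resolutions if not is_aligned(*r)]

# An example of a resolution whose width is not a multiple of 8.
odd_case_ok = is_aligned(1366, 768)
```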

ffmpeg & GStreamer SVT-HEVC plugins: each references the input AVFrame objects and manages the referenced input frames with a linked list.

Signed-off-by: Austin Hu austin.hu@intel.com

Austin-Hu commented 4 years ago

Encoding performance, memory consumption and PTS 0 latency comparison between master commit and with the PR applied, when encoding 8K 8-bit P420 input with Intel Xeon 8280:

| Command Line | Master FPS | Master Memory (GB) | Master PTS 0 Latency (ms) | PR #525 FPS | PR #525 Memory (GB) | PR #525 PTS 0 Latency (ms) |
|---|---|---|---|---|---|---|
| ./SvtHevcEncApp -i 8k2D_musician_7680x3840_8bit_30Hz_P420.yuv -w 7680 -h 3840 -b 0.265 -n 80000 | 40.35 | 44.91 | ~1640 | 40.90 | 38.05 | ~1400 |
| ./ffmpeg -stream_loop -1 -video_size 7680x3840 -r 60 -i 8k2D_musician_7680x3840_8bit_30Hz_P420.yuv -c:v libsvt_hevc -y 0.265 | 25 | 44.91 | ~2100 | 30 | 38.05 | ~1400 |


The encoding performance data was collected after encoding ~50000 frames. There is little improvement when encoding with SvtHevcEncApp, which needs more investigation.

tianjunwork commented 4 years ago

Thank you very much Austin for prototyping this feature!

SGShen commented 4 years ago

I did a test with the Sample App. The PR improves FPS and encoding time, but it somehow almost doubles the latency. Has anyone noticed a similar issue? Thanks.

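| | encMode | FPS | Encoding time | Exec Time | Avg Latency | Max Latency |
|---|---|---|---|---|---|---|
| no PR | 10 | 39.29 | 25450 | 29144 | 396 | 562 |
| w/ PR | 10 | 47.71 | 20962 | 24680 | 796 | 1210 |
| no PR | 9 | 29.11 | 34357 | 38027 | 1064 | 1550 |
| w/ PR | 9 | 30.25 | 33059 | 36755 | 2405 | 3168 |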

The command line I used is: $numactl --cpubind=0 --membind=0 ./SvtHevcEncApp -i ~/tests/videos/7680x3840_420_5.yuv -w 7680 -h 3840 -fps 30 -scd 0 -pred-struct 0 -irefresh-type 0 -rc 1 -tbr 70000000 -temporal-id 0 -lad 4 -encMode 10 -fpsinvps 0 -tile_col_cnt 6 -tile_row_cnt 12 -tile_slice_mode 1 -intra-period 4 -n 1000

Austin-Hu commented 4 years ago


Hi @SGShen ,

We’d better encode more (> 10000) frames, so that the statistics are stable enough.

The “Avg Latency” is calculated in SvtHevcEncApp as totalLatency / frameCount, where totalLatency is accumulated from the “encoding time” of each frame. The problem is that the “encoding time” is measured from the first encoding stage (Resource Coordination) to the last one (Packetization), so it doesn’t include the original memcpy time spent in EbH265EncSendPicture, which is the FIRST step in encoding each frame. So please measure the latency at the App level, from the call to EbH265EncSendPicture to the EbH265GetPacket call that returns the corresponding encoded frame (with PTS). Thanks!
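A minimal sketch of that App-level measurement is given below, assuming each input frame and each output packet can be matched by PTS. It is written in Python for brevity, not as SvtHevcEncApp code. The `LatencyTracker` class and its method names are hypothetical, with `on_send` meant to be called right before EbH265EncSendPicture and `on_packet` right after EbH265GetPacket returns the matching packet.

```python
import time


class LatencyTracker:
    """Tracks per-frame latency from picture submission to packet retrieval."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock      # injectable so the tracker can be tested
        self._sent_at = {}       # PTS -> timestamp (seconds) of submission
        self.latencies_ms = []   # one entry per completed frame

    def on_send(self, pts):
        """Call immediately before submitting the frame with this PTS."""
        self._sent_at[pts] = self._clock()

    def on_packet(self, pts):
        """Call when the packet carrying this PTS is received."""
        latency_ms = (self._clock() - self._sent_at.pop(pts)) * 1000.0
        self.latencies_ms.append(latency_ms)
        return latency_ms

    def average_ms(self):
        return sum(self.latencies_ms) / len(self.latencies_ms)

    def max_ms(self):
        return max(self.latencies_ms)
```

Because the start timestamp is taken before the frame is handed to the encoder, any input copy made while submitting the frame is included in the measured latency, unlike the pipeline-only “encoding time”.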

SGShen commented 4 years ago

@Austin-Hu Understood that the memcpy was not counted in the "Avg latency". But the results show that this PR has a negative impact on "Avg latency".

As you mentioned, "Avg. latency" = "Total Latency" / "frame count", and "Total Latency" does not include the memcpy that this PR removes. So a worse "Avg. Latency" with this PR seems unlikely. Either the formula is computed incorrectly, or the PR has a hidden impact on the encoding.

Austin-Hu commented 4 years ago

With the 4th commit applied, here is the updated encoding performance and memory consumption comparison between master and with the PR applied, when encoding 8-bit P420 inputs with ffmpeg (encMode 7 & tune 1 as default) on Intel Xeon 8280:

| Resolution | Encoded Frames | Command Line | FPS (Master) | FPS (PR #525) | Memory (GB) (Master) | Memory (GB) (PR #525) |
|---|---|---|---|---|---|---|
| 480p | 150000 | ./ffmpeg -stream_loop -1 -video_size 720x480 -r 60 -i 8k2D_musician_720x480_8bit_30Hz_P420.yuv -c:v libsvt_hevc -y 0.265 | 641 | 676 | 1.72 | 1.64 |
| 1080p | 100000 | ./ffmpeg -stream_loop -1 -video_size 1920x1080 -r 60 -i Fallout4_1920x1080_8bit_60Hz_P420.yuv -c:v libsvt_hevc -y 0.265 | 222 | 223 | 4.4 | 4.01 |
| 4K | 80000 | ./ffmpeg -stream_loop -1 -video_size 3840x2160 -r 60 -i 8k2D_musician_3840x2160_8bit_30Hz_P420.yuv -c:v libsvt_hevc -y 0.265 | 138 | 152 | 13.5 | 11.48 |
| 8K | 40000 | ./ffmpeg -stream_loop -1 -video_size 7680x4320 -r 60 -i 8k2D_musician_7680x4320_8bit_30Hz_P420.yuv -c:v libsvt_hevc -y 0.265 | 23 | 27 | 49.39 | 41.7 |


The side effect found so far is that the size of the encoded bitstream increases in CQP rate control mode (after encoding 2000 frames of 8K input):

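| Rate Control Mode | Command Line | Bitstream Size (MB) (Master) | Bitstream Size (MB) (PR #525) | Bit Rate (Mbits/s) (Master) | Bit Rate (Mbits/s) (PR #525) |
|---|---|---|---|---|---|
| CQP | ./ffmpeg -stream_loop -1 -video_size 7680x4320 -r 60 -i 8k2D_musician_7680x4320_8bit_30Hz_P420.yuv -c:v libsvt_hevc -vframes 2000 (-rc 1) -y 0.265 | 74.699 | 87.695 | 17.99 | 21.12 |
| VBR | ./ffmpeg -stream_loop -1 -video_size 7680x4320 -r 60 -i 8k2D_musician_7680x4320_8bit_30Hz_P420.yuv -c:v libsvt_hevc -vframes 2000 (-rc 1) -y 0.265 | 29.676 | 29.539 | 7.147 | 7.114 |
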
Austin-Hu commented 4 years ago

Even though the memcpy of the input frame takes “much” time per frame (~15 ms for 8K input), each frame takes much longer to go through the whole encoding pipeline (with no big difference across “-pred-struct” values), as the “Average Latency” reported by SvtHevcEncApp shows:

Time consumed by each frame (ms):

| Command Line | CopyFrameBuffer | Encoding Pipeline (LAD 17) | Encoding Pipeline (LAD 1) |
|---|---|---|---|
| ./SvtHevcEncApp -i Fallout4_1920x1080_8bit_60Hz_P420.yuv -w 1920 -h 1080 -n 10000 -b 0.265 | ~1 | ~700 | ~680 |
| ./SvtHevcEncApp -i rally_8192x4096_30.yuv -w 8192 -h 4096 -n 10000 -b 0.265 | ~15 | ~3000 | ~2400 |


So even though CopyFrameBuffer takes ~0 ms after applying this PR, that does little to improve the encoding performance. This PR would only be helpful in a low latency mode, where encoding a frame takes much less time (for example, 100~200 ms).
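The proportions behind this conclusion can be worked out from the approximate timings in the table above. The sketch below assumes that the pipeline time excludes the copy, so the per-frame total is taken as their sum. The low latency entries reuse the ~15 ms 8K copy time with the hypothetical 100~200 ms pipeline times from the example, not measured data.

```python
def copy_share(copy_ms: float, pipeline_ms: float) -> float:
    """Fraction of the per-frame time (copy + pipeline) spent in the input copy."""
    return copy_ms / (copy_ms + pipeline_ms)


# Approximate values measured with LAD 17 (see the table above).
measured = {
    "1080p": copy_share(1, 700),
    "8K": copy_share(15, 3000),
}

# Hypothetical low latency pipeline times, taken from the 100~200 ms example.
low_latency = {
    "8K copy, 200 ms pipeline": copy_share(15, 200),
    "8K copy, 100 ms pipeline": copy_share(15, 100),
}

for label, share in {**measured, **low_latency}.items():
    print(f"{label}: {share:.2%}")
```

With the measured pipeline times the copy stays well under 1% of the per-frame time, whereas in the hypothetical 100~200 ms range it would be roughly 7% to 13%, which is where removing it could start to matter.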

intelmark commented 4 years ago

Given the limited use case of this PR, the team thought it best to close it for now. In most cases CopyFrameBuffer accounts for quite a low percentage of the overall encoding time, hence the total latency is not reduced much.