Replaced memcpy with referencing input frame buffers.

OpenVisualCloud / SVT-HEVC

SVT HEVC encoder. Scalable Video Technology (SVT) is a software-based video coding technology that is highly optimized for Intel® Xeon® processors. Using the open source SVT-HEVC encoder, it is possible to spread video encoding processing across multiple Intel® Xeon® processors to achieve a real advantage of processing efficiency.

Replaced memcpy with referencing input frame buffers. #525

Closed Austin-Hu closed 4 years ago

Austin-Hu commented 4 years ago

This PR replaces the memcpy in EbH265EncSendPicture by directly assigning the buffer pointers of the Input FIFO object to the YUV buffers of the EB_H265_ENC_INPUT object, which the application should keep safe (locked).
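The difference between copying and referencing can be sketched in a few lines. This is only a toy illustration in Python, not the encoder's C code: `frame`, `copied` and `referenced` are hypothetical stand-ins for the application's YUV buffer, the memcpy'd Input FIFO buffer, and the referenced pointer.

```python
# Illustration only: Python stand-ins for the C buffers discussed above.
frame = bytearray(b"\x10" * 16)   # application-owned samples (toy size)

copied = bytes(frame)             # memcpy style: the encoder owns a private copy
referenced = memoryview(frame)    # PR style: the encoder only keeps a reference

frame[0] = 0xFF                   # the application reuses its buffer too early

copy_first = copied[0]            # the copy is isolated from the overwrite
ref_first = referenced[0]         # the reference observes the overwrite
print(copy_first, ref_first)      # prints "16 255"
```

This is why the referenced buffers have to stay locked by the application until the encoder has finished reading them.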

The buffers belonging to an Input FIFO object don't need to be padded, because they're just used as source sample data in PictureAnalysis, MotionEstimation and EncDec (sharpness enabled) kernels.

Note: the de-blocking filter requires the width and height of input frames to be aligned to 8 samples, so the solution applies to almost all typical resolutions.
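A minimal sketch of the alignment condition in the note above. The helper name `is_aligned_to_8` is hypothetical and is not part of the SVT-HEVC API; it only restates the "multiple of 8 samples" requirement.

```python
def is_aligned_to_8(width: int, height: int) -> bool:
    """Return True when both dimensions are multiples of 8 samples."""
    return width % 8 == 0 and height % 8 == 0


# Resolutions used in this thread, plus one that fails the check.
for w, h in [(720, 480), (1920, 1080), (3840, 2160), (7680, 4320), (1366, 768)]:
    print(f"{w}x{h}: {'aligned' if is_aligned_to_8(w, h) else 'not aligned'}")
```

Of the resolutions listed, only 1366x768 fails, because 1366 is not a multiple of 8.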

ffmpeg & GStreamer SVT-HEVC plugins: reference the input AVFrame objects, and manage the referenced input frames with a linked list.

Signed-off-by: Austin Hu austin.hu@intel.com

Austin-Hu commented 4 years ago

Encoding performance, memory consumption and PTS 0 latency comparison between the master commit and the PR, when encoding 8K 8-bit P420 input on an Intel Xeon 8280:

Command Line	Master FPS	Master Memory (GB)	Master PTS 0 Latency (ms)	PR #525 FPS	PR #525 Memory (GB)	PR #525 PTS 0 Latency (ms)
./SvtHevcEncApp -i 8k2D_musician_7680x3840_8bit_30Hz_P420.yuv -w 7680 -h 3840 -b 0.265 -n 80000	40.35	44.91	~1640	40.90	38.05	~1400
./ffmpeg -stream_loop -1 -video_size 7680x3840 -r 60 -i 8k2D_musician_7680x3840_8bit_30Hz_P420.yuv -c:v libsvt_hevc -y 0.265	25	44.91	~2100	30	38.05	~1400

The encoding performance data was collected after encoding ~50000 frames. There is little improvement when encoding with SvtHevcEncApp, which needs more investigation.

tianjunwork commented 4 years ago

Thank you very much Austin for prototyping this feature!

SGShen commented 4 years ago

I did a test with the Sample App. The PR improves FPS and encoding time, but it somehow almost doubles the latency. Has anyone noticed a similar issue? Thanks.

	encMode	FPS	Encoding time	Exec Time	Avg Latency	Max Latency
no PR	10	39.29	25450	29144	396	562
w/ PR	10	47.71	20962	24680	796	1210
no PR	9	29.11	34357	38027	1064	1550
w/ PR	9	30.25	33059	36755	2405	3168

The command line I used is: $numactl --cpubind=0 --membind=0 ./SvtHevcEncApp -i ~/tests/videos/7680x3840_420_5.yuv -w 7680 -h 3840 -fps 30 -scd 0 -pred-struct 0 -irefresh-type 0 -rc 1 -tbr 70000000 -temporal-id 0 -lad 4 -encMode 10 -fpsinvps 0 -tile_col_cnt 6 -tile_row_cnt 12 -tile_slice_mode 1 -intra-period 4 -n 1000

Austin-Hu commented 4 years ago

Hi @SGShen ,

We’d better encode more (> 10000) frames, so that the statistics are stable enough.

The “Avg Latency” is calculated in SvtHevcEncApp as totalLatency / frameCount, where totalLatency accumulates the “encoding time” of each frame. The problem is that the “encoding time” is measured from the first encoding stage (Resource Coordination) to the last one (Packetization), so it doesn’t include the original memcpy time spent in EbH265EncSendPicture, which is the FIRST point of encoding each frame. So please measure the latency at the App level, from the call to EbH265EncSendPicture to the EbH265GetPacket call that returns the corresponding encoded frame (matched by PTS). Thanks!
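The App-level measurement suggested above could look roughly like the sketch below. It is written in Python for brevity and does not call the real encoder: `LatencyTracker`, `on_send` and `on_packet` are hypothetical names, meant to be invoked right before EbH265EncSendPicture and right after EbH265GetPacket returns a frame with the same PTS.

```python
import time


class LatencyTracker:
    """Per-frame latency from send to packet retrieval, keyed by PTS."""

    def __init__(self):
        self._sent = {}          # pts -> send timestamp in seconds
        self.latencies_ms = []   # completed per-frame latencies

    def on_send(self, pts, now=None):
        # Call just before handing the frame to the encoder.
        self._sent[pts] = time.monotonic() if now is None else now

    def on_packet(self, pts, now=None):
        # Call when the encoded frame with this PTS comes back.
        now = time.monotonic() if now is None else now
        latency_ms = (now - self._sent.pop(pts)) * 1000.0
        self.latencies_ms.append(latency_ms)
        return latency_ms

    def average_ms(self):
        if not self.latencies_ms:
            return 0.0
        return sum(self.latencies_ms) / len(self.latencies_ms)


# Usage with explicit timestamps (seconds) so the numbers are reproducible.
tracker = LatencyTracker()
tracker.on_send(pts=0, now=1.0)
tracker.on_send(pts=1, now=1.5)
first = tracker.on_packet(pts=0, now=2.0)
second = tracker.on_packet(pts=1, now=2.0)
print(first, second, tracker.average_ms())
```

Because the timestamp is taken before the send call, any copy done inside that call is included in the measured latency, unlike the pipeline-only “encoding time”.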

SGShen commented 4 years ago

@Austin-Hu Understood that the memcpy is not counted in the "Avg latency". But the results show that this PR has a negative impact on "Avg latency".

As you mentioned, "Avg. latency" = "Total Latency" / "frame count", and "Total Latency" does not include the memcpy that this PR removes. So removing the memcpy should not make "Avg. Latency" worse. Either the formula is computed incorrectly, or there is a hidden impact on the encoding.

Austin-Hu commented 4 years ago

With the 4th commit applied, here is the updated encoding performance and memory consumption comparison between master and with the PR applied, when encoding 8-bit P420 inputs with ffmpeg (encMode 7 & tune 1 as default) on Intel Xeon 8280:

Resolution	Encoded Frames	Command Line	FPS (Master)	FPS (PR #525)	Memory in GB (Master)	Memory in GB (PR #525)
480p	150000	./ffmpeg -stream_loop -1 -video_size 720x480 -r 60 -i 8k2D_musician_720x480_8bit_30Hz_P420.yuv -c:v libsvt_hevc -y 0.265	641	676	1.72	1.64
1080p	100000	./ffmpeg -stream_loop -1 -video_size 1920x1080 -r 60 -i Fallout4_1920x1080_8bit_60Hz_P420.yuv -c:v libsvt_hevc -y 0.265	222	223	4.4	4.01
4K	80000	./ffmpeg -stream_loop -1 -video_size 3840x2160 -r 60 -i 8k2D_musician_3840x2160_8bit_30Hz_P420.yuv -c:v libsvt_hevc -y 0.265	138	152	13.5	11.48
8K	40000	./ffmpeg -stream_loop -1 -video_size 7680x4320 -r 60 -i 8k2D_musician_7680x4320_8bit_30Hz_P420.yuv -c:v libsvt_hevc -y 0.265	23	27	49.39	41.7
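To make the table easier to compare across resolutions, the relative changes can be derived from the reported numbers. The snippet below is only arithmetic on the values in the table; the helper `percent_change` is a hypothetical name.

```python
def percent_change(master: float, pr: float) -> float:
    """Relative change of the PR value against master, in percent."""
    return (pr - master) / master * 100.0


# (master, PR #525) pairs copied from the table above.
results = {
    "480p": {"fps": (641, 676), "mem_gb": (1.72, 1.64)},
    "1080p": {"fps": (222, 223), "mem_gb": (4.4, 4.01)},
    "4K": {"fps": (138, 152), "mem_gb": (13.5, 11.48)},
    "8K": {"fps": (23, 27), "mem_gb": (49.39, 41.7)},
}

for name, r in results.items():
    fps = percent_change(*r["fps"])
    mem = percent_change(*r["mem_gb"])
    print(f"{name}: FPS {fps:+.1f}%, memory {mem:+.1f}%")
```

With these figures the FPS gain is largest at 8K (about +17%) and smallest at 1080p (under +1%), while memory drops at every resolution.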

The side effect of this PR found so far is that the size of the encoded bitstream increases in CQP rate control mode (measured after encoding 2000 frames of 8K input):

Rate Control Mode	Command Line	Bitstream Size in MB (Master)	Bitstream Size in MB (PR #525)	Bit Rate in Mbits/s (Master)	Bit Rate in Mbits/s (PR #525)
CQP	./ffmpeg -stream_loop -1 -video_size 7680x4320 -r 60 -i 8k2D_musician_7680x4320_8bit_30Hz_P420.yuv -c:v libsvt_hevc -vframes 2000 (-rc 1) -y 0.265	74.699	87.695	17.99	21.12
VBR		29.676	29.539	7.147	7.114

Austin-Hu commented 4 years ago

Even though the memcpy of an input frame takes “much” time per frame (~15 ms for 8K input), each frame takes much longer to go through the whole encoding pipeline (with no big difference between “-pred-struct” values), as the “Average Latency” reported by SvtHevcEncApp shows:

Command Line	CopyFrameBuffer (ms per frame)	Encoding Pipeline, LAD 17 (ms per frame)	Encoding Pipeline, LAD 1 (ms per frame)
./SvtHevcEncApp -i Fallout4_1920x1080_8bit_60Hz_P420.yuv -w 1920 -h 1080 -n 10000 -b 0.265	~1	~700	~680
./SvtHevcEncApp -i rally_8192x4096_30.yuv -w 8192 -h 4096 -n 10000 -b 0.265	~15	~3000	~2400

So even though CopyFrameBuffer takes ~0 ms after applying this PR, that doesn’t help much to improve the encoding performance. This PR would only be helpful in low latency mode, where encoding a frame takes much less time (for example, 100~200 ms).
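The argument can be checked with quick arithmetic. The sketch below assumes that the per-frame latency is simply the copy time plus the pipeline time (an assumption made for illustration, based on the approximate figures above); `copy_share` is a hypothetical helper.

```python
def copy_share(copy_ms: float, pipeline_ms: float) -> float:
    """Share of per-frame latency spent in the input copy, in percent.

    Assumes total latency = copy time + pipeline time.
    """
    return copy_ms / (copy_ms + pipeline_ms) * 100.0


# Approximate 8K figures from the table above, plus the low latency range.
cases = {
    "8K, LAD 17 (~3000 ms pipeline)": (15, 3000),
    "8K, LAD 1 (~2400 ms pipeline)": (15, 2400),
    "low latency, 200 ms pipeline": (15, 200),
    "low latency, 100 ms pipeline": (15, 100),
}

for name, (copy_ms, pipeline_ms) in cases.items():
    print(f"{name}: copy is {copy_share(copy_ms, pipeline_ms):.1f}% of latency")
```

Under that assumption the copy is roughly 0.5% of the per-frame latency in the default 8K runs, but about 7% to 13% when the pipeline takes only 100~200 ms.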

intelmark commented 4 years ago

Given the limited use case of this PR, the team thought it best to close it for now. In most cases CopyFrameBuffer accounts for only a small share of the overall encoding time, so the total latency is not reduced much.