Xilinx / video-sdk

https://xilinx.github.io/video-sdk

Latency on hw H264 decode #62

Closed webcastno closed 1 year ago

webcastno commented 1 year ago

Hi,

We have integrated the Xilinx FFmpeg into our application. When using hw h264 decode (so using the Xilinx hardware) we see about a 260 ms latency increase compared to using the sw h264 decode. We have set the low-latency params (which also made a big difference, about 300 ms), but we still see this large decode latency increase.

Do you have any idea of other params that we should set, or anything else we could try, to decrease the latency?

Thanks in advance.

Jan

NastoohX commented 1 year ago

Hi Jan, Thank you for bringing this issue to our attention. To help with the debugging process, responses to the following will be useful:

1- Running environment, i.e., on-prem or VT1, OS, kernel and vsdk version
2- A representative FFmpeg CLI that exhibits this problem
3- A brief description of how latency is being measured and whether this measurement aligns with "-latency_logging", if applicable
4- If relevant parts of "journalctl" can be provided, they would also be useful

Cheers,

webcastno commented 1 year ago

Hi,

Here are the answers to your questions:

1: AWS VT1 instance, Ubuntu 18.04, Kernel: 5.4.0-1096-aws, video-sdk: 2.0
2: We are using FFmpeg from our code, so I don't have an example CLI.
3: At the moment we can only check the decode latency, so the latency_logging parameter is not of any use.
4: Also not relevant for decode latency.

We use a color-bar video with burnt-in timecode (1080p50, h264) and we get the following round-trip latencies:

For the sw decoder we use the built-in h264 decoder, and for Xilinx we use mpsoc_vcu_h264 with the corresponding low_latency param. For mpsoc_vcu_h264 we also have to use a filter graph with xvbm_convert to convert the pixel format to yuv.
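Roughly, the two decode paths look like this as CLI commands (a sketch only; the input and output here are placeholders, since we drive FFmpeg from our code rather than from the command line):

# Software decode path (built-in h264 decoder); input/output are placeholders.
ffmpeg -c:v h264 -i input.h264 -pix_fmt yuv420p -f rawvideo /dev/null

# Xilinx hardware decode path: mpsoc_vcu_h264 with low_latency, plus xvbm_convert
# to bring frames back to host memory and convert them to a yuv pixel format.
ffmpeg -c:v mpsoc_vcu_h264 -low_latency 1 -i input.h264 -vf xvbm_convert -pix_fmt yuv420p -f rawvideo /dev/null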

Is it expected that the Xilinx decoder has much higher latency than the sw decoder? If not, what can be done to improve it? We're sending video to the instance, decoding there, encoding again and sending it back over the internet. That is where we can compare latency against the original feed that went out. So don't be alarmed by the absolute latencies; we're just concerned about the differences.

Thanks, Jan

NastoohX commented 1 year ago

Hi, A couple of points to note (kindly take a look at https://xilinx.github.io/video-sdk/v2.0/using_ffmpeg.html#video-decoding and https://xilinx.github.io/video-sdk/v2.0/c_apis.html?highlight=splitbuff_mode#decoder-properties for more details):

1- The low_latency flag assumes that there are no B frames in the GOP.
2- Setting splitbuff_mode to 1 will decrease latency, on the assumption that a complete AU can be supplied to the decoder, e.g. AUD to AUD.
3- If you could provide the parameters that are passed to your decoder, it would be helpful.
4- latency_logging is applicable to all types of pipelines, including decode-only. As an example, using our latency_test.sh utility, you can measure latency on different portions of the pipeline. Here is an example:

a) cp /opt/xilinx/examples/u30/ffmpeg/quality_analysis/latency_test.sh /tmp

b) cd /tmp

c) Execute a decode-only pipeline, which sets latency_logging to 1, e.g.:

ffmpeg -c:v mpsoc_vcu_h264 -latency_logging 1 -i test.mp4 -y -vf xvbm_convert -pix_fmt yuv420p -f rawvideo /dev/null
...
Output #0, rawvideo, to '/dev/null':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf58.76.100
  Stream #0:0(und), 0, 1/25: Video: rawvideo, 1 reference frame (I420 / 0x30323449), yuv420p(left), 3840x2160 (0x0) [SAR 1:1 DAR 16:9], 0/1, q=2-31, 2488320 kb/s, 25 fps, 25 tbn (default)
    Metadata:
      handler_name    : VideoHandler
      vendor_id       : [0][0][0][0]
      encoder         : Lavc58.134.100 rawvideo
...

d) Enter sudo grep dec_handle /var/log/syslog and note the last entry, e.g.:

...
Mar 3 12:52:15 xsjbollaert10 XMA_Decoder[18126]: xma_dec_handle = 0x5620562dda00
Mar 3 12:53:25 xsjbollaert10 XMA_Decoder[21797]: xma_dec_handle = 0x55f4eb0f0a00
Mar 3 12:54:18 xsjbollaert10 XMA_Decoder[25167]: xma_dec_handle = 0x558358cd2a00

e) Run the utility, using the last noted pid:

sudo ./latency_test.sh /var/log/syslog 25167
rm: cannot remove '*.log': No such file or directory
Ladder contexts = 0x558358cd2a00
================== Generating logs for /var/log/syslog log file =========================
Usage: grep [OPTION]... PATTERN [FILE]...
Try 'grep --help' for more information.
Usage: grep [OPTION]... PATTERN [FILE]...
Try 'grep --help' for more information.
=============== Done generating logs! Measuring now... =======================
Frames decoded = 500
Average decoding latency = 180 ms
============== decoder done ===============
Frames encoded = 0
Average encoding latency = 0 ms
...

(You'll note that if splitbuff_mode is applicable, the average decoding latency will decrease.) For the above PAL example, the total decode delay is less than 5 frames without splitbuff_mode and 4 frames with it set. Let us know if this helps. Cheers,
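Putting the steps together, a minimal end-to-end sketch of the measurement (test.mp4 is a placeholder clip, and the PID passed to the script must be the one noted from your own syslog):

# Sketch of the decode-latency measurement described above.
cp /opt/xilinx/examples/u30/ffmpeg/quality_analysis/latency_test.sh /tmp
cd /tmp

# Decode-only pipeline with latency logging enabled.
ffmpeg -c:v mpsoc_vcu_h264 -latency_logging 1 -i test.mp4 -y \
       -vf xvbm_convert -pix_fmt yuv420p -f rawvideo /dev/null

# Note the PID of the last decoder instance (the number in XMA_Decoder[...])...
sudo grep dec_handle /var/log/syslog | tail -n 1

# ...and pass it to the utility (25167 is the PID from the example above).
sudo ./latency_test.sh /var/log/syslog 25167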

webcastno commented 1 year ago

Hi,

I'm only passing low_latency as a parameter to the decoder. I've tried splitbuff_mode, and it improves the latency by some amount. I did run the latency_test.sh script on my decoder and the average latency is around 30 ms with splitbuff_mode on and around 180 ms with splitbuff_mode off. But that's not the latency I'm seeing in my video. Maybe you are measuring the latency of the hw decoder and not the latency of the ffmpeg implementation?

Cheers, Jan

NastoohX commented 1 year ago

Hi Jan, Indeed, our script only measures delays from the CPU to the end of the decode operation. (Delays due to the transfer of uncompressed data from the device to the CPU are not included, as they are not applicable to a transcode pipeline, where data remains on the device.) Are you able to provide a sample of your video clip? If not, would you be able to describe its GOP structure, resolution and FPS? BTW, did you note the B-frame limitation of the low_latency flag? Cheers,

webcastno commented 1 year ago

Hi,

With the encoder config used for this test, our encoder should not produce B frames. We are using AVIOContext in FFmpeg, so we can easily create a recording of the stream by dumping everything out to a file. You can find it here: https://www.dropbox.com/s/8csbz22skcsgly0/rx_sv_09-03-2023%2010-59-38_out.h264.mpeg?dl=0

NastoohX commented 1 year ago

Hi, Thank you for providing this clip. We will run some analysis on it and get back to you as soon as we have something to report. Cheers,

webcastno commented 1 year ago

Hi,

Any feedback on the clip or any other suggestions for us?

Thanks, Jan

NastoohX commented 1 year ago

Hi, Sorry for the delay in our response. This issue has been brought up with our R&D team and we will provide feedback as soon as possible. Cheers,

NastoohX commented 1 year ago

Hi, After discussions with our R&D team and further verification, we have an explanation for the end-to-end delay you are observing in your pipeline. There are 2 components that contribute to this delay: 1) our XMA engine initialisation and 2) FFmpeg startup. Building FFmpeg, as per https://xilinx.github.io/video-sdk/v2.0/using_ffmpeg.html#rebuilding-ffmpeg, to include the h264 decoder applies the same FFmpeg startup delay to both pipelines, plus the additional XMA setup in the case of the mpsoc_vcu_h264 decoder, e.g.:

FFmpeg/ffmpeg -hide_banner -loglevel fatal -y -re -threads 1 -c:v h264 -i rx_sv_09-03-2023-10-59-38_out.h264.mpeg -frames:v 18000 -pix_fmt nv12 -f rawvideo /dev/null
stopped @ 1679698032900. Duration 35175

FFmpeg/ffmpeg -hide_banner -loglevel fatal -y -re -xlnx_hwdev 1 -c:v mpsoc_vcu_h264 -latency_logging 1 -low_latency 1 -splitbuff_mode 1 -i rx_sv_09-03-2023-10-59-38_out.h264.mpeg -frames:v 18000 -vf xvbm_convert -pix_fmt nv12 -f rawvideo /dev/null
stopped @ 1679705000962. Duration 35298

Furthermore, given that these delays are due to the initial setup, they are independent of the number of frames:

FFmpeg/ffmpeg -hide_banner -loglevel fatal -y -re -threads 1 -c:v h264 -i rx_sv_09-03-2023-10-59-38_out.h264.mpeg -frames:v 1 -pix_fmt nv12 -f rawvideo /dev/null
stopped @ 1679705370703. Duration 910

FFmpeg/ffmpeg -hide_banner -loglevel fatal -y -re -xlnx_hwdev 1 -c:v mpsoc_vcu_h264 -latency_logging 1 -low_latency 1 -splitbuff_mode 1 -i rx_sv_09-03-2023-10-59-38_out.h264.mpeg -frames:v 1 -vf xvbm_convert -pix_fmt nv12 -f rawvideo /dev/null
stopped @ 1679705370834. Duration 1041

The difference between the two pipelines above is roughly constant regardless of frame count (35298 - 35175 = 123 for 18000 frames vs 1041 - 910 = 131 for a single frame). The observed ~100-200 [ms] extra delay on the U30 is due to 1) further XMA setup and 2) PCIe transfer and other pixel-format conversion processing. The first item is a one-time delay cost. There is no way to avoid this using the FFmpeg CLI; however, using our direct C APIs, a pipeline can be arranged to buffer frames up to the point where XMA has been primed and then skip over to the end of the buffer. Alternatively, it is possible to have a pool of initialised pipelines pre-configured ahead of time. In both of these approaches, the main idea is to take the initial delay out of the main processing loop.
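As a very rough CLI-level illustration of the priming idea (not a supported workflow; warmup.h264, the FIFO path and the raw output sink are placeholders, and the decoded warm-up frames would still have to be discarded downstream), something along these lines could keep a decode pipeline up and already initialised before the live stream arrives:

# Create a named pipe to act as the input of the pre-started pipeline.
mkfifo /tmp/ingest.h264

# Start the hardware decode pipeline ahead of time, reading from the named pipe.
ffmpeg -xlnx_hwdev 0 -c:v mpsoc_vcu_h264 -low_latency 1 -splitbuff_mode 1 \
       -i /tmp/ingest.h264 -vf xvbm_convert -pix_fmt nv12 \
       -f rawvideo /tmp/decoded.yuv &

# Hold the write end open so the reader does not see EOF between writers,
exec 3>/tmp/ingest.h264
# prime the decoder with a short warm-up clip (same codec parameters as the live feed),
cat warmup.h264 >&3
# and later append the live elementary stream to the same pipe.
cat live_feed.h264 >&3
exec 3>&-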

Hope this clarifies the origin of the observed differences between CPU and U30 based decoding. Cheers,

NastoohX commented 1 year ago

Hi, Kindly let us know if the above resolved your issue. Cheers,

webcastno commented 1 year ago

Hi, we are currently working on something else, but will try to see what we can optimise using your information. Unfortunately we cannot use the C API; that would mean too large a rebuild at this stage. There may be some other things we can do, though. What puzzles me is that we do not see the same problem on encode using the U30; there seems to be no added latency there compared to using the CPU. At least the transfer time to and from the card should be the same, right?

Cheers, Jan

NastoohX commented 1 year ago

Hi, Waiting for the SPS/VUI to arrive, possibly until the next IDR, would be part of why there is a larger delay on decode. Cheers,

NastoohX commented 1 year ago

Hi, Closing this ticket for now. Feel free to reopen it or create a new one when needed. Cheers,