CESNET / UltraGrid

UltraGrid low-latency audio and video network transmission system
http://www.ultragrid.cz

Decklink: audio requires large delay for sync. #268

Closed alatteri closed 1 year ago

alatteri commented 1 year ago

Hello,

I am seeing a large amount of audio delay needed for Decklink. I've measured it using the Catchin Sync app for iPhone. Audio is ahead of video.

All tests were run with this receiver command: uv -d decklink -r analog -P 5004 --param decoder-use-codec=R12L

See the following compressions and delays:

Note I've set both x265 and SVT_HEVC to use 2 frame lookahead (equivalent) which inherently adds a 2 frame audio offset. Normally I use the --audio-filter delay:2 parameter on the encoder but wanted this to be as simple as possible.

Input:2K@23.98p SDI

Delay: 735 ms ./UltraGrid.AppImage --tool uv -m 1316 -t decklink:device=0:codec=R12L -c libavcodec:encoder=libx265:crf=20:threads=2f -s embedded --audio-capture-format channels=8 --audio-codec=AAC:bitrate=256K 10.55.118.51

Delay: 975 ms ./UltraGrid.AppImage --tool uv -m 1316 -t decklink:device=0:codec=R12L -c libavcodec:encoder=libsvt_hevc::la_depth=2:preset=10:pred_struct=0:gop=24:qp=20 -s embedded --audio-capture-format channels=8 --audio-codec=AAC:bitrate=256K 10.55.118.51

Input:2K@24p SDI

Delay: 720 ms ./UltraGrid.AppImage --tool uv -m 1316 -t decklink:device=0:codec=R12L -c libavcodec:encoder=libx265:crf=20:threads=2f -s embedded --audio-capture-format channels=8 --audio-codec=AAC:bitrate=256K 10.55.118.51

Delay: 970 ms ./UltraGrid.AppImage --tool uv -m 1316 -t decklink:device=0:codec=R12L -c libavcodec:encoder=libsvt_hevc::la_depth=2:preset=10:pred_struct=0:gop=24:qp=20 -s embedded --audio-capture-format channels=8 --audio-codec=AAC:bitrate=256K 10.55.118.51

Input:UHD@23.98p SDI

Delay: 730 ms ./UltraGrid.AppImage --tool uv -m 1316 -t decklink:device=0:codec=R12L -c libavcodec:encoder=libx265:crf=20:threads=2f -s embedded --audio-capture-format channels=8 --audio-codec=AAC:bitrate=256K 10.55.118.51

Delay: 1035 ms ./UltraGrid.AppImage --tool uv -m 1316 -t decklink:device=0:codec=R12L -c libavcodec:encoder=libsvt_hevc::la_depth=2:preset=10:pred_struct=0:gop=24:qp=20 -s embedded --audio-capture-format channels=8 --audio-codec=AAC:bitrate=256K 10.55.118.51

Input:UHD@24p SDI

Delay: 760 ms ./UltraGrid.AppImage --tool uv -m 1316 -t decklink:device=0:codec=R12L -c libavcodec:encoder=libx265:crf=20:threads=2f -s embedded --audio-capture-format channels=8 --audio-codec=AAC:bitrate=256K 10.55.118.51

Delay: 1000 ms ./UltraGrid.AppImage --tool uv -m 1316 -t decklink:device=0:codec=R12L -c libavcodec:encoder=libsvt_hevc::la_depth=2:preset=10:pred_struct=0:gop=24:qp=20 -s embedded --audio-capture-format channels=8 --audio-codec=AAC:bitrate=256K 10.55.118.51
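To compare these delays with per-frame latencies discussed later, it can help to express them in frames. The following is a minimal sketch, assuming "23.98p" means 24000/1001 fps; the function name and the table layout are illustrative and not part of UltraGrid.

```python
def delay_in_frames(delay_ms: float, fps: float) -> float:
    """Convert an audio delay in milliseconds to a frame count at the given rate."""
    return delay_ms * fps / 1000.0


# Delays reported above: (input, encoder, fps, delay in ms).
# Assumption: "23.98p" is taken to be 24000/1001 fps.
MEASUREMENTS = [
    ("2K 23.98p", "libx265", 24000 / 1001, 735),
    ("2K 23.98p", "libsvt_hevc", 24000 / 1001, 975),
    ("2K 24p", "libx265", 24.0, 720),
    ("2K 24p", "libsvt_hevc", 24.0, 970),
    ("UHD 23.98p", "libx265", 24000 / 1001, 730),
    ("UHD 23.98p", "libsvt_hevc", 24000 / 1001, 1035),
    ("UHD 24p", "libx265", 24.0, 760),
    ("UHD 24p", "libsvt_hevc", 24.0, 1000),
]

if __name__ == "__main__":
    for source, encoder, fps, delay_ms in MEASUREMENTS:
        frames = delay_in_frames(delay_ms, fps)
        print(f"{source:11s} {encoder:12s} {delay_ms:5d} ms = {frames:5.1f} frames")
```

Under that assumption, every measurement works out to well over 2 frames, so the configured lookahead alone cannot account for the delay.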

MartinPulec commented 1 year ago

Hi, can I ask what the request actually is: that the latency of compression is too high? The point is that if the compression adds something like a second of video delay, the audio obviously needs to be delayed as well.

The end-to-end delay of SVT HEVC is actually much more than 2 frames, see below (new UG needed for gray=1):

uv --capture-filter color -t testcard:size=64x64:pattern=gray=1 -c \
   libavcodec:encoder=libsvt_hevc:la_depth=2:pred_struct=0 -p color -d dummy -V

[1669646885.309] [color pp] Center color is Y=40 U=128 V=128
[1669646885.346] [color cap. f.] Center color is Y=61 U=128 V=128

This means a latency of 21 frames (or 20 with la_depth=0). This wasn't tuned in UG until now, so the performance without any options was even worse; we could perhaps set la_depth=0:pred_struct=0 by default.
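The 21-frame figure follows from the two log lines: the capture side is already at Y=61 when the frame with Y=40 leaves the decoder. Here is a small sketch of that subtraction, assuming that the gray=1 pattern advances the luma value by one per frame and that the log tags are exactly as printed above; the function name is hypothetical, and wrap-around of the luma value is not handled.

```python
import re

# Matches e.g. "[1669646885.309] [color pp] Center color is Y=40 U=128 V=128"
CENTER_RE = re.compile(r"\[(?P<tag>[^\]]+)\] Center color is Y=(?P<y>\d+)")


def frame_latency(log_lines):
    """Return the luma difference between the capture filter and the postprocessor.

    Assumption: the test pattern increases Y by 1 per frame, so the difference
    equals the end-to-end latency in frames.
    """
    luma = {}
    for line in log_lines:
        match = CENTER_RE.search(line)
        if match:
            luma[match.group("tag")] = int(match.group("y"))
    return luma["color cap. f."] - luma["color pp"]


LOG = [
    "[1669646885.309] [color pp] Center color is Y=40 U=128 V=128",
    "[1669646885.346] [color cap. f.] Center color is Y=61 U=128 V=128",
]

if __name__ == "__main__":
    print(frame_latency(LOG), "frames")
```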

Interestingly, x265 also currently gives a latency of about 14 frames even though zerolatency is turned on, which is not good at all. For comparison, libx264 gives a latency of about 1 frame in a similar setup.

Note I've set both x265 and SVT_HEVC to use 2 frame lookahead (equivalent) which inherently adds a 2 frame audio offset.

Didn't you mean x265-params=rc-lookahead=2 rather than threads=2F? I would personally not recommend threads=2F: it effectively sets 2 threads instead of the automatic thread count (which is the number of logical cores). The 'F' flag is ignored because x265 doesn't advertise frame parallelism (although it supports it). I believe that x265-params=frame-threads=<n> would enforce frame parallelism (disabled otherwise by zerolatency).

alatteri commented 1 year ago

I should have stated this in the original issue... A few months back (prior to drift_fix merge) the amount of audio delay needed (at least in x265 as I wasn't using SVT until recently), got much larger. I used to run somewhere around ~175ms, and now it needs around ~700ms.

Didn't you mean x265-params=rc-lookahead=2 rather than threads=2F? I would personally not recommend threads=2F: it effectively sets 2 threads instead of the automatic thread count (which is the number of logical cores).

In my experience, the "frame threads" also induce their own latency equivalent to N. Notice also that the thread pool is still automatically allocating all 32 threads (Ryzen 5950) and automatically setting lookahead=0.

-c libavcodec:encoder=libx265:crf=20:threads=2f

x265 [info]: Thread pool created using 32 threads
x265 [info]: Slices                              : 1
x265 [info]: frame threads / pool features       : 2 / wpp(34 rows)
x265 [info]: Coding QT: max CU size, min CU size : 32 / 16
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge         : dia / 57 / 0 / 2
x265 [info]: Lookahead / bframes / badapt        : 0 / 0 / 0
x265 [info]: b-pyramid / weightp / weightb       : 0 / 0 / 0
x265 [info]: References / ref-limit  cu / depth  : 1 / off / off
x265 [info]: Rate Control / qCompress            : CRF-20.0 / 0.50
x265 [info]: tools: rd=2 psy-rd=2.00 early-skip rskip mode=1 tmvp cip
x265 [info]: tools: fast-intra strong-intra-smoothing lslices=6 deblock

-c libavcodec:encoder=libx265:crf=20:x265-params=rc-lookahead=2

With "frame-threads" set to 5, this would add N + "Lookahead" frames of latency.

x265 [info]: Thread pool created using 32 threads
x265 [info]: Slices                              : 1
x265 [info]: frame threads / pool features       : 5 / wpp(34 rows)
x265 [info]: Coding QT: max CU size, min CU size : 32 / 16
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge         : dia / 57 / 0 / 2
x265 [info]: Lookahead / bframes / badapt        : 2 / 0 / 0
x265 [info]: b-pyramid / weightp / weightb       : 0 / 0 / 0
x265 [info]: References / ref-limit  cu / depth  : 1 / off / off
x265 [info]: Rate Control / qCompress            : CRF-20.0 / 0.50
x265 [info]: tools: rd=2 psy-rd=2.00 early-skip rskip mode=1 tmvp cip
x265 [info]: tools: fast-intra strong-intra-smoothing lslices=6 deblock

Here with no parameters, "frame-threads" still 5,

-c libavcodec:encoder=libx265

x265 [info]: Thread pool created using 32 threads
x265 [info]: Slices                              : 1
x265 [info]: frame threads / pool features       : 5 / wpp(34 rows)
x265 [info]: Coding QT: max CU size, min CU size : 32 / 16
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge         : dia / 57 / 0 / 2
x265 [info]: Lookahead / bframes / badapt        : 0 / 0 / 0
x265 [info]: b-pyramid / weightp / weightb       : 0 / 0 / 0
x265 [info]: References / ref-limit  cu / depth  : 1 / off / off
x265 [info]: Rate Control / qCompress            : CRF-22.0 / 0.50
x265 [info]: tools: rd=2 psy-rd=2.00 early-skip rskip mode=1 tmvp cip
x265 [info]: tools: fast-intra strong-intra-smoothing lslices=6 deblock

MartinPulec commented 1 year ago

Thanks for the info, the x265 info output was very helpful. I didn't know that even if I do not set frame threading explicitly, x265 toggles it on internally if thread_nr is > 1 or 0 (auto), as it currently is. I've already disabled it, but of course this comes at the expense of throughput. You can still set the threads with an option.

Problem 2 was added latency on the receiver caused by ec702d5 (referenced in GH-241), so I've disabled it as well. I did the evaluation in a wrong way back then: I used a comparison "window" of 16 pictures, and the added latency was somewhere around that number, so it "wrapped around" and looked as if the latency were small when it wasn't. Also, as in the previous paragraph, it is a performance/throughput tradeoff. Low latency is now preferred by default (because added latency can be harder to see); if needed, the user can decide to allow frame threading and how many threads (and so how much delay) to use.
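The wrap-around effect described above can be illustrated with plain modular arithmetic. This is only a sketch of the measurement pitfall, not of UltraGrid's actual evaluation code; the function name and the example latencies are hypothetical.

```python
def observed_latency(true_latency_frames: int, window: int) -> int:
    """Latency as seen when frames are only distinguishable modulo `window`."""
    return true_latency_frames % window


if __name__ == "__main__":
    # With a 16-picture window, a true latency near 16 frames looks tiny.
    for true_latency in (1, 15, 16, 17):
        print(true_latency, "->", observed_latency(true_latency, 16))
```

A measurement window noticeably larger than the worst latency expected avoids the ambiguity.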

I believe that the latency of x265 should be decent now. libvpx has also greatly improved by setting rc_lookahead=0. SVT has improved somewhat by disabling frame threads on the decoder, but it is still not low-latency. The problem may be that it spawns 96 threads (at minimum!); if those are frame threads, it will certainly hurt latency. I didn't find any other latency-helping options in addition to those you've already discovered.

alatteri commented 1 year ago

Thanks Martin... I'll try to test the changes today.

alatteri commented 1 year ago

regarding commit 6716c55cc97211b6217564f9011a003be55eb8eb

This is great for x265 latency with HD/2K material, but results in "Your computer may be too SLOW to play this !!! [lavc hevc @ 0x7f2b34036440] Could not find ref with POC 6" with SVT_HEVC UHD/4K material. Can this be a user runtime toggle so we can choose what is best for the situation? Even better would be a resolution-based toggle, such that if the stream is below a defined resolution it uses one method of decode, and if above it uses the other, even with the latency penalty. I'd rather have a high-latency UHD stream than an unplayable stream.

alatteri commented 1 year ago

I've added the parameter lavd-thread-count=6f on the receiver, and it reduces latency by 400 ms compared to using 0f (auto), and is performant enough, at least with my decoder hardware, to not drop frames. I tried using Ns (slice) and Nfs, which should be (slice+frame), but I saw no improvement in decode over just Nf.

Ideally there would be some mechanism for UG to choose the best decode options based on input resolution or performance so it doesn't induce latency when not needed.
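Until such a mechanism exists in UG, a resolution-based choice could be made by a wrapper script on the receiver. The sketch below is one possible shape of that idea, not an UltraGrid feature: the UHD threshold, the function names, and the choice to leave the default untouched below the threshold are all assumptions, and the only UG option used is the lavd-thread-count parameter mentioned above.

```python
from typing import List, Optional

# Hypothetical threshold: streams with at least this many pixels get frame threads.
UHD_PIXELS = 3840 * 2160


def lavd_thread_count_for(width: int, height: int, frame_threads: int = 6) -> Optional[str]:
    """Return a lavd-thread-count value for large streams, or None to keep the default."""
    if width * height >= UHD_PIXELS:
        return f"{frame_threads}f"
    return None


def receiver_params(width: int, height: int) -> List[str]:
    """Build the extra receiver arguments for the given stream resolution."""
    value = lavd_thread_count_for(width, height)
    return [] if value is None else ["--param", f"lavd-thread-count={value}"]


if __name__ == "__main__":
    print(receiver_params(2048, 1080))   # 2K: no override
    print(receiver_params(3840, 2160))   # UHD: frame-threaded decode
```

The value 6 is simply the one that worked on the decoder hardware described here; other machines would need their own measurement.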

I've tried a few options for SVT_HEVC encode but nothing improved the decode situation.

Also, this reduction of 400 ms makes sense, as the decoder CPU has 16 threads. Each frame is about 40 ms. By specifying 6f instead of auto, the decoder is using 10 fewer threads, so 40 ms × 10 threads = 400 ms reduction.
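The estimate can be written out as a short calculation. This is a rough model only, assuming each decoder frame thread adds about one frame of delay, as the measurement above suggests; the function names are illustrative.

```python
def frame_duration_ms(fps: float) -> float:
    """Duration of one frame in milliseconds."""
    return 1000.0 / fps


def latency_saving_ms(auto_threads: int, chosen_threads: int, fps: float) -> float:
    """Estimated latency saved by using fewer frame threads.

    Assumption: every frame thread adds roughly one frame of decode latency.
    """
    return (auto_threads - chosen_threads) * frame_duration_ms(fps)


if __name__ == "__main__":
    # 16 logical cores in auto mode vs. lavd-thread-count=6f, at 24 fps.
    print(round(latency_saving_ms(16, 6, 24.0)), "ms")
```

At exactly 24 fps a frame lasts about 41.7 ms, so the model gives roughly 417 ms, in line with the observed 400 ms.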

MartinPulec commented 1 year ago

(FYI, first of all sorry, I didn't push the changes to master repo yesterday, I've done this right now)

This is great for x265 latency with HD/2K material, but results in "Your computer may be too SLOW to play this !!! [lavc hevc @ 0x7f2b34036440] Could not find ref with POC 6" with SVT_HEVC UHD/4K material.

You mean if the source is NVENC? My idea is that other streams (e.g. made by libx265) can be decoded reasonably with slice-based parallelism. For NVENC, it is still the same problem, if you remember: it looks like it should be possible to create multiple slices (as NVENC H.264 does), but it doesn't trigger parallel decoding in the FFmpeg HEVC decoder (the problem can be on either side).

Can this be a user runtime toggle so we can choose what is best for the situation?

Yes, that was exactly the idea. It should be --param lavd-thread-count=0FS (0 == auto threads, S slice, F frame). I'll look into whether I can provide a hint on the command line. I'm not sure if/which decision logic to implement: the thing is that an x265-encoded stream should still be possible to decode without frame parallelism. But it's open; other metrics can also be used (like the number of slices, if I can get it from the stream).

I've added parameter lavd-thread-count=6f on the receiver, and it reduces latency BY 400ms over using 0f (auto)

It is exactly as you noted later in the last paragraph.

tried using Ns (slice) and Nfs which should be (slice+frame) but I saw no improvement in decode over just Nf

The syntax is slightly different: 'n' shouldn't be combined with either 's' or 'f' (I don't even know what that should mean), basically:

It is just that slice is enabled by default unless other flag is given (that is why there is 'n').

Ideally there would be some mechanism for UG to choose the best decode options based on input resolution or performance so it doesn't induce latency when not needed.

Well, as noted above, resolution is not the only metric for assessing whether the decoder will be able to decode the stream. Performance is a good idea, but if it were the performance of decoding, decoding first needs to be started and the decoder possibly re-created, resulting in visual glitches.

I've tried a few options for SVT_HEVC encode but nothing improved the decode situation.

As I wrote at the start, the code was not yet pushed, so you can try with the current code. Basically, on one hand there is a latency reduction from disabling decode frame parallelism, but you'd perhaps need to enable it again anyway.

alatteri commented 1 year ago

This is great for x265 latency with HD/2K material, but results in "Your computer may be too SLOW to play this !!! [lavc hevc @ 0x7f2b34036440] Could not find ref with POC 6" with SVT_HEVC UHD/4K material.

You mean if the source is NVENC? My idea is that other streams (e.g. made by libx265) can be decoded reasonably with slice-based parallelism. For NVENC, it is still the same problem, if you remember: it looks like it should be possible to create multiple slices (as NVENC H.264 does), but it doesn't trigger parallel decoding in the FFmpeg HEVC decoder (the problem can be on either side).

SVT_HEVC is currently the only way 12-bit UHD 12G SDI can be encoded because of this issue: https://github.com/CESNET/UltraGrid/issues/248 It probably has something to do with SVT_HEVC using a YUV pixfmt while NVENC is using RGB16.

I don't think I will be able to test the latest code today, probably Thursday.

MartinPulec commented 1 year ago

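Okay, I've performed further research but the observations are not entirely positive:

  • SVT and NVENC do encode slices; but
  • FFmpeg HEVC decoder doesn't parallelize slice/tile decoding but it does WPP
  • libx265 produces a wavefront-parallel stream, meaning that its decoding is parallelized with FFmpeg, while the SVT and NVENC streams are not

Possibilities:

  • (obvious) use frame parallelism where needed (encode/decode)
  • splitting the frame manually - "--capture-filter split:2:2 -t ...". The point is, that the picture is split to 4 completely independent tiles that are encoded/decoded in parallel. A slight problem is that there may arise artifacts on the tiles' boundaries.
  • @mpiatka found the libde265 decoder - it looks a bit promising because it can do both WPP and slice parallelism, but the available FFmpeg wrapper is dated, so we'd need to update it (if we want to avoid using it directly); also, it supports neither 10-bit nor 4:2:2 (obviously only 4:2:0 then)
  • in theory, would hevc_cuvid work for you if it decodes 10-bit? --param force-lavd-decoder=hevc_cuvid --param decoder-use-codec=R12L works for me but a Turing card is needed (the codec auto-selection is not working if decoder-use-codec is not provided and UYVY is used)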

alatteri commented 1 year ago

Hi Martin,

Thank you for taking the time to look into this further. Much appreciated.

Okay, I've performed further research but the observations are not entirely positive:

  • SVT and NVENC do encode slices; but
  • FFmpeg HEVC decoder doesn't parallelize slice/tile decoding but it does WPP
  • libx265 produces a wavefront-parallel stream, meaning that its decoding is parallelized with FFmpeg, while the SVT and NVENC streams are not

Possibilities:

  • (obvious) use frame parallelism where needed (encode/decode)

This seems to be the most reliable course of action here. Now that I know I can manually set N-frame decoding, being able to reduce it to 6 is much better than the "auto" of 16.

  • splitting the frame manually - "--capture-filter split:2:2 -t ...". The point is, that the picture is split to 4 completely independent tiles that are encoded/decoded in parallel. A slight problem is that there may arise artifacts on the tiles' boundaries.

That is very interesting. In the past, I had tried using the x265 built-in slices parameter, and yes, there were clearly visible artifacts at the boundaries. But I will definitely try the capture-filter method with SVT_HEVC.

  • @mpiatka found the libde265 decoder - it looks a bit promising because it can do both WPP and slice parallelism, but the available FFmpeg wrapper is dated, so we'd need to update it (if we want to avoid using it directly); also, it supports neither 10-bit nor 4:2:2 (obviously only 4:2:0 then)

Unfortunately, if 10-bit 444 is not available as a minimum, then it doesn't fit our use case. Thank you for researching that.

  • in theory, would hevc_cuvid work for you if it decodes 10-bit? --param force-lavd-decoder=hevc_cuvid --param decoder-use-codec=R12L works for me but a Turing card is needed (the codec auto-selection is not working if decoder-use-codec is not provided and UYVY is used)

We've been using Intel NUCs connected to Blackmagic UltraStudio 4K Mini via thunderbolt. So no NVDEC is available. I do have a new NUC12, and that has Xe graphics.

NUC12WSKi7 Full with this processor:
https://www.intel.com/content/www/us/en/products/sku/226254/intel-core-i71260p-processor-18m-cache-up-to-4-70-ghz/specifications.html

According to https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video , that CPU (Alder Lake based) should be able to do hardware HEVC 10-bit 444, maybe even 12-bit, via vaapi/qsv. But I've never been able to get those working for encode/decode. I'm also not sure if ffmpeg has switched over to onevpl or if the master branch is still using libmfx. I've been using headless Ubuntu Server based installs; I'll install Desktop Minimal and see if that changes anything. I won't be able to do this until the weekend.

Thanks again as always for taking the time.

alatteri commented 1 year ago

  • splitting the frame manually - "--capture-filter split:2:2 -t ...". The point is, that the picture is split to 4 completely independent tiles that are encoded/decoded in parallel. A slight problem is that there may arise artifacts on the tiles' boundaries.

I tried this. At 2:2 the encoder could only do 22 fps. At 2:1 the encoder could keep up with 24 fps, but I still had to set the receiver to lavd-thread-count=4f. In the end, saving 2 frames of latency (since a normal stream requires 6f) just isn't worth the much greater risk of something going wrong. I saw a few instances where the slices would become 1 frame out of sync, and there is also the Decklink buffer issue.

As noted in commit c554fc707ff608cffb1d68aa32551bcd096cf5b7 the umv parameter seems to make no difference.

So barring CUVID, which has hardware challenges, or QSV (still to be tested), I think for 4K, SVT_HEVC with the inherent additional 6f latency is what we've got. HD/2K is great with x265.

MartinPulec commented 1 year ago

if 10bit 444 is not available as a minimum

I was wrong: I cited an obsolete PDF referenced from the web; it currently does support that. I've added it to the Linux FFmpeg builds. When tested, though, the performance didn't seem to improve significantly in the tested scenarios. It would likely differ in different conditions, so if you want to try it out, it should be invoked by --param force-lavd-decoder=libde265.

I tried this. At 2:2 encoder could only do 22fps. at 2:1 encoder could keep up with 24fps

Interesting, it may actually be too many threads, because by default it uses the number of logical cores for each stream; setting the thread count to ¼ might help. Going out of sync is definitely a problem - maybe it could be treated, but yeah, I understand.
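The "¼" suggestion amounts to dividing the logical cores among the tiles so that the encoders together do not oversubscribe the CPU. A minimal sketch of that division follows; the function name is hypothetical, and how the result is passed to the encoder (for example via the threads= option used earlier in this thread) is left to the reader.

```python
import os
from typing import Optional


def threads_per_tile(columns: int, rows: int, logical_cores: Optional[int] = None) -> int:
    """Split the logical cores evenly among the tiles, with at least one thread each."""
    cores = logical_cores if logical_cores is not None else (os.cpu_count() or 1)
    return max(1, cores // (columns * rows))


if __name__ == "__main__":
    # A 32-thread CPU with split:2:2 gives 4 tiles, i.e. a quarter of the cores each.
    print(threads_per_tile(2, 2, logical_cores=32))
    print(threads_per_tile(2, 1, logical_cores=32))
```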

I am afraid that I'll have to agree with your conclusions - NVENC/SVT use slices, which the FFmpeg homebrew decoder doesn't handle in parallel. libde265 looked great on paper, but I didn't see a significant speedup against the default one even for sliced videos.

alatteri commented 1 year ago

Thanks Martin.... I will try libde265 and reducing threads this weekend.

MartinPulec commented 1 year ago

Great, it is up to you, but if you find something out, I'd be glad to hear it. Actually, I was slightly disappointed by libde265 because I hoped that, since it supports slice-based parallelism, it might improve the performance, but that wasn't the case for me.

alatteri commented 1 year ago

I've tested decoding with --param force-lavd-decoder=libde265 and saw no improvement.

alatteri commented 1 year ago

I think we have done what we can regarding this issue. Commit 6716c55cc97211b6217564f9011a003be55eb8eb fixes the latency issue with x265, and we know why SVT_HEVC needs to use threads=Nf. So I'm going to close this issue. Thank you for the time taken. The dialog we have had gave me greater insight into being able to optimally configure my streams.