CESNET / UltraGrid

UltraGrid low-latency audio and video network transmission system
http://www.ultragrid.cz

12Bit 4:4:4 4K Decklink output performance issues #406

Open slm-sean opened 3 weeks ago

slm-sean commented 3 weeks ago

Hi,

I recently ran into an issue where decoding and outputting 12-bit 4:4:4 4K at 24 FPS on a DeckLink 8k pro mini shows significant performance problems. We are using the Comprimato J2K library, and we noticed that when outputting 12-bit 4:4:4 at 4K DCI, our systems have a single core pegged at 90-100% utilization. On some of our hardware configurations (dual Xeon Gold 5120 CPUs) they are unable to maintain real-time playback. If I remove the DeckLink output option, single-core utilization drops significantly, to around 50% at the highest.

What is curious is that these systems are able to handle encoding three 4K DCI 12-bit 4:4:4 streams concurrently, which I would expect to be the harder task, but it appears that something about outputting to a Blackmagic device is more intensive than capturing.

System configuration: dual Xeon Gold 5120 @ 2.20 GHz, 16 GB RAM, Ubuntu 22.04, Ada 4000 GPU.

System with no Blackmagic output enabled: (screenshot of CPU utilization)

System with Blackmagic output enabled: (screenshot of CPU utilization)

alatteri commented 3 weeks ago

What BMD driver?

slm-sean commented 3 weeks ago

We have tried both 12.9 and 14.1. I was going to try 14.2 today as well, but I don't expect that to make a difference.

alatteri commented 3 weeks ago

full command syntax on both sender and receiver?

slm-sean commented 3 weeks ago

We are currently using the GUI and manually editing the command string. Here is what we are currently using.

Sender:

--capture-filter preview:key=d8247m8m -t decklink:connection=SDI:device=1 -c cmpto_j2k:quality=1:mem_limit=1000000000:rate=160M -d preview:key=d8247m8m --audio-filter controlport_stats -s embedded --audio-capture-format channels=2 --audio-codec PCM -r dummy -P 5008 --param use-hw-accel,errors-fatal

Receiver:

--capture-filter preview:key=v6vo1vke -d multiplier:decklink:device=0:single-link#preview:key=v6vo1vke --audio-filter controlport_stats -r embedded --control-port 0 --param use-hw-accel,errors-fatal

We have systems built on older Z440s that can clock up to 3.5 GHz which are capable of decoding a single stream while outputting to a BMD 8k pro mini using these commands.

alatteri commented 3 weeks ago

Have you tried without the capture preview stuff? Whenever I am troubleshooting, I start with the bare minimum command syntax and then build up during tests.

slm-sean commented 3 weeks ago

Yep, I have disabled previews.

My next step is trying to recompile with the AJA SDK to see if there is lower CPU utilization, but I don't think I have a card available that can handle 4K 12-bit from AJA right now.

alatteri commented 3 weeks ago

We output 4K 12-bit 4:4:4 from a BMD UltraStudio Mini 4K on an Intel NUCi713 without issue.

Try the basics first:

VIDEODEVICE="decklink:synchronized"
UVOPTPARAMS="--param use-hw-accel,resampler=soxr,decoder-use-codec=R12L"
uv -d $VIDEODEVICE $UVOPTPARAMS

alatteri commented 3 weeks ago

Also, you are calling --audio-filter but it looks like you are not really doing anything with it.

slm-sean commented 3 weeks ago

I will give these a try. Thank you for the advice. Will report back

slm-sean commented 2 weeks ago

I tried your suggestions, but I am still getting dropped frames. Including the synchronized flag seemed to introduce audio buffer overflows in the log. Do you experience that with your systems?

I wonder if it could be related to the type of I/O card I am using.

Are you also using the Comprimato J2K codec for your streams, or are you using HEVC / some other codec? Perhaps it's the combination of the two that's causing my performance issues.

alatteri commented 2 weeks ago

x265. No issues on output.

MartinPulec commented 2 weeks ago

Hi,

> I recently ran into an issue where decoding and outputting 12-bit 4:4:4 4K at 24 FPS on a DeckLink 8k pro mini shows significant performance problems. We are using the Comprimato J2K library, and we noticed that when outputting 12-bit 4:4:4 at 4K DCI, our systems have a single core pegged at 90-100% utilization. On some of our hardware configurations (dual Xeon Gold 5120 CPUs) they are unable to maintain real-time playback.

I believe that the issue is caused by UltraGrid's internal conversion to the R12L pixel format for BMD, which is run on the CPU.

> If I remove the DeckLink output option, single-core utilization drops significantly, to around 50% at the highest.

Sure, because if there is no display, video is not decoded at all.

> What is curious is that these systems are able to handle encoding three 4K DCI 12-bit 4:4:4 streams concurrently, which I would expect to be the harder task, but it appears that something about outputting to a Blackmagic device is more intensive than capturing.

I've just pushed an update to Git that runs the conversion in parallel. If I have identified the problem correctly, it should help in your case.
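Roughly, the change looks like the following (a simplified sketch with hypothetical names, not the actual UltraGrid code): the frame is split into horizontal strips and each strip is converted by its own worker thread instead of a single thread walking the whole frame.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical per-row routine standing in for the real RG48 -> R12L
// packing code; the actual bit packing is omitted here.
static void convert_rows(const uint8_t *in, uint8_t *out, size_t in_linesize,
                         size_t out_linesize, int row_from, int row_to)
{
        for (int y = row_from; y < row_to; ++y) {
                // ... pack one line of RG48 samples into R12L ...
                (void) in; (void) out; (void) in_linesize; (void) out_linesize;
        }
}

// Split the frame into horizontal strips and convert them in parallel.
static void convert_frame_parallel(const uint8_t *in, uint8_t *out,
                                   size_t in_linesize, size_t out_linesize,
                                   int height, unsigned n_threads)
{
        std::vector<std::thread> workers;
        const int rows_per_thread =
                (height + (int) n_threads - 1) / (int) n_threads;
        for (unsigned i = 0; i < n_threads; ++i) {
                const int from = (int) i * rows_per_thread;
                const int to = std::min(height, from + rows_per_thread);
                if (from >= to) {
                        break;
                }
                workers.emplace_back(convert_rows, in, out, in_linesize,
                                     out_linesize, from, to);
        }
        for (auto &t : workers) {
                t.join();
        }
}
```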

Ideally, the conversion could be done on the GPU directly - I may take a look at this later.

slm-sean commented 2 weeks ago

Hi Martin,

That appears to have done the trick. I just deployed the new build to one of the systems that was having the issue; per-core CPU utilization is staying below 80%, and I'm getting a full 24 FPS on the decompression side.

Thank you for the quick fix. If you do implement the conversion on the GPU, I would be more than glad to assist with testing if you need it.

I will continue testing and reporting back if I run into any more issues.

MartinPulec commented 2 weeks ago

Hi, the kernel for the decompression side is already in Git (the commit above).

It requires the CUDA toolkit for compilation (this can be ensured with --enable-cuda; otherwise the cmpto_j2k UG codec module doesn't strictly require CUDA to be present, although it is better with it).
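(In practice that means regenerating the build, e.g. ./autogen.sh --enable-cuda followed by make; the autogen script passes the flag on to configure.)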

I am going to add the symmetric CUDA kernel to the encoder today as well.

MartinPulec commented 2 weeks ago

> I am going to add the symmetric CUDA kernel to the encoder today as well.

Done - there are also other improvements, mainly that on the encoder the frame is now copied directly to GPU memory. As there are multiple changes and the complexity has somewhat increased, I'd definitely be glad if you could test.

Please note that the encoder changes work only if a single CUDA device is used (the -D UG parameter) - I ran into some trouble when trying more, but I believe that multi-GPU encoding is not much needed today, since GPUs have become much more powerful than 10 years ago. Nevertheless, the original code path is used as a fallback for multi-GPU setups.
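For illustration, the encode path now does roughly the following (a simplified sketch with hypothetical names and without error handling, not the actual code): the captured frame is copied to device memory once, the preprocessing kernel unpacks it there, and the resulting device buffer is handed to the encoder.

```cuda
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical preprocessing kernel: one thread per pixel, unpacking the
// captured 12-bit samples into 16-bit RG48 samples for the encoder.
// (The actual R12L bit layout is more involved and is omitted here.)
__global__ void preprocess_to_rg48(const uint8_t *in, uint16_t *out,
                                   int width, int height)
{
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) {
                return;
        }
        // ... read the packed 12-bit R, G and B samples of pixel (x, y)
        // from 'in' and store them as three 16-bit values in 'out' ...
        (void) in;
        out[3 * ((size_t) y * width + x)] = 0; // placeholder store
}

// Host side: copy the captured frame to the GPU, convert it there and
// leave the result in device memory for the J2K encoder.
void preprocess_frame(const uint8_t *host_frame, size_t frame_bytes,
                      uint8_t *d_in, uint16_t *d_rg48, int width, int height)
{
        cudaMemcpy(d_in, host_frame, frame_bytes, cudaMemcpyHostToDevice);
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        preprocess_to_rg48<<<grid, block>>>(d_in, d_rg48, width, height);
        cudaDeviceSynchronize();
        // d_rg48 stays on the GPU and is passed to the encoder, avoiding a
        // second host<->device round trip.
}
```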

slm-sean commented 2 weeks ago

This is amazing, thank you for this. I will look at testing this next week and will report back.

I believe you are correct, multi-GPU is not as important. We do use multiple GPUs in a single system, but we would only assign one GPU per encode, as we are trying to build a denser infrastructure.

slm-sean commented 1 week ago

Hi @MartinPulec, I recompiled with the last commits and I'm getting the following error when launching ultragrid.

[lib] Library /tmp/.mount_UltraGLfAljA/usr/bin/../lib/ultragrid/ultragrid_vcompress_cmpto_j2k.so opening warning: /tmp/.mount_UltraGLfAljA/usr/bin/../lib/ultragrid/ultragrid_vcompress_cmpto_j2k.so: undefined symbol: _Z23preprocess_r12l_to_rg48iiPvS_
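(For reference, c++filt demangles that symbol to preprocess_r12l_to_rg48(int, int, void*, void*), so it appears to be the new R12L -> RG48 preprocessing routine that the compress module can't find.)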

I did include the --enable-cuda flag when running the autogen script.

Not sure if I'm missing something during the compile; please let me know if this is an issue on my end.

Thanks, Sean

MartinPulec commented 1 week ago

Thanks for the info, I forgot to also add the kernel object to the compress module (the problem occurs only if --enable-plugins is used, which I didn't use for my tests). It should be fixed now.

slm-sean commented 5 days ago

Hi @MartinPulec,

Got some test time in today with the new build and that fixed the previous issue.

I am now getting the following warning when I am encoding a 4:4:4 12-bit stream.

(screenshot: encoder warning)

It appears that the GPU encode, with the conversion now being handled by the GPU, is maxing out our GPUs.

(screenshot: GPU utilization readout)

This conversion appears to be very taxing for a 1080 Ti. I will try this on another card later tonight when I have a different system available.

If this is the case and there are no further optimizations that can be done, I think we may need a flag to enable/disable the GPU conversion for systems with lower GPU resources.

Cheers, Sean

alatteri commented 5 days ago

which program gives you this usage report?

slm-sean commented 5 days ago

It's called nvtop. I cut out some of the interface, but it shows per-process information as well.

https://github.com/Syllo/nvtop
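I believe nvtop is also packaged in the Ubuntu 22.04 repos (sudo apt install nvtop), if you'd rather not build it from source, though the packaged version may be a bit older.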

If you compile btop from source, you can also get GPU utilization, but nvtop has more detail.

https://github.com/aristocratos/btop

MartinPulec commented 4 days ago

Hi @slm-sean ,

> If this is the case and there are no further optimizations that can be done

It is just the opposite - this is just an initial version. That's actually why the warning is there (I've only tested it with FullHD so far, and there the duration looked OK, but apparently not at higher resolutions). This version was not optimized at all, so there is plenty of room for that.

I've just optimized the storing/loading of the samples in the CUDA kernel, and the duration seems to have decreased from 16.6 to 0.6 ms for me using -t testcard:codec=R12L:size=dci4, also with a 1080 Ti. (Depending on the complexity of the CUDA kernel, the loads/stores can have by far the biggest impact.)
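To illustrate the kind of change this is (a generic sketch, not the actual conversion kernel): instead of each thread reading and writing individual bytes, each thread now moves one aligned 16-byte chunk, so a warp's memory transactions are wide and coalesced. The real kernel of course still does the 12-bit unpacking between the load and the store, but the access pattern is what dominates.

```cuda
#include <cstddef>
#include <cstdint>

// Naive variant: one byte per thread - many narrow memory transactions.
__global__ void copy_bytes(const uint8_t *in, uint8_t *out, size_t len)
{
        size_t i = (size_t) blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len) {
                out[i] = in[i];
        }
}

// Optimized variant: one 16-byte uint4 per thread - the same data moved
// with far fewer, fully coalesced loads and stores (the buffer length
// must be a multiple of 16 bytes here).
__global__ void copy_uint4(const uint4 *in, uint4 *out, size_t len_in_uint4)
{
        size_t i = (size_t) blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len_in_uint4) {
                out[i] = in[i];
        }
}
```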

Could you please try this (either encoder or decoder, it doesn't matter) to measure GPU conversion duration:

uv <your_args> --verbose=debug |  grep 'elapsed time\|pixfmt conversion duration'

compared to CPU conversion duration:

uv <your_args> --verbose=debug --param j2k-enc-cpu-conv,j2k-dec-cpu-conv | grep 'elapsed time\|pixfmt conversion duration'

(Please note that any --param is considered a development option and may be removed/changed/etc. at any time. If we conclude that the CPU conversion is better, I'd rather make that the default.)

Then you can check which version is faster now. In my case it is now the GPU, but it really depends on the complexity of the content, since the GPU is used for both conversion and compression - from the conversion point of view the content doesn't matter, but the compression complexity is hugely affected by it.

slm-sean commented 4 days ago

Fantastic, thanks Martin. When I get a moment today, I will test this and report back. I have a fairly intensive test involving heavy film grain that we have been using, which is what exposed this bottleneck originally.

Cheers, Sean

slm-sean commented 4 days ago

Hi Martin,

Managed to get a quick test in. The problem persists, but it looks to have improved, as the rate of the warnings has dropped significantly.

(screenshot: warning output)

I noticed that dropping the quality setting of the cmpto_j2k encoder from the default 0.7 to 0.6 eliminated the warnings.

(screenshot: GPU utilization)

GPU utilization has dropped from 100% down to 80% with the default quality setting.

Since I am getting the full frame rate on the encode side, is it safe to assume that it is working as expected, but potentially increasing the overall latency of the stream? I noticed that in normal operation, before any of the work we have done here, 4:4:4 12-bit streams have an additional 3-4 frames of latency compared to 4:2:2 10- or 12-bit streams.

I will continue testing with a different GPU when one becomes available.

slm-sean commented 4 days ago

Sorry, I forgot to also include the verbose grep results:

GPU: (screenshot of grep output)

CPU: (screenshot of grep output)

It appears that the GPU conversion has latency spikes, but is overall much lower than the CPU conversion.

MartinPulec commented 3 days ago

> It appears that the GPU conversion has latency spikes, but is overall much lower than the CPU conversion.

Yes, I've noticed this when testing with complex content as well, e.g. -t testcard:codec=R12L:pattern=noise:mode=dci4. It is a bit unfortunate, because the GPU conversion is otherwise an order of magnitude faster than the CPU one, but the spikes would make it unusable, at least as the default option.

Please hold on a few more days; I have yet another idea for improving it, but I don't know for sure whether it will fix that.

slm-sean commented 3 days ago

Sounds good. I noticed that reducing the quality setting of the cmpto_j2k encoder seems to eliminate those spikes. Is it possible that the latency spikes have more to do with process scheduling on the GPU when it is under load? I can pretty much eliminate the spikes by reducing the quality from the default 0.7 down to 0.5, which lowers the utilization of the GPU drastically (from 85% down to 50-60%).

There are probably spikes in GPU utilization that my monitoring application isn't polling fast enough to report.
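If I get a chance, I could also capture a trace with Nsight Systems (e.g. nsys profile uv <args>) to look at per-kernel timings rather than sampled utilization.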

MartinPulec commented 2 days ago

Okay, I have another version to test. It seems to have mostly eliminated the spikes, if you want to try it out.

When using the debug timer, the conversion time is 0.6 ms most of the time, while for roughly one frame in 1000 it takes 7 ms. On the other hand, the CPU conversion always takes approximately that same 7 ms...

I haven't decided yet whether to keep the accelerated version as the default or not - it is true that although it takes just a small portion of GPU performance, that may be significant in cases like yours where GPU utilization is peaking near 100%. On the other hand, there are the opposite scenarios - it saves a significant amount of CPU power and also PCIe bandwidth (the R12L data is more compact than the converted data).

As you write, the GPU load really matters. The quality setting does indeed seem to have an impact on the compression performance (and thus GPU load), as do the bitrate and the content complexity.

MartinPulec commented 2 days ago

To answer your previous question/remark:

> I noticed that in normal operation, before any of the work we have done here, 4:4:4 12-bit streams have an additional 3-4 frames of latency compared to 4:2:2 10- or 12-bit streams.

Seems to be true; evaluating the encoder with the following command:

uv -t testcard:pattern=noise:mode=dci4 -c cmpto_j2k:rate=10M -VV    # 8-bit 4:2:2

settles down at around 27 ms for me. "Compressed frame [...] duration:" in the output indicates the latency.

Adding :c=R12L (12-bit 4:4:4) to the testcard increases the latency to around 207 ms. Just FYI, using c=RGB (8-bit 4:4:4) yields around 80 ms.

I think this isn't much related to the above (conversions) anyway (for RGB there is no conversion at all). I think it is rather related to the codec configuration. If you want to get this solved, feel free to open a new issue.