Keylost / jetson-ffmpeg

ffmpeg support on jetson nano

WITH_NVUTILS(maybe?) uses 20% more CPU #11

Closed madsciencetist closed 1 year ago

madsciencetist commented 1 year ago

I have two test setups, both with the same nvpmodel clock speeds:

  1. Xavier AGX with Jetpack 4.6.1 (WITH_NVUTILS not defined)
  2. Xavier NX with Jetpack 5.0.2 (WITH_NVUTILS defined)

Unfortunately I can't fill out the test matrix further because it is not easy for me to reflash either device, and ffmpeg crashes on the NX when I manually disable WITH_NVUTILS, which is its own bug. So I'm not sure whether the following differences are due to hardware, the OS, or the use of nvutils, and I'm hoping someone else can confirm or deny some of these numbers.

First, as a control to rule out CPU speed differences, I use the software decoder and show that the NX does not use more CPU:
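Something along these lines isolates the software decoder (input.mp4 is a placeholder for the actual test clip; -benchmark makes ffmpeg print CPU time at the end):

    # CPU-only decode, output discarded; compare the reported
    # user/system CPU time across the two boards
    ffmpeg -benchmark -c:v h264 -i input.mp4 -f null -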

Switching to the NVMPI decoder, we see that the NX is using substantially more CPU:
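The hardware run differs only in the decoder selected (h264_nvmpi is the decoder this repo registers; the input file is again a placeholder):

    # Decode on NVDEC through the nvmpi wrapper instead of the CPU
    ffmpeg -benchmark -c:v h264_nvmpi -i input.mp4 -f null -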

This 1.7x difference grows to 5x as we speed up the other steps in the pipeline.

Does NvUtils use 5x the CPU of nvbuf_utils? Does Jetpack 5.0 use 5x the CPU of Jetpack 4.6? Profiling both did not give me any clear answers. Is anyone able to test Jetpack 5.0 with nvbuf_utils? I'm trying to figure out whether this is a me problem, an nvmpi problem, or an NVIDIA problem.

bmegli commented 1 year ago

First, as a control, to rule out CPU speed differences

Be careful when measuring CPU usage: top, htop and similar tools report load relative to the current CPU clock.

To compare different loads you always have to fix the CPU clock speed.

The easiest way on Jetson is jetson_clocks.


This is especially important when comparing low loads, where the CPU governor may lower the clock speed, which inflates the reported load.
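For reference, a minimal sequence with the stock Jetpack tooling:

    sudo jetson_clocks --show   # inspect current CPU/GPU/EMC frequencies
    sudo jetson_clocks          # pin clocks to the maximum of the active nvpmodel profile
    sudo jetson_clocks --show   # confirm the clocks are now fixed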

madsciencetist commented 1 year ago

@bmegli yes, I tweaked the nvpmodels to fix the clock speeds equally
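For anyone following along, the power-mode side of this is driven by nvpmodel (the per-mode clock caps live in /etc/nvpmodel.conf on stock Jetpack):

    sudo nvpmodel -q    # query the active power mode
    sudo nvpmodel -m 0  # select a mode; edit /etc/nvpmodel.conf to adjust its clock caps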

madsciencetist commented 1 year ago

Turns out my NVDEC and EMC clocks were still running at different speeds, and my NX had a background process using the memory bus somewhat heavily. After fixing all that, the 1.7x and 5x differences reduced to 1.2x and 2x differences.
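For anyone retracing this, the clocks and memory-bus traffic can be sanity-checked with the stock tools (the debugfs path is from memory and varies across Jetpack releases):

    sudo jetson_clocks --show                      # CPU/GPU/EMC frequencies and active power mode
    sudo tegrastats                                # the EMC_FREQ field shows memory-bus load, handy for spotting background traffic
    sudo cat /sys/kernel/debug/clk/nvdec/clk_rate  # NVDEC clock via debugfs (path may differ per release)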

Profilers and timers insist that, while the differences are still substantial, they come more from non-NVIDIA functions like memcpy than from NVIDIA functions. And if something is slowing down the memory bus, it's not surprising that it would affect the NVIDIA functions too.
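That attribution came from ordinary sampling profilers; a run of roughly this shape (assuming perf from linux-tools is installed on the board) gives the same view:

    # Sample the whole decode run with call graphs, then compare time
    # spent in memcpy against time spent inside the NVIDIA libraries
    sudo perf record -g -- ffmpeg -c:v h264_nvmpi -i input.mp4 -f null -
    sudo perf report --sort symbol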

So I'm ready to say that this is more likely a HW or OS issue than anything in nvmpi.

bmegli commented 1 year ago

@madsciencetist

@bmegli yes, I tweaked the nvpmodels to fix the clock speeds equally

That's even more than I meant, but probably even better.


In general, measuring CPU/GPU usage requires fixing the clock speed (typically to max) so that the scaling governor doesn't change the CPU/GPU frequency, since reported CPU/GPU usage is relative to the running frequency.
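As a concrete example: a workload that needs a fixed 600 MHz-worth of cycles per second shows up as 50% load with the CPU pinned at 1.2 GHz, but as 100% if the governor has dropped the clock to 600 MHz, even though the actual work done is identical.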

Without that, the reported load figures are not directly comparable between runs.

The same is true for Windows and Task Manager.