Keylost / jetson-ffmpeg

ffmpeg support on jetson nano

WITH_NVUTILS(maybe?) uses 20% more CPU #11

Closed madsciencetist closed 1 year ago

madsciencetist commented 1 year ago

I have two test setups, both with the same nvpmodel clock speeds:

  1. Xavier AGX with Jetpack 4.6.1 (WITH_NVUTILS not defined)
  2. Xavier NX with Jetpack 5.0.2 (WITH_NVUTILS defined)

Unfortunately I can't fill out the test matrix further because it is not easy for me to reflash either device, and ffmpeg crashes on the NX when I manually disable WITH_NVUTILS, which is its own bug. So I'm not sure whether the following differences are due to hardware, the OS, or the use of nvutils, and I'm hoping someone else can confirm or deny some of these numbers.

First, as a control to rule out CPU speed differences, I use the software decoder and show that the NX does not use more CPU:
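Something along these lines isolates the software decoder (input.mp4 is a placeholder for the actual test clip; -benchmark makes ffmpeg print CPU time at the end):

    # CPU-only decode, output discarded; compare the reported
    # user/system CPU time across the two boards
    ffmpeg -benchmark -c:v h264 -i input.mp4 -f null -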

Switching to the NVMPI decoder, we see that the NX is using substantially more CPU:
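The hardware run differs only in the decoder selected (h264_nvmpi is the decoder this repo registers; the input file is again a placeholder):

    # Decode on NVDEC through the nvmpi wrapper instead of the CPU
    ffmpeg -benchmark -c:v h264_nvmpi -i input.mp4 -f null -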

This 1.7x difference grows to 5x as we speed up the other steps in the pipeline.

Does NvUtils use 5x the CPU of nvbuf_utils? Does Jetpack 5.0 use 5x the CPU of Jetpack 4.6? Profiling both did not give me any clear answers. Is anyone able to test Jetpack 5.0 with nvbuf_utils? I'm trying to figure out whether this is a me problem, an nvmpi problem, or an NVIDIA problem.

bmegli commented 1 year ago

First, as a control, to rule out CPU speed differences

Be careful when measuring CPU usage: top, htop and similar tools report load relative to the current CPU clock.

To compare different loads you always have to fix the CPU clock speed.

The easiest way on Jetson is jetson_clocks.


This is especially important when comparing low loads, where the CPU governor may lower the clock speed, which inflates the reported load.
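For reference, a minimal sequence with the stock Jetpack tooling:

    sudo jetson_clocks --show   # inspect current CPU/GPU/EMC frequencies
    sudo jetson_clocks          # pin clocks to the maximum of the active nvpmodel profile
    sudo jetson_clocks --show   # confirm the clocks are now fixed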

madsciencetist commented 1 year ago

@bmegli yes, I tweaked the nvpmodels to fix the clock speeds equally
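For anyone following along, the power-mode side of this is driven by nvpmodel (the per-mode clock caps live in /etc/nvpmodel.conf on stock Jetpack):

    sudo nvpmodel -q    # query the active power mode
    sudo nvpmodel -m 0  # select a mode; edit /etc/nvpmodel.conf to adjust its clock caps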

madsciencetist commented 1 year ago

Turns out my NVDEC and EMC clocks were still running at different speeds, and my NX had a background process using the memory bus somewhat heavily. After fixing all that, the 1.7x and 5x differences reduced to 1.2x and 2x differences.
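For anyone retracing this, the clocks and memory-bus traffic can be sanity-checked with the stock tools (the debugfs path is from memory and varies across Jetpack releases):

    sudo jetson_clocks --show                      # CPU/GPU/EMC frequencies and active power mode
    sudo tegrastats                                # the EMC_FREQ field shows memory-bus load, handy for spotting background traffic
    sudo cat /sys/kernel/debug/clk/nvdec/clk_rate  # NVDEC clock via debugfs (path may differ per release)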

Profilers and timers insist that, while the differences are still substantial, they come more from non-NVIDIA functions like memcpy than from NVIDIA functions. And if something is slowing down the memory bus, it's not surprising that it would affect the NVIDIA functions too.
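That attribution came from ordinary sampling profilers; a run of roughly this shape (assuming perf from linux-tools is installed on the board) gives the same view:

    # Sample the whole decode run with call graphs, then compare time
    # spent in memcpy against time spent inside the NVIDIA libraries
    sudo perf record -g -- ffmpeg -c:v h264_nvmpi -i input.mp4 -f null -
    sudo perf report --sort symbol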

So I'm ready to say that this is more likely a HW or OS issue than anything in nvmpi.

bmegli commented 1 year ago

@madsciencetist

@bmegli yes, I tweaked the nvpmodels to fix the clock speeds equally

That's even more than I meant, but probably even better.


In general, measuring CPU/GPU usage requires fixing the clock speed (typically to max) so that the scaling governor doesn't change the CPU/GPU frequency, since reported CPU/GPU usage is relative to the running frequency.
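As a concrete example: a workload that needs a fixed 600 MHz-worth of cycles per second shows up as 50% load with the CPU pinned at 1.2 GHz, but as 100% if the governor has dropped the clock to 600 MHz, even though the actual work done is identical.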

Without that, the reported load figures are not directly comparable between runs.

The same is true for Windows and Task Manager.