ARM-software / armnn

Arm NN ML Software. The code here is a read-only mirror of https://review.mlplatform.org/admin/repos/ml/armnn
https://developer.arm.com/products/processors/machine-learning/arm-nn
MIT License

Profile execution time on NPU #668

Closed Rahn80643 closed 1 year ago

Rahn80643 commented 1 year ago

Hi,

I'm trying to evaluate the execution speed of Arm NN inference on an NPU. I added the following ProfilerManager snippet, based on ProfilerTests.cpp:

// The network was optimized for the Ethos-N backend by passing
// armnn::EthosNBackendId() to armnn::Optimize.
chrono_time_start = std::chrono::high_resolution_clock::now();

// Register a profiler and enable event profiling before running the workload
armnn::ProfilerManager& profilerManager = armnn::ProfilerManager::GetInstance();
std::unique_ptr<armnn::IProfiler> profiler = std::make_unique<armnn::IProfiler>();
profilerManager.RegisterProfiler(profiler.get());
profiler->EnableProfiling(true);

runtime->EnqueueWorkload(networkId, nnInputTensor, nnOutputTensor);

output = boost::get<std::vector<int>>(outputDataContainers[0]);
profiler->Print(std::cout);
profiler->EnableProfiling(false);

chrono_time_end = std::chrono::high_resolution_clock::now();

For comparison, I also used std::chrono::high_resolution_clock::now() to measure the elapsed time. I expected the two values to differ, but the execution times reported by the Arm NN profiler and by std::chrono are close:

from profiler: 38.12 ms, from chrono: 38.576 ms

[Attached screenshot: profiler output and kernel log timestamps]

My understanding is that std::chrono measures time from the CPU side, so it may not capture the execution time on the NPU itself, whereas the Arm NN profiler could be used to measure execution time on the NPU or GPU.

I'd like to ask: are the times measured above reasonable for an NPU? Are there other functions that can be used to evaluate NPU performance?

Best Regards, Rahn

MatthewARM commented 1 year ago

6010440 clock cycles / 0.038 s indicates 158,169,473 cycles/sec, i.e. roughly 158 MHz. Is that the correct clock speed of the NPU in your system?

MatthewARM commented 1 year ago

Hmm, from the kernel messages, 1829.964914 - 1829.957455 = 0.007459 s, i.e. 7.5 ms, is the actual length of the inference on the NPU. And 6010440 cycles / 0.0075 s = 801,392,000, or roughly 800 MHz, which sounds more reasonable.

Also from the kernel messages, 17 ms (1829.957455 - 1829.940628) was spent resetting the NPU (could this be due to waking it up from a low-power state?)

I would suggest running several inferences in quick succession, as less time should be spent "waking things up" after the first inference.

The Arm NN "event" profiler just measures elapsed wall-clock time, the same as std::chrono. @eleanorbonnici-arm, are you able to help further?

eleanor-arm commented 1 year ago

Thank you for your question. The recommended clock frequency for the NPU hardware is 1 GHz, so 800 MHz looks close to real hardware performance. Depending on the platform you're running the inference on, this may be a meaningful number.

Let us know if that helps.

keidav01 commented 1 year ago

Hi @Rahn80643, do you require further assistance? Otherwise I will close this ticket. Thank you