ARM-software / armnn

Arm NN ML Software. The code here is a read-only mirror of https://review.mlplatform.org/admin/repos/ml/armnn
https://developer.arm.com/products/processors/machine-learning/arm-nn
MIT License

ArmNN v22.02 GpuAcc backend slower than CpuAcc with Mali-G31 #633

Closed: JavaBatista closed this issue 1 year ago

JavaBatista commented 2 years ago

I was running the Python example in the TfLite Delegate Quick Start Guide with different backends and noticed that the example runs slower with the GpuAcc backend than with the CpuAcc backend.
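
For reference, I load the delegate roughly as in the guide, changing only the "backends" option between runs. A sketch of my setup; the library path and model file below are placeholders:

```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Load the Arm NN delegate; the "backends" option selects CpuAcc or GpuAcc.
# Library path and model file are placeholders for my setup.
armnn_delegate = tflite.load_delegate(
    library="libarmnnDelegate.so",
    options={"backends": "GpuAcc", "logging-severity": "info"})

interpreter = tflite.Interpreter(
    model_path="mobilenet_v1_1.0_224_quant.tflite",
    experimental_delegates=[armnn_delegate])
interpreter.allocate_tensors()

# Single inference on random input, as in the quick start example
# (scaled to 0..255 for a quantized uint8 model).
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
input_data = np.array(
    np.random.random_sample(input_details[0]["shape"]) * 255,
    dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]["index"]))
```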

With the CpuAcc backend I get the following output:

Info: ArmNN v28.0.0
Info: Initialization time: 25.27 ms.
INFO: TfLiteArmnnDelegate: Created TfLite ArmNN delegate.
Warning: The backend makes use of a deprecated interface to read constant tensors. If you are a backend developer please find more information in our doxygen documentation on github https://github.com/ARM-software/armnn under the keyword 'ConstTensorsAsInputs'.
Info: ArmnnSubgraph creation
Info: Parse nodes to ArmNN time: 0.44 ms
Warning: The backend makes use of a deprecated interface to read constant tensors. If you are a backend developer please find more information in our doxygen documentation on github https://github.com/ARM-software/armnn under the keyword 'ConstTensorsAsInputs'.
Info: Optimize ArmnnSubgraph time: 2.63 ms
Info: Load ArmnnSubgraph time: 10.07 ms
Info: Overall ArmnnSubgraph creation time: 13.88 ms

Info: Execution time: 5.07 ms.
[[ 12 123  16  12  11  14  20  16  20  12]]
Info: Shutdown time: 3.83 ms.

With the GpuAcc backend I get:

Info: ArmNN v28.0.0
Info: Initialization time: 24.48 ms.
INFO: TfLiteArmnnDelegate: Created TfLite ArmNN delegate.
Warning: The backend makes use of a deprecated interface to read constant tensors. If you are a backend developer please find more information in our doxygen documentation on github https://github.com/ARM-software/armnn under the keyword 'ConstTensorsAsInputs'.
Info: ArmnnSubgraph creation
Info: Parse nodes to ArmNN time: 0.44 ms
Warning: The backend makes use of a deprecated interface to read constant tensors. If you are a backend developer please find more information in our doxygen documentation on github https://github.com/ARM-software/armnn under the keyword 'ConstTensorsAsInputs'.
Info: Optimize ArmnnSubgraph time: 3.64 ms
Info: Load ArmnnSubgraph time: 13859.76 ms
Info: Overall ArmnnSubgraph creation time: 13864.53 ms

Info: Execution time: 83.48 ms.
[[ 12 122  15  12  10  13  20  15  20  12]]
Info: Shutdown time: 62.97 ms.

The Load ArmnnSubgraph time with GpuAcc takes much longer. Is this normal behavior or do I have a problem?

MatthewARM commented 2 years ago

Hi @JavaBatista, the GpuAcc backend uses the Arm Compute Library, which runs on the GPU through OpenCL. The OpenCL kernels used by the NN model have to be compiled for your specific GPU, which is why "Load ArmnnSubgraph" takes a lot longer than on the CPU.
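
If that one-off compile time is a concern, the delegate has options to save the compiled network and reload it on later runs. A sketch; the option names here are from memory, so treat them as assumptions and check the delegate documentation for your Arm NN release:

```python
import tflite_runtime.interpreter as tflite

# Sketch: persist the compiled GPU network so later process launches can
# skip OpenCL kernel compilation. Option names are assumptions to verify
# against the delegate documentation for your Arm NN version.
armnn_delegate = tflite.load_delegate(
    library="libarmnnDelegate.so",
    options={
        "backends": "GpuAcc",
        # First run: compile and write the cache file.
        # Later runs: load the compiled network from it.
        "save-cached-network": "true",
        "cached-network-filepath": "/tmp/armnn_cached_network.bin",
    })
```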

In your case you're also seeing a much higher "Execution time" on GPU. There could be two reasons for this. 1) The example only performs a single inference; a real application would normally perform many, so the one-off warm-up costs are amortized and the GPU caches stay warm. 2) It's very likely that your CPU is more powerful than your GPU for ML: the Arm Mali-G31 is an ultra-efficient GPU for cost-constrained devices. You don't mention which CPU you are using, but with something like 2 or 4 Cortex-A55 cores you can reasonably expect ML to run faster on the CPU.
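
You can see the warm-up effect by timing several consecutive invocations rather than a single one. A minimal sketch, assuming an interpreter already created and fed with input as in the quick start example:

```python
import time

# The first invocation typically carries one-off warm-up costs;
# later invocations reflect the steady-state GPU execution time.
for i in range(10):
    start = time.perf_counter()
    interpreter.invoke()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    print(f"inference {i}: {elapsed_ms:.2f} ms")
```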

JavaBatista commented 2 years ago

Thanks @MatthewARM your reply cleared some of my doubts.

I'm using a quad-core Cortex-A35 CPU. In my case the higher execution time is acceptable; I just need to offload the CPU while performing inference. But CPU usage is higher with the GpuAcc backend, mainly during the Load ArmnnSubgraph phase. I will look into OpenCL as you mentioned.

Best Regards, Javã

catcor01 commented 1 year ago

Hello @JavaBatista,

I'm closing this due to inactivity, and I think your problem has been solved. If you still need help with this, please create a new ticket.

Kind Regards, Cathal.