ARM-software / armnn

Arm NN ML Software. The code here is a read-only mirror of https://review.mlplatform.org/admin/repos/ml/armnn
https://developer.arm.com/products/processors/machine-learning/arm-nn

Model runs slower with ARM-NN than with XNNPACK on Cortex A53 #784

Open · answerdon opened this issue 3 weeks ago

answerdon commented 3 weeks ago

I have experimented with multiple models using Arm NN on a Cortex-A53 (mostly int8-quantized models with latency < 200 ms), and I found that XNNPACK generally gives better latency than Arm NN. So I am trying to understand what kind of model can perform better with Arm NN.

For example, I compared the results using the MobileNet v2 int8 model downloaded from the Arm ML-Zoo: https://github.com/ARM-software/ML-zoo/tree/master/models/image_classification/mobilenet_v2_1.0_224/tflite_int8

./benchmark_model --graph=./mobilenet_v2_1.0_224_INT8.tflite --external_delegate_path=./libarmnnDelegate.so --external_delegate_options="backends:CpuAcc;disable-tflite-runtime-fallback:true;number-of-threads:1"
Log parameter values verbosely: [0]
Graph: [./mobilenet_v2_1.0_224_INT8.tflite]
External delegate path: [./libarmnnDelegate.so]
External delegate options: [backends:CpuAcc,CpuRef;disable-tflite-runtime-fallback:true;number-of-threads:1]
Loaded model ./mobilenet_v2_1.0_224_INT8.tflite
INFO: Initialized TensorFlow Lite runtime.
Couldn't find any of the following OpenCL library: libOpenCL.so libGLES_mali.so libmali.so 
INFO: TfLiteArmnnDelegate: Added backend CpuAcc
INFO: TfLiteArmnnDelegate: Created TfLite ArmNN delegate.
EXTERNAL delegate created.
VERBOSE: Replacing 66 node(s) with delegate (TfLiteArmNnDelegate) node, yielding 1 partitions for the whole graph.
Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 4.02094
Initialized session in 287.252ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=2 first=468655 curr=159104 min=159104 max=468655 avg=313880 std=154775

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=177554 curr=131598 min=131528 max=177554 avg=134539 std=6398

Inference timings in us: Init: 287252, First inference: 468655, Warmup (avg): 313880, Inference (avg): 134539
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=67.3633 overall=77.9492
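
For context, the options string above pins CpuAcc to a single thread. On a typical quad-core Cortex-A53 it may also be worth sweeping the thread count and fast-math; a hypothetical variant of the same invocation (enable-fast-math is an Arm NN delegate option in recent releases, not verified on this build):

./benchmark_model --graph=./mobilenet_v2_1.0_224_INT8.tflite --external_delegate_path=./libarmnnDelegate.so --external_delegate_options="backends:CpuAcc;disable-tflite-runtime-fallback:true;number-of-threads:4;enable-fast-math:true"

The XNNPACK baseline for comparison: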
./benchmark_model --graph=./mobilenet_v2_1.0_224_INT8.tflite --num_threads=1
INFO: Initialized TensorFlow Lite runtime.
INFO: Applying 1 TensorFlow Lite delegate(s) lazily.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
VERBOSE: Replacing 64 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 4 partitions for the whole graph.
INFO: Successfully applied the default TensorFlow Lite delegate indexed at 0.
Num threads: [1]
Graph: [./mobilenet_v2_1.0_224_INT8.tflite]
Enable op profiling: [0]
#threads used for CPU inference: [1]
Loaded model mobilenet_v2_1.0_224_INT8.tflite
The input model file size (MB): 4.02094
Initialized session in 108.149ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=4 first=158233 curr=138142 min=138142 max=158233 avg=148234 std=7143

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=120254 curr=119630 min=119404 max=123512 avg=119935 std=722

Inference timings in us: Init: 108149, First inference: 158233, Warmup (avg): 148234, Inference (avg): 119935
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=9.45312 overall=13.9961
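
One way to narrow down where the Arm NN delegate loses time relative to XNNPACK is per-op profiling, which the benchmark tool already supports (the "Enable op profiling: [0]" line above). A sketch of both runs with profiling on, untested here:

./benchmark_model --graph=./mobilenet_v2_1.0_224_INT8.tflite --external_delegate_path=./libarmnnDelegate.so --external_delegate_options="backends:CpuAcc;number-of-threads:1" --enable_op_profiling=true
./benchmark_model --graph=./mobilenet_v2_1.0_224_INT8.tflite --num_threads=1 --enable_op_profiling=true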
Colm-in-Arm commented 3 weeks ago

Hello answerdon

There are many factors that affect execution time, and there will be cases where Arm NN does not provide improved performance. Can I suggest you try the evaluate_network.sh script in armnn/tests/ExecuteNetwork/? It runs ExecuteNetwork and the TfLite delegate with different parameter combinations to help you choose parameters that might improve performance.

Colm.
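
For illustration, evaluate_network.sh drives ExecuteNetwork over combinations of such options; a minimal manual run might look like the following (flag names assumed from recent Arm NN releases rather than taken from this thread; confirm with ExecuteNetwork --help):

./ExecuteNetwork -m ./mobilenet_v2_1.0_224_INT8.tflite -c CpuAcc --number-of-threads 1 --iterations 50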