ARM-software / armnn

Arm NN ML Software. The code here is a read-only mirror of https://review.mlplatform.org/admin/repos/ml/armnn
https://developer.arm.com/products/processors/machine-learning/arm-nn
MIT License

Arm NN v22.02-based library takes more processing time than the v21.05-based library #635

Closed supratimc239 closed 11 months ago

supratimc239 commented 2 years ago

Hi,

Recently I upgraded my Arm NN and Compute Library to v22.02 and, as per the v22.02 release notes, upgraded TensorFlow to v2.5.0. But when I execute my fp32 TF-Lite model (using the ExecutionNetwork framework) I see an increase in processing time, with or without applying task affinity. I am using the CpuAcc backend on a device running Android 12. The Arm NN libraries are built with Android NDK r20b.


|   | v21.05 | v22.02 | Increase % |
| -- | -- | -- | -- |
| Small | 878.29 | 1295.56 | 47.50936479 |
| Medium | 337.87 | 390.9 | 15.6953858 |
| Big | 234.64 | 283.72 | 20.91714968 |
| No Affinity | 245.72 | 317.33 | 29.14292691 |

Apart from upgrading Arm NN, the Compute Library and TensorFlow, I also had to make changes to resolve the runtime error "terminating with uncaught exception of type armnn::LayerValidationException: Unspecified dimension while using ShapeInferenceMethod::ValidateOnly". I followed the suggestion from https://githubhot.com/repo/ARM-software/armnn/issues/619 and made the following changes:

```cpp
armnnTfLiteParser::ITfLiteParser::TfLiteParserOptions parserOption;
parserOption.m_InferAndValidate = true;
auto parser(armnnTfLiteParser::ITfLiteParser::Create(parserOption));
```

```cpp
armnn::OptimizerOptions optOptions;
optOptions.m_shapeInferenceMethod = armnn::ShapeInferenceMethod::InferAndValidate;
armnn::IOptimizedNetworkPtr optNet{nullptr, [](armnn::IOptimizedNetwork*){}};
optNet = armnn::Optimize(network, {armnn::Compute::CpuAcc}, m_Runtime->GetDeviceSpec(), optOptions);
```

If the above change cannot be the reason, could you please let me know what else might cause this increase in processing time?

Thanks

supratimc239 commented 2 years ago

Hi,

Today I did further analysis after enabling profiling; here are a few observations:

  1. There are 4 extra NeonConstantWorkload_Execute layers under "Execute". These layers don't impact the processing time much, but it would be good to know why they were added in v22.02.
  2. I can see there is a new "GUID" field in the profiling output that was not present in v21.05. Could you please tell me what it is used for?
  3. There is a considerable difference in per-kernel processing time between v21.05 and v22.02, which could be the reason behind the increase in overall processing time. Please find below some of the deltas (for the case where we don't force CPU core affinity):

| Processing time in us | v22.02 | v21.05 | % increase |
| -- | -- | -- | -- |
| CpuIm2ColKernel | 121.688 | 100.2639 | 21.37% |
| CpuGemmAssemblyWrapperKernel | 81.69389 | 70.7343 | 15.49% |
| CpuPool2dAssemblyWrapperKernel | 17.46845 | 15.49872 | 12.71% |
| CpuWinogradConv2dTransformInputKernel | 12.49053 | 8.720859 | 43.23% |
| CpuGemmAssemblyWrapperKernel | 56.68998 | 39.18841 | 44.66% |
| CpuWinogradConv2dTransformOutputKernel | 9.268111 | 7.09798 | 30.57% |

  4. Though I have provided only the first few kernel deltas, the increase in processing time appears to be consistent across the kernels.

Please find below the Compute Library and Arm NN build commands:

```shell
scons arch=arm64-v8a neon=1 opencl=1 embed_kernels=1 extra_cxx_flags="-fPIC" \
  benchmark_tests=0 validation_tests=0 os=android -j16
```

```shell
CXX=aarch64-linux-android-clang++ \
CC=aarch64-linux-android-clang \
CXX_FLAGS="-fPIE -fPIC" \
cmake .. \
  -DCMAKE_ANDROID_NDK=$NDK \
  -DCMAKE_SYSTEM_NAME=Android \
  -DCMAKE_SYSTEM_VERSION=29 \
  -DCMAKE_ANDROID_ARCH_ABI=arm64-v8a \
  -DCMAKE_EXE_LINKER_FLAGS="-pie -llog -lz" \
  -DARMCOMPUTE_ROOT=$HOME/vad/vad-armnn/ComputeLibrary/ \
  -DARMCOMPUTE_BUILD_DIR=$HOME/vad/vad-armnn/ComputeLibrary/build \
  -DBUILD_TF_LITE_PARSER=1 \
  -DTENSORFLOW_ROOT=$HOME/vad/vad-armnn/vad-google-packages/tensorflow \
  -DTF_LITE_GENERATED_PATH=$HOME/vad/vad-armnn/tflite \
  -DFLATBUFFERS_ROOT=$HOME/vad/vad-armnn/flatbuffers-arm64 \
  -DFLATC_DIR=$HOME/vad/vad-armnn/flatbuffers-1.12.0/build/ \
  -DARMCOMPUTENEON=1 -DARMCOMPUTECL=0 -DARMNNREF=1 \
  -DBUILD_TESTS=1
```

Also, these are the options I am using while creating the network in my source code:

```cpp
armnnTfLiteParser::ITfLiteParser::TfLiteParserOptions parserOption;
parserOption.m_InferAndValidate = true;
auto parser(armnnTfLiteParser::ITfLiteParser::Create(parserOption));
```

```cpp
auto backendOptions = armnn::BackendOptions{"CpuAcc",
    {
        {"TuningLevel", 0},
        {"TuningFile", "/data/local/tmp/tests/models/tuning.cfg"}
    }
};
options.m_BackendOptions.emplace_back(backendOptions);
```

```cpp
armnn::OptimizerOptions optOptions;
optOptions.m_shapeInferenceMethod = armnn::ShapeInferenceMethod::InferAndValidate;
```

```cpp
armnn::IOptimizedNetworkPtr optNet{nullptr, [](armnn::IOptimizedNetwork*){}};
optNet = armnn::Optimize(network, {armnn::Compute::CpuAcc}, m_Runtime->GetDeviceSpec(), optOptions);
```

```cpp
std::string ignoredErrorMessage;
armnn::INetworkProperties networkProperties(false, armnn::MemorySource::Malloc, armnn::MemorySource::Malloc);
```

Question:

  1. Am I missing some compilation option, or some option in the source code, needed to at least match v21.05?
  2. At your end, do you always see v22.02 performing faster than v21.05?

Thanks

supratimc239 commented 2 years ago

Hi Folks,

Do you have any suggestions on how to make Arm NN v22.02 faster than v21.05? I would like to use the latest version of Arm NN and the Compute Library for my work, but unfortunately, due to the high processing time, I have to continue using v21.05.

Any help will be appreciated.

Thanks

MatthewARM commented 2 years ago

@morgolock?

morgolock commented 2 years ago

Hi @supratimc239

It would be good if you could provide more information:

| Processing time in us | v22.02 | v21.05 | % increase |
| -- | -- | -- | -- |
| CpuIm2ColKernel | 121.688 | 100.2639 | 21.37% |
| CpuGemmAssemblyWrapperKernel | 81.69389 | 70.7343 | 15.49% |
| CpuPool2dAssemblyWrapperKernel | 17.46845 | 15.49872 | 12.71% |
| CpuWinogradConv2dTransformInputKernel | 12.49053 | 8.720859 | 43.23% |
| CpuGemmAssemblyWrapperKernel | 56.68998 | 39.18841 | 44.66% |
| CpuWinogradConv2dTransformOutputKernel | 9.268111 | 7.09798 | 30.57% |

We test each release to make sure there are no performance regressions, and we have not spotted any regressions in these assembly kernels. To investigate this, it's important that you share the additional information mentioned above.

When reading the profiler information, please make sure you ignore the first iteration, which is always slower than the others due to the initial startup overhead. Do you see the same slowdown in all the other iterations?
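The measurement discipline described above can be sketched as a small timing helper; this is an illustrative harness, not part of Arm NN's profiler:

```cpp
// Sketch: time every iteration, but average only iterations 1..N-1,
// dropping the first run, which pays the one-off startup cost
// (allocation, kernel selection, cold caches).
#include <chrono>
#include <functional>
#include <numeric>
#include <vector>

// Requires iterations >= 2 so that at least one timed sample remains.
static double MeanMicrosExcludingFirst(const std::function<void()>& run, int iterations)
{
    std::vector<double> samplesUs;
    for (int i = 0; i < iterations; ++i)
    {
        auto t0 = std::chrono::steady_clock::now();
        run();
        auto t1 = std::chrono::steady_clock::now();
        samplesUs.push_back(std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    // Discard sample 0 (first-iteration overhead), average the rest.
    return std::accumulate(samplesUs.begin() + 1, samplesUs.end(), 0.0)
           / static_cast<double>(samplesUs.size() - 1);
}
```

In practice `run` would wrap `EnqueueWorkload`/inference; here it can be any callable.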

Hope this helps.

MatthewARM commented 2 years ago

I've just checked the internal performance tracking tests within Arm, we did see a small regression in some test cases from 21.05 to 21.08, and those were all fixed in 21.11. We haven't spotted any regressions from 21.11 to 22.02.

What data type(s) does the model use?

supratimc239 commented 2 years ago

Hi @morgolock and @MatthewARM Please find below answers to your questions:

  1. I am using a proprietary model and unfortunately cannot share it for confidentiality reasons.
  2. Please find below the CPU information (all cores report CPU implementer 0x41, CPU architecture 8, BogoMIPS 51.20):

| processor | CPU part | CPU variant | CPU revision |
| -- | -- | -- | -- |
| 0 | Cortex-A510 | 0x0 | 2 |
| 1 | Cortex-A510 | 0x0 | 2 |
| 2 | Cortex-A510 | 0x0 | 2 |
| 3 | Cortex-A510 | 0x0 | 2 |
| 4 | Cortex-A710 | 0x2 | 0 |
| 5 | Cortex-A710 | 0x2 | 0 |
| 6 | Cortex-A710 | 0x2 | 0 |
| 7 | Cortex-X2 | 0x2 | 0 |
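As an aside, the big/LITTLE split can be recovered programmatically from `/proc/cpuinfo` text like the dump above; a hedged sketch (note that a stock kernel prints hex part IDs such as 0xd46 rather than "Cortex-…" names, which would need a lookup table — this follows the dump as pasted):

```cpp
// Sketch: parse "processor" / "CPU part" pairs and collect the indices of
// the cores that are not LITTLE Cortex-A510 cores (i.e. A710/X2 here).
#include <sstream>
#include <string>
#include <vector>

static std::vector<int> BigCores(const std::string& cpuinfo)
{
    std::istringstream in(cpuinfo);
    std::string line;
    std::vector<int> big;
    int core = -1;
    while (std::getline(in, line))
    {
        if (line.rfind("processor", 0) == 0)
        {
            // "processor : 4" -> 4 (stoi skips the leading whitespace)
            core = std::stoi(line.substr(line.find(':') + 1));
        }
        else if (line.rfind("CPU part", 0) == 0 &&
                 line.find("Cortex-A510") == std::string::npos)
        {
            big.push_back(core);  // anything that is not a LITTLE core
        }
    }
    return big;
}
```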

  3. Android version: 12
  4. For both input and output, the float32 data type is used.

Please have a look at the cmake options I am using and see if I am doing anything wrong. Model performance on Arm NN v22.02 is consistently poor in comparison to v21.05.

Thanks

MikeJKelly commented 11 months ago

Closing as we're not seeing any significant regressions in recent releases (we're now on 23.08) and cannot do much without knowing the layers and data types that are showing regressions. If you need help, please open a new issue.