ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

NEPooling3dLayer performance issue #1107

Open alvoron opened 1 month ago

alvoron commented 1 month ago

Output of 'strings libarm_compute.so | grep arm_compute_version': arm_compute_version=v24.02.1 Build options: {'neon': '1', 'opencl': '0', 'openmp': '0', 'cppthreads': '1', 'arch': 'armv8.6-a', 'Werror': 'false', 'validation_tests': '1', 'os': 'macos', 'build': 'native', 'fixed_format_kernels': '1'} Git hash=b'f2eda6665c12d568e179f5b0e7a24ccdc0ac824d'

Platform: Apple M2 Pro

Operating System: macOS 13.4

Problem description: NEPooling3dLayer shows roughly twice the latency of the reference C++ pooling implementation: 6.5 ms vs 3.5 ms.

Reproducer

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "utils/Utils.h"
#include "tests/SimpleTensor.h"
#include "arm_compute/runtime/Tensor.h"
#include "utils/TypePrinter.h"

#include "tests/Utils.h"
#include "tests/AssetsLibrary.h"
#include "tests/NEON/Accessor.h"

#include <string>
#include <chrono>
#include <iostream>
#include <random>

using namespace std;
using namespace arm_compute;
using namespace arm_compute::test;

int main()
{
  DataLayout dataLayout = DataLayout::NDHWC;
  TensorShape inTensorShape = TensorShape(192, 28, 28, 40, 1);
  TensorShape outTensorShape = inTensorShape;

  Tensor inputt;
  Tensor outputt;
  inputt.allocator()->init(TensorInfo(inTensorShape, 1, DataType::F32, dataLayout));
  outputt.allocator()->init(TensorInfo(outTensorShape, 1, DataType::F32, dataLayout));

  Pooling3dLayerInfo pool3d_info;
  pool3d_info.pool_type       = PoolingType::MAX;
  pool3d_info.exclude_padding = true;
  pool3d_info.pool_size       = arm_compute::Size3D(3, 3, 3);
  pool3d_info.stride          = arm_compute::Size3D(1, 1, 1);
  pool3d_info.padding         = arm_compute::Padding3D(1, 1, 1, 1, 1, 1);
  pool3d_info.round_type      = DimensionRoundingType::FLOOR;

  NEPooling3dLayer pooling;
  pooling.configure(&inputt, &outputt, pool3d_info);

  inputt.allocator()->allocate();
  outputt.allocator()->allocate();

  AssetsLibrary library(".", std::random_device()());
  std::uniform_real_distribution<> distribution{ 0.0f, 10.0f };
  library.fill(Accessor(inputt), distribution, 0);

  // warm-up
  pooling.run();

  std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < 100; i++) pooling.run();
  std::chrono::high_resolution_clock::time_point finish = std::chrono::high_resolution_clock::now();
  uint64_t total_duration = std::chrono::duration_cast<std::chrono::microseconds>(finish - start).count();
  std::cout << "time: " << total_duration / 100 << " us" << std::endl;
}

How the reproducer was built

clang++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include acl_pooling.cpp -o acl_pooling -L./ComputeLibrary/build/ -larm_compute ./ComputeLibrary/build/tests/AssetsLibrary.o ./ComputeLibrary/build/tests/RawTensor.o ./ComputeLibrary/build/tests/framework/Exceptions.o -std=c++17

The reproducer reports ~6500 microseconds on my M2 Pro, which is roughly twice as slow as the reference C++ implementation of pooling.

Could you please review NEPooling3dLayer for potential performance issues?

alvoron commented 1 month ago

I prepared a benchdnn reference reproducer and checked it on Ampere server.

Benchdnn

cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_RULE_MESSAGES=OFF -DONEDNN_CPU_RUNTIME=OMP
cmake --build build --target benchdnn --parallel $(nproc)
./build/tests/benchdnn/benchdnn --mode=P --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=pooling_max --dt=f32:f32 --tag=acdeb  --attr-scratchpad=user mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1

The last benchdnn command gives me min(ms):0.673584 avg(ms):0.787748 on Ampere.

ACL

scons neon=1 opencl=0 openmp=1 os=linux data_layout_support=all arch=arm64-v8.2-a build=native --jobs=64 --silent fixed_format_kernels=True validation_tests=1
g++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include acl_pooling.cpp -o acl_pooling -L./ComputeLibrary/build/ -L./ComputeLibrary/build/tests/ -L./ComputeLibrary/build/tests/framework/ -larm_compute ./ComputeLibrary/build/tests/AssetsLibrary.o ./ComputeLibrary/build/tests/RawTensor.o ./ComputeLibrary/build/tests/framework/Exceptions.o -std=c++17
 LD_LIBRARY_PATH=ComputeLibrary/build ./acl_pooling

The last command gives me ~2086 microseconds on Ampere.

morgolock commented 1 month ago

Hi @alvoron

Could you please try rebuilding the library with openmp=1 cppthreads=0?

Hope this helps

alvoron commented 1 month ago

I rebuilt ACL:

arm_compute_version=v24.04 Build options: {'neon': '1', 'opencl': '0', 'openmp': '1', 'cppthreads': '0', 'os': 'linux', 'data_layout_support': 'all', 'arch': 'arm64-v8.2-a', 'build': 'native', 'fixed_format_kernels': 'True'} Git hash=b'4fda7a803eaadf00ba36bd532481a33c18952089'

and got ~2072 microseconds on Ampere, so the issue remains.

P.S. I also wasn't able to build ACL with validation_tests=1 and openmp=1 because of an undefined-reference error:

/usr/bin/ld: build/tests/validation/UNIT/CPPScheduler.o: in function `UNITSuite::CPPSchedulerSuite::RethrowException::do_run()':
CPPScheduler.cpp:(.text+0xd0): undefined reference to `arm_compute::CPPScheduler::CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x150): undefined reference to `arm_compute::CPPScheduler::set_num_threads(unsigned int)'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x160): undefined reference to `arm_compute::CPPScheduler::schedule(arm_compute::ICPPKernel*, arm_compute::IScheduler::Hints const&)'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x4a4): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x59c): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x684): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'

That's why I set validation_tests=0 and removed the inputt-filling logic from the reproducer; I believe this shouldn't affect performance.

morgolock commented 1 week ago

Hi @alvoron

> The reproducer gives ~6500 microseconds on my M2 Pro, which is twice slower than reference C++ implementation of Pooling.

Can you please point us to the actual reference implementation you're using? How do you make the measurements for both backends, reference and ACL? Is it a single binary you're using?

morgolock commented 1 week ago

Hi @alvoron

I made some changes to our validation suite to assess the performance; see the results below. The neon backend is much faster than our reference code.

ComputeLibrary % ./build/tests/arm_compute_validation "--filter=.*Pooling3d.*" --mode=NIGHTLY --threads=4
...
Running [337] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=0,0,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 873
reference time: 50789
  Wall clock/Wall clock time:    AVG=32352.0000 us
Running [338] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,1,1,1,1:ExcludePadding=1:DataType=F32'
neon time: 1006
reference time: 56723
  Wall clock/Wall clock time:    AVG=38709.0000 us
Running [339] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,1,1,1,1:ExcludePadding=0:DataType=F32'
neon time: 1049
reference time: 56795
  Wall clock/Wall clock time:    AVG=38914.0000 us
Running [340] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 918
reference time: 51994
  Wall clock/Wall clock time:    AVG=34195.0000 us
Running [341] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 934
reference time: 51818
  Wall clock/Wall clock time:    AVG=34168.0000 us
Running [342] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=0,0,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 661
reference time: 21681
  Wall clock/Wall clock time:    AVG=7178.0000 us
Running [343] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=0,0,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 662
reference time: 21722
  Wall clock/Wall clock time:    AVG=7316.0000 us
Running [344] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,1,1,1,1:ExcludePadding=1:DataType=F32'
neon time: 733
reference time: 25640
  Wall clock/Wall clock time:    AVG=8681.0000 us
Running [345] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,1,1,1,1:ExcludePadding=0:DataType=F32'
neon time: 704
reference time: 25464
  Wall clock/Wall clock time:    AVG=8755.0000 us
Running [346] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 648
reference time: 22707
  Wall clock/Wall clock time:    AVG=7663.0000 us
Running [347] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 661
reference time: 22717
  Wall clock/Wall clock time:    AVG=7742.0000 us
Running [348] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x1x1:Padding=0,0,0,0,0,0:ExcludePadding=1:DataType=F32'
alvoron commented 8 hours ago

> Hi @alvoron
>
> The reproducer gives ~6500 microseconds on my M2 Pro, which is twice slower than reference C++ implementation of Pooling.
>
> Can you please point us to the actual reference implementation you're using? How do you make the measurements for both backends reference and ACL? Is it a single binary you're using?

May we treat the benchdnn results as the reference? I repeated the benchdnn and ACL commands on Ampere and got an average of 2.3-2.6 ms with the ACL reproducer and an average of 0.9 ms with benchdnn.

I assume my benchdnn command matches the ACL kernel configuration. Please let me know if I missed something.