ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
2.87k stars 783 forks source link

NEPooling3dLayer performance issue #1107

Open alvoron opened 6 months ago

alvoron commented 6 months ago

Output of 'strings libarm_compute.so | grep arm_compute_version': arm_compute_version=v24.02.1 Build options: {'neon': '1', 'opencl': '0', 'openmp': '0', 'cppthreads': '1', 'arch': 'armv8.6-a', 'Werror': 'false', 'validation_tests': '1', 'os': 'macos', 'build': 'native', 'fixed_format_kernels': '1'} Git hash=b'f2eda6665c12d568e179f5b0e7a24ccdc0ac824d'

Platform: Apple M2 Pro

Operating System: macOS 13.4

Problem description: NEPooling3dLayer provides twice much latency rather than reference C++ pooling implementation: 6.5 ms vs 3.5 ms.

Reproducer

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "utils/Utils.h"
#include "tests/SimpleTensor.h"
#include "arm_compute/runtime/Tensor.h"
#include "utils/TypePrinter.h"

#include "tests/Utils.h"
#include "tests/AssetsLibrary.h"
#include "tests/NEON/Accessor.h"

#include <string>
#include <chrono>

 using namespace std;
 using namespace arm_compute;
 using namespace arm_compute::test;

 int main()
 {
  DataLayout dataLayout = DataLayout::NDHWC;
  TensorShape inTensorShape = TensorShape(192, 28, 28, 40, 1);
  TensorShape outTensorShape = inTensorShape;

  Tensor inputt;
  Tensor outputt;
  inputt.allocator()->init(TensorInfo(inTensorShape, 1, DataType::F32, dataLayout));
  outputt.allocator()->init(TensorInfo(outTensorShape, 1, DataType::F32, dataLayout));

  Pooling3dLayerInfo pool3d_info;
  pool3d_info.pool_type       = PoolingType::MAX;
  pool3d_info.exclude_padding = true;
  pool3d_info.pool_size       = arm_compute::Size3D(3, 3, 3);
  pool3d_info.stride          = arm_compute::Size3D(1, 1, 1);
  pool3d_info.padding         = arm_compute::Padding3D(1, 1, 1, 1, 1, 1);
  pool3d_info.round_type      = DimensionRoundingType::FLOOR;

  NEPooling3dLayer pooling;
  pooling.configure(&inputt, &outputt, pool3d_info);

  inputt.allocator()->allocate();
  outputt.allocator()->allocate();

  AssetsLibrary library(".", std::random_device()());
  std::uniform_real_distribution<> distribution{ 0.0f, 10.0f };
  library.fill(Accessor(inputt), distribution, 0);

//warm-up
  pooling.run();

std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < 100; i++) pooling.run();
std::chrono::high_resolution_clock::time_point finish = std::chrono::high_resolution_clock::now();
uint64_t total_duration = std::chrono::duration_cast<std::chrono::microseconds>(finish - start).count();
std::cout << "time: " << total_duration / 100 << std::endl;
}

How reproducer was built

clang++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include acl_pooling.cpp -o acl_pooling -L./ComputeLibrary/build/ -L./ComputeLibrary/build/tests/ -L./ComputeLibrary/build/tests/framework/ -larm_compute -lAssetsLibrary.o -lRawTensor.o -lExceptions.o -std=c++17

The reproducer gives ~6500 microseconds on my M2 Pro, which is twice slower than reference C++ implementation of Pooling.

Could you please review potential performance issues in NEPooling3dLayer ?

alvoron commented 6 months ago

I prepared a benchdnn reference reproducer and checked it on Ampere server.

Benchdnn

cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_RULE_MESSAGES=OFF -DONEDNN_CPU_RUNTIME=OMP
cmake --build build --target benchdnn --parallel $(nproc)
./build/tests/benchdnn/benchdnn --mode=P --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=pooling_max --dt=f32:f32 --tag=acdeb  --attr-scratchpad=user mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1

The last benchdnn command gives me min(ms):0.673584 avg(ms):0.787748 on Ampere.

ACL

scons neon=1 opencl=0 openmp=1 os=linux data_layout_support=all arch=arm64-v8.2-a build=native --jobs=64 build=native --silent fixed_format_kernels=True validation_tests=1
g++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include acl_pooling.cpp -o acl_pooling -L./ComputeLibrary/build/ -L./ComputeLibrary/build/tests/ -L./ComputeLibrary/build/tests/framework/ -larm_compute ./ComputeLibrary/build/tests/AssetsLibrary.o ./ComputeLibrary/build/tests/RawTensor.o ./ComputeLibrary/build/tests/framework/Exceptions.o -std=c++17
 LD_LIBRARY_PATH=ComputeLibrary/build ./acl_pooling

The last command gives me 2086 on Ampere.

morgolock commented 5 months ago

Hi @alvoron

Could you please try rebuilding the library with openmp=1 cppthreads=0 ?

Hope this helps

alvoron commented 5 months ago

I rebuilt ACL:

arm_compute_version=v24.04 Build options: {'neon': '1', 'opencl': '0', 'openmp': '1', 'cppthreads': '0', 'os': 'linux', 'data_layout_support': 'all', 'arch': 'arm64-v8.2-a', 'build': 'native', 'fixed_format_kernels': 'True'} Git hash=b'4fda7a803eaadf00ba36bd532481a33c18952089'

and got 2072 on Ampere, so the issue still remains.

P.S. Also I wasn't able to build ACL with validation_tests=1 and openmp=1 because of undefined reference issue:

/usr/bin/ld: build/tests/validation/UNIT/CPPScheduler.o: in function `UNITSuite::CPPSchedulerSuite::RethrowException::do_run()':
CPPScheduler.cpp:(.text+0xd0): undefined reference to `arm_compute::CPPScheduler::CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x150): undefined reference to `arm_compute::CPPScheduler::set_num_threads(unsigned int)'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x160): undefined reference to `arm_compute::CPPScheduler::schedule(arm_compute::ICPPKernel*, arm_compute::IScheduler::Hints const&)'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x4a4): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x59c): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x684): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'

That's why I set validation_tests=0 and deleted inputt filling logic from the reproducer. I believe it shouldn't affect the performance.

morgolock commented 5 months ago

Hi @alvoron

The reproducer gives ~6500 microseconds on my M2 Pro, which is twice slower than reference C++ implementation of Pooling.

Can you please point us to the actual reference implementation you're using? How do you make the measurements for both backends reference and ACL? Is it a single binary you're using?

morgolock commented 5 months ago

Hi @alvoron

I made some changes to our validation suite to assess the performance, see the results below, neon backend is much faster than our reference code.

ComputeLibrary % ./build/tests/arm_compute_validation "--filter=.*Pooling3d.*" --mode=NIGHTLY --threads=4
...
Running [337] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=0,0,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 873
reference time: 50789
  Wall clock/Wall clock time:    AVG=32352.0000 us
Running [338] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,1,1,1,1:ExcludePadding=1:DataType=F32'
neon time: 1006
reference time: 56723
  Wall clock/Wall clock time:    AVG=38709.0000 us
Running [339] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,1,1,1,1:ExcludePadding=0:DataType=F32'
neon time: 1049
reference time: 56795
  Wall clock/Wall clock time:    AVG=38914.0000 us
Running [340] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 918
reference time: 51994
  Wall clock/Wall clock time:    AVG=34195.0000 us
Running [341] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 934
reference time: 51818
  Wall clock/Wall clock time:    AVG=34168.0000 us
Running [342] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=0,0,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 661
reference time: 21681
  Wall clock/Wall clock time:    AVG=7178.0000 us
Running [343] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=0,0,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 662
reference time: 21722
  Wall clock/Wall clock time:    AVG=7316.0000 us
Running [344] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,1,1,1,1:ExcludePadding=1:DataType=F32'
neon time: 733
reference time: 25640
  Wall clock/Wall clock time:    AVG=8681.0000 us
Running [345] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,1,1,1,1:ExcludePadding=0:DataType=F32'
neon time: 704
reference time: 25464
  Wall clock/Wall clock time:    AVG=8755.0000 us
Running [346] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 648
reference time: 22707
  Wall clock/Wall clock time:    AVG=7663.0000 us
Running [347] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 661
reference time: 22717
  Wall clock/Wall clock time:    AVG=7742.0000 us
Running [348] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x1x1:Padding=0,0,0,0,0,0:ExcludePadding=1:DataType=F32'
alvoron commented 4 months ago

Hi @alvoron

The reproducer gives ~6500 microseconds on my M2 Pro, which is twice slower than reference C++ implementation of Pooling.

Can you please point us to the actual reference implementation you're using? How do you make the measurements for both backends reference and ACL? Is it a single binary you're using?

May we refer to benchdnn results as to reference one? I repeated benchdnn and ACL commands again on Ampere and I got average 2.3-2.6 ms using ACL reproducer and average 0.9 ms using benchdnn.

I assume, my benchdnn command equals to ACL kernel configuration. Please let me know if I missed something.

morgolock commented 1 month ago

Hi @alvoron

I tried this on Ampere altra using benchdnn.

See below, the output of the build without ACL: total perf: min(ms):2.14062 avg(ms):5.18986

pabtel01@ampere-01 build_noacl]$ ./build/tests/benchdnn/benchdnn --mode=P --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=pooling_max --dt=f32:f32 --tag=acdeb  --attr-scratchpad=user mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,simple_nhwc:any,,--mode=P --pool --allow-enum-tags-only=false --dir=FWD_I --tag=acdeb --attr-scratchpad=user mb1ic192id40ih28iw28od40oh28ow28kd3kh3kw3pd1ph1pw1,0,0.469971,2.14062,0,5.18986,0
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):2.14062 avg(ms):5.18986
total: 3.05s; fill: 0.03s (1%);

Can you please double check on your side?

morgolock commented 2 weeks ago

Hi @alvoron I made some changes to the reproducer to print the time used for each iteration and then compute the average. I ran the ACL reproducer on Ampere altra and I see performance AVG ranging from [3584,7000] μs

Please see below

LD_LIBRARY_PATH=../ComputeLibrary/build/:$LD_LIBRARY_PATH ./acl_pooling 
i 0 time: 3047
i 1 time: 3393
i 2 time: 2866
i 3 time: 3601
i 4 time: 3060
i 5 time: 2984
i 6 time: 4200
i 7 time: 3540
i 8 time: 3782
i 9 time: 3690
i 10 time: 3055
i 11 time: 3504
i 12 time: 3530
i 13 time: 3144
i 14 time: 3507
i 15 time: 3632
i 16 time: 3504
i 17 time: 3519
i 18 time: 3515
i 19 time: 3474
i 20 time: 3514
i 21 time: 3467
i 22 time: 3605
i 23 time: 3518
i 24 time: 3409
i 25 time: 3337
i 26 time: 2815
i 27 time: 3278
i 28 time: 3125
i 29 time: 3520
i 30 time: 3581
i 31 time: 2903
i 32 time: 2951
i 33 time: 3142
i 34 time: 3332
i 35 time: 3572
i 36 time: 3045
i 37 time: 3364
i 38 time: 3271
i 39 time: 3456
i 40 time: 3375
i 41 time: 3441
i 42 time: 3329
i 43 time: 3298
i 44 time: 3345
i 45 time: 3336
i 46 time: 3488
i 47 time: 3280
i 48 time: 3362
i 49 time: 3566
i 50 time: 3579
i 51 time: 3614
i 52 time: 3507
i 53 time: 3619
i 54 time: 3468
i 55 time: 3576
i 56 time: 3549
i 57 time: 3467
i 58 time: 3557
i 59 time: 3532
i 60 time: 3481
i 61 time: 3499
i 62 time: 3541
i 63 time: 3591
i 64 time: 3572
i 65 time: 3552
i 66 time: 3342
i 67 time: 3565
i 68 time: 3437
i 69 time: 3610
i 70 time: 3532
i 71 time: 3425
i 72 time: 3620
i 73 time: 3633
i 74 time: 3613
i 75 time: 3661
i 76 time: 3566
i 77 time: 3598
i 78 time: 3761
i 79 time: 3866
i 80 time: 3790
i 81 time: 3662
i 82 time: 3559
i 83 time: 3380
i 84 time: 3888
i 85 time: 3589
i 86 time: 3569
i 87 time: 3695
i 88 time: 3543
i 89 time: 3603
i 90 time: 3536
i 91 time: 3581
i 92 time: 3715
i 93 time: 3365
i 94 time: 3648
i 95 time: 3260
i 96 time: 3609
i 97 time: 7832
i 98 time: 6981
i 99 time: 7144
total: 3584

Hope this helps

alvoron commented 1 week ago

@morgolock oneDNN without ACL gives me 0.8 ms avg

./build/tests/benchdnn/benchdnn --mode=P --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=pooling_max --dt=f32:f32 --tag=acdeb  --attr-scratchpad=user mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,simple_nhwc:any,,--mode=P --pool --allow-enum-tags-only=false --dir=FWD_I --tag=acdeb --attr-scratchpad=user mb1ic192id40ih28iw28od40oh28ow28kd3kh3kw3pd1ph1pw1,0,1.63257,0.69165,0,0.837442,0
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.69165 avg(ms):0.837442
total: 3.06s; fill: 0.03s (1%);

Below is a single test run with DNNL_VERBOSE enabled:

DNNL_VERBOSE=2 ./build/tests/benchdnn/benchdnn --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=pooling_max --dt=f32:f32 --tag=acdeb  --attr-scratchpad=user mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1
onednn_verbose,v1,info,oneDNN v3.6.1 (commit e72f65d70e36f552f902d35614aef7aa54f3c796)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:64
onednn_verbose,v1,info,cpu,isa:AArch64 (with Advanced SIMD & floating-point)
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,implementation,backend,exec_time
onednn_verbose,v1,primitive,create:cache_miss,cpu,pooling,simple_nhwc:any,forward_inference,src:f32::blocked:acdeb::f0 dst:f32:a:blocked:acdeb::f0,attr-scratchpad:user,alg:pooling_max,mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1,0.0651855
onednn_verbose,v1,primitive,create:cache_hit,cpu,pooling,simple_nhwc:any,forward_inference,src:f32::blocked:acdeb::f0 dst:f32:a:blocked:acdeb::f0,attr-scratchpad:user,alg:pooling_max,mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1,0.00195312
onednn_verbose,v1,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src:f32::blocked:abcde::f0 dst:f32::blocked:acdeb::f0,,,1x192x40x28x28,0.512939
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:abcde::f0 dst:f32::blocked:acdeb::f0,,,1x192x40x28x28,3.46313
onednn_verbose,v1,primitive,exec,cpu,pooling,simple_nhwc:any,forward_inference,src:f32::blocked:acdeb::f0 dst:f32:a:blocked:acdeb::f0,attr-scratchpad:user,alg:pooling_max,mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1,1.80005
onednn_verbose,v1,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src:f32::blocked:acdeb::f0 dst:f32::blocked:abcde::f0,,,1x192x40x28x28,0.156982
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:acdeb::f0 dst:f32::blocked:abcde::f0,,,1x192x40x28x28,0.478027
0:PASSED __REPRO: --pool --allow-enum-tags-only=false --dir=FWD_I --tag=acdeb --attr-scratchpad=user mb1ic192id40ih28iw28od40oh28ow28kd3kh3kw3pd1ph1pw1
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 0.12s; fill: 0.03s (23%); compute_ref: 0.05s (37%); compare: 0.02s (16%);

I'm wondering why your oneDNN results is much slower than mine: 5.19 ms vs 0.8 ms. I'm using oneDNN 3.6.1. I built benchdnn in this way:

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target benchdnn --parallel $(nproc)