Open alvoron opened 6 months ago
I prepared a benchdnn reference reproducer and checked it on Ampere server.
Benchdnn
cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_RULE_MESSAGES=OFF -DONEDNN_CPU_RUNTIME=OMP
cmake --build build --target benchdnn --parallel $(nproc)
./build/tests/benchdnn/benchdnn --mode=P --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=pooling_max --dt=f32:f32 --tag=acdeb --attr-scratchpad=user mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1
The last benchdnn command gives me min(ms):0.673584 avg(ms):0.787748
on Ampere.
ACL
scons neon=1 opencl=0 openmp=1 os=linux data_layout_support=all arch=arm64-v8.2-a build=native --jobs=64 build=native --silent fixed_format_kernels=True validation_tests=1
g++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include acl_pooling.cpp -o acl_pooling -L./ComputeLibrary/build/ -L./ComputeLibrary/build/tests/ -L./ComputeLibrary/build/tests/framework/ -larm_compute ./ComputeLibrary/build/tests/AssetsLibrary.o ./ComputeLibrary/build/tests/RawTensor.o ./ComputeLibrary/build/tests/framework/Exceptions.o -std=c++17
LD_LIBRARY_PATH=ComputeLibrary/build ./acl_pooling
The last command gives me 2086
on Ampere.
Hi @alvoron
Could you please try rebuilding the library with openmp=1 cppthreads=0
?
Hope this helps
I rebuilt ACL:
arm_compute_version=v24.04 Build options: {'neon': '1', 'opencl': '0', 'openmp': '1', 'cppthreads': '0', 'os': 'linux', 'data_layout_support': 'all', 'arch': 'arm64-v8.2-a', 'build': 'native', 'fixed_format_kernels': 'True'} Git hash=b'4fda7a803eaadf00ba36bd532481a33c18952089'
and got 2072
on Ampere, so the issue still remains.
P.S.
Also I wasn't able to build ACL with validation_tests=1
and openmp=1
because of undefined reference issue:
/usr/bin/ld: build/tests/validation/UNIT/CPPScheduler.o: in function `UNITSuite::CPPSchedulerSuite::RethrowException::do_run()':
CPPScheduler.cpp:(.text+0xd0): undefined reference to `arm_compute::CPPScheduler::CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x150): undefined reference to `arm_compute::CPPScheduler::set_num_threads(unsigned int)'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x160): undefined reference to `arm_compute::CPPScheduler::schedule(arm_compute::ICPPKernel*, arm_compute::IScheduler::Hints const&)'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x4a4): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x59c): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x684): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'
That's why I set validation_tests=0
and deleted inputt
filling logic from the reproducer. I believe it shouldn't affect the performance.
Hi @alvoron
The reproducer gives ~6500 microseconds on my M2 Pro, which is twice slower than reference C++ implementation of Pooling.
Can you please point us to the actual reference implementation you're using? How do you make the measurements for both backends reference and ACL? Is it a single binary you're using?
Hi @alvoron
I made some changes to our validation suite to assess the performance, see the results below, neon backend is much faster than our reference code.
ComputeLibrary % ./build/tests/arm_compute_validation "--filter=.*Pooling3d.*" --mode=NIGHTLY --threads=4
...
Running [337] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=0,0,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 873
reference time: 50789
Wall clock/Wall clock time: AVG=32352.0000 us
Running [338] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,1,1,1,1:ExcludePadding=1:DataType=F32'
neon time: 1006
reference time: 56723
Wall clock/Wall clock time: AVG=38709.0000 us
Running [339] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,1,1,1,1:ExcludePadding=0:DataType=F32'
neon time: 1049
reference time: 56795
Wall clock/Wall clock time: AVG=38914.0000 us
Running [340] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 918
reference time: 51994
Wall clock/Wall clock time: AVG=34195.0000 us
Running [341] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 934
reference time: 51818
Wall clock/Wall clock time: AVG=34168.0000 us
Running [342] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=0,0,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 661
reference time: 21681
Wall clock/Wall clock time: AVG=7178.0000 us
Running [343] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=0,0,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 662
reference time: 21722
Wall clock/Wall clock time: AVG=7316.0000 us
Running [344] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,1,1,1,1:ExcludePadding=1:DataType=F32'
neon time: 733
reference time: 25640
Wall clock/Wall clock time: AVG=8681.0000 us
Running [345] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,1,1,1,1:ExcludePadding=0:DataType=F32'
neon time: 704
reference time: 25464
Wall clock/Wall clock time: AVG=8755.0000 us
Running [346] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 648
reference time: 22707
Wall clock/Wall clock time: AVG=7663.0000 us
Running [347] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 661
reference time: 22717
Wall clock/Wall clock time: AVG=7742.0000 us
Running [348] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x1x1:Padding=0,0,0,0,0,0:ExcludePadding=1:DataType=F32'
Hi @alvoron
The reproducer gives ~6500 microseconds on my M2 Pro, which is twice slower than reference C++ implementation of Pooling.
Can you please point us to the actual reference implementation you're using? How do you make the measurements for both backends reference and ACL? Is it a single binary you're using?
May we refer to benchdnn
results as to reference one? I repeated benchdnn and ACL commands again on Ampere and I got average 2.3-2.6 ms using ACL reproducer and average 0.9 ms using benchdnn.
I assume, my benchdnn command equals to ACL kernel configuration. Please let me know if I missed something.
Hi @alvoron
I tried this on Ampere altra using benchdnn
.
See below, the output of the build without ACL: total perf: min(ms):2.14062 avg(ms):5.18986
pabtel01@ampere-01 build_noacl]$ ./build/tests/benchdnn/benchdnn --mode=P --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=pooling_max --dt=f32:f32 --tag=acdeb --attr-scratchpad=user mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,simple_nhwc:any,,--mode=P --pool --allow-enum-tags-only=false --dir=FWD_I --tag=acdeb --attr-scratchpad=user mb1ic192id40ih28iw28od40oh28ow28kd3kh3kw3pd1ph1pw1,0,0.469971,2.14062,0,5.18986,0
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):2.14062 avg(ms):5.18986
total: 3.05s; fill: 0.03s (1%);
Can you please double check on your side?
Hi @alvoron I made some changes to the reproducer to print the time used for each iteration and then compute the average. I ran the ACL reproducer on Ampere altra and I see performance AVG ranging from [3584,7000] μs
Please see below
LD_LIBRARY_PATH=../ComputeLibrary/build/:$LD_LIBRARY_PATH ./acl_pooling
i 0 time: 3047
i 1 time: 3393
i 2 time: 2866
i 3 time: 3601
i 4 time: 3060
i 5 time: 2984
i 6 time: 4200
i 7 time: 3540
i 8 time: 3782
i 9 time: 3690
i 10 time: 3055
i 11 time: 3504
i 12 time: 3530
i 13 time: 3144
i 14 time: 3507
i 15 time: 3632
i 16 time: 3504
i 17 time: 3519
i 18 time: 3515
i 19 time: 3474
i 20 time: 3514
i 21 time: 3467
i 22 time: 3605
i 23 time: 3518
i 24 time: 3409
i 25 time: 3337
i 26 time: 2815
i 27 time: 3278
i 28 time: 3125
i 29 time: 3520
i 30 time: 3581
i 31 time: 2903
i 32 time: 2951
i 33 time: 3142
i 34 time: 3332
i 35 time: 3572
i 36 time: 3045
i 37 time: 3364
i 38 time: 3271
i 39 time: 3456
i 40 time: 3375
i 41 time: 3441
i 42 time: 3329
i 43 time: 3298
i 44 time: 3345
i 45 time: 3336
i 46 time: 3488
i 47 time: 3280
i 48 time: 3362
i 49 time: 3566
i 50 time: 3579
i 51 time: 3614
i 52 time: 3507
i 53 time: 3619
i 54 time: 3468
i 55 time: 3576
i 56 time: 3549
i 57 time: 3467
i 58 time: 3557
i 59 time: 3532
i 60 time: 3481
i 61 time: 3499
i 62 time: 3541
i 63 time: 3591
i 64 time: 3572
i 65 time: 3552
i 66 time: 3342
i 67 time: 3565
i 68 time: 3437
i 69 time: 3610
i 70 time: 3532
i 71 time: 3425
i 72 time: 3620
i 73 time: 3633
i 74 time: 3613
i 75 time: 3661
i 76 time: 3566
i 77 time: 3598
i 78 time: 3761
i 79 time: 3866
i 80 time: 3790
i 81 time: 3662
i 82 time: 3559
i 83 time: 3380
i 84 time: 3888
i 85 time: 3589
i 86 time: 3569
i 87 time: 3695
i 88 time: 3543
i 89 time: 3603
i 90 time: 3536
i 91 time: 3581
i 92 time: 3715
i 93 time: 3365
i 94 time: 3648
i 95 time: 3260
i 96 time: 3609
i 97 time: 7832
i 98 time: 6981
i 99 time: 7144
total: 3584
Hope this helps
@morgolock oneDNN without ACL gives me 0.8 ms avg
./build/tests/benchdnn/benchdnn --mode=P --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=pooling_max --dt=f32:f32 --tag=acdeb --attr-scratchpad=user mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,simple_nhwc:any,,--mode=P --pool --allow-enum-tags-only=false --dir=FWD_I --tag=acdeb --attr-scratchpad=user mb1ic192id40ih28iw28od40oh28ow28kd3kh3kw3pd1ph1pw1,0,1.63257,0.69165,0,0.837442,0
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.69165 avg(ms):0.837442
total: 3.06s; fill: 0.03s (1%);
Below is a single test run with DNNL_VERBOSE
enabled:
DNNL_VERBOSE=2 ./build/tests/benchdnn/benchdnn --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=pooling_max --dt=f32:f32 --tag=acdeb --attr-scratchpad=user mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1
onednn_verbose,v1,info,oneDNN v3.6.1 (commit e72f65d70e36f552f902d35614aef7aa54f3c796)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:64
onednn_verbose,v1,info,cpu,isa:AArch64 (with Advanced SIMD & floating-point)
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,implementation,backend,exec_time
onednn_verbose,v1,primitive,create:cache_miss,cpu,pooling,simple_nhwc:any,forward_inference,src:f32::blocked:acdeb::f0 dst:f32:a:blocked:acdeb::f0,attr-scratchpad:user,alg:pooling_max,mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1,0.0651855
onednn_verbose,v1,primitive,create:cache_hit,cpu,pooling,simple_nhwc:any,forward_inference,src:f32::blocked:acdeb::f0 dst:f32:a:blocked:acdeb::f0,attr-scratchpad:user,alg:pooling_max,mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1,0.00195312
onednn_verbose,v1,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src:f32::blocked:abcde::f0 dst:f32::blocked:acdeb::f0,,,1x192x40x28x28,0.512939
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:abcde::f0 dst:f32::blocked:acdeb::f0,,,1x192x40x28x28,3.46313
onednn_verbose,v1,primitive,exec,cpu,pooling,simple_nhwc:any,forward_inference,src:f32::blocked:acdeb::f0 dst:f32:a:blocked:acdeb::f0,attr-scratchpad:user,alg:pooling_max,mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1,1.80005
onednn_verbose,v1,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src:f32::blocked:acdeb::f0 dst:f32::blocked:abcde::f0,,,1x192x40x28x28,0.156982
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:acdeb::f0 dst:f32::blocked:abcde::f0,,,1x192x40x28x28,0.478027
0:PASSED __REPRO: --pool --allow-enum-tags-only=false --dir=FWD_I --tag=acdeb --attr-scratchpad=user mb1ic192id40ih28iw28od40oh28ow28kd3kh3kw3pd1ph1pw1
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 0.12s; fill: 0.03s (23%); compute_ref: 0.05s (37%); compare: 0.02s (16%);
I'm wondering why your oneDNN results is much slower than mine: 5.19 ms vs 0.8 ms.
I'm using oneDNN 3.6.1. I built benchdnn
in this way:
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target benchdnn --parallel $(nproc)
Output of 'strings libarm_compute.so | grep arm_compute_version': arm_compute_version=v24.02.1 Build options: {'neon': '1', 'opencl': '0', 'openmp': '0', 'cppthreads': '1', 'arch': 'armv8.6-a', 'Werror': 'false', 'validation_tests': '1', 'os': 'macos', 'build': 'native', 'fixed_format_kernels': '1'} Git hash=b'f2eda6665c12d568e179f5b0e7a24ccdc0ac824d'
Platform: Apple M2 Pro
Operating System: macOS 13.4
Problem description:
NEPooling3dLayer
provides twice much latency rather than reference C++ pooling implementation: 6.5 ms vs 3.5 ms.Reproducer
How reproducer was built
The reproducer gives ~6500 microseconds on my M2 Pro, which is twice slower than reference C++ implementation of Pooling.
Could you please review potential performance issues in
NEPooling3dLayer
?