Open alvoron opened 1 month ago
I prepared a benchdnn reference reproducer and checked it on Ampere server.
Benchdnn
cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_RULE_MESSAGES=OFF -DONEDNN_CPU_RUNTIME=OMP
cmake --build build --target benchdnn --parallel $(nproc)
./build/tests/benchdnn/benchdnn --mode=P --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=pooling_max --dt=f32:f32 --tag=acdeb --attr-scratchpad=user mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1
The last benchdnn command gives me min(ms):0.673584 avg(ms):0.787748
on Ampere.
ACL
scons neon=1 opencl=0 openmp=1 os=linux data_layout_support=all arch=arm64-v8.2-a build=native --jobs=64 build=native --silent fixed_format_kernels=True validation_tests=1
g++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include acl_pooling.cpp -o acl_pooling -L./ComputeLibrary/build/ -L./ComputeLibrary/build/tests/ -L./ComputeLibrary/build/tests/framework/ -larm_compute ./ComputeLibrary/build/tests/AssetsLibrary.o ./ComputeLibrary/build/tests/RawTensor.o ./ComputeLibrary/build/tests/framework/Exceptions.o -std=c++17
LD_LIBRARY_PATH=ComputeLibrary/build ./acl_pooling
The last command gives me 2086
on Ampere.
Hi @alvoron
Could you please try rebuilding the library with openmp=1 cppthreads=0
?
Hope this helps
I rebuilt ACL:
arm_compute_version=v24.04 Build options: {'neon': '1', 'opencl': '0', 'openmp': '1', 'cppthreads': '0', 'os': 'linux', 'data_layout_support': 'all', 'arch': 'arm64-v8.2-a', 'build': 'native', 'fixed_format_kernels': 'True'} Git hash=b'4fda7a803eaadf00ba36bd532481a33c18952089'
and got 2072
on Ampere, so the issue still remains.
P.S.
Also I wasn't able to build ACL with validation_tests=1
and openmp=1
because of undefined reference issue:
/usr/bin/ld: build/tests/validation/UNIT/CPPScheduler.o: in function `UNITSuite::CPPSchedulerSuite::RethrowException::do_run()':
CPPScheduler.cpp:(.text+0xd0): undefined reference to `arm_compute::CPPScheduler::CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x150): undefined reference to `arm_compute::CPPScheduler::set_num_threads(unsigned int)'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x160): undefined reference to `arm_compute::CPPScheduler::schedule(arm_compute::ICPPKernel*, arm_compute::IScheduler::Hints const&)'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x4a4): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x59c): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x684): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'
That's why I set validation_tests=0
and deleted inputt
filling logic from the reproducer. I believe it shouldn't affect the performance.
Hi @alvoron
The reproducer gives ~6500 microseconds on my M2 Pro, which is twice slower than reference C++ implementation of Pooling.
Can you please point us to the actual reference implementation you're using? How do you make the measurements for both backends reference and ACL? Is it a single binary you're using?
Hi @alvoron
I made some changes to our validation suite to assess the performance, see the results below, neon backend is much faster than our reference code.
ComputeLibrary % ./build/tests/arm_compute_validation "--filter=.*Pooling3d.*" --mode=NIGHTLY --threads=4
...
Running [337] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=0,0,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 873
reference time: 50789
Wall clock/Wall clock time: AVG=32352.0000 us
Running [338] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,1,1,1,1:ExcludePadding=1:DataType=F32'
neon time: 1006
reference time: 56723
Wall clock/Wall clock time: AVG=38709.0000 us
Running [339] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,1,1,1,1:ExcludePadding=0:DataType=F32'
neon time: 1049
reference time: 56795
Wall clock/Wall clock time: AVG=38914.0000 us
Running [340] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 918
reference time: 51994
Wall clock/Wall clock time: AVG=34195.0000 us
Running [341] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 934
reference time: 51818
Wall clock/Wall clock time: AVG=34168.0000 us
Running [342] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=0,0,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 661
reference time: 21681
Wall clock/Wall clock time: AVG=7178.0000 us
Running [343] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=0,0,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 662
reference time: 21722
Wall clock/Wall clock time: AVG=7316.0000 us
Running [344] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,1,1,1,1:ExcludePadding=1:DataType=F32'
neon time: 733
reference time: 25640
Wall clock/Wall clock time: AVG=8681.0000 us
Running [345] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,1,1,1,1:ExcludePadding=0:DataType=F32'
neon time: 704
reference time: 25464
Wall clock/Wall clock time: AVG=8755.0000 us
Running [346] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 648
reference time: 22707
Wall clock/Wall clock time: AVG=7663.0000 us
Running [347] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 661
reference time: 22717
Wall clock/Wall clock time: AVG=7742.0000 us
Running [348] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x1x1:Padding=0,0,0,0,0,0:ExcludePadding=1:DataType=F32'
Hi @alvoron
The reproducer gives ~6500 microseconds on my M2 Pro, which is twice slower than reference C++ implementation of Pooling.
Can you please point us to the actual reference implementation you're using? How do you make the measurements for both backends reference and ACL? Is it a single binary you're using?
May we refer to benchdnn
results as to reference one? I repeated benchdnn and ACL commands again on Ampere and I got average 2.3-2.6 ms using ACL reproducer and average 0.9 ms using benchdnn.
I assume, my benchdnn command equals to ACL kernel configuration. Please let me know if I missed something.
Output of 'strings libarm_compute.so | grep arm_compute_version': arm_compute_version=v24.02.1 Build options: {'neon': '1', 'opencl': '0', 'openmp': '0', 'cppthreads': '1', 'arch': 'armv8.6-a', 'Werror': 'false', 'validation_tests': '1', 'os': 'macos', 'build': 'native', 'fixed_format_kernels': '1'} Git hash=b'f2eda6665c12d568e179f5b0e7a24ccdc0ac824d'
Platform: Apple M2 Pro
Operating System: macOS 13.4
Problem description:
NEPooling3dLayer
provides twice much latency rather than reference C++ pooling implementation: 6.5 ms vs 3.5 ms.Reproducer
How reproducer was built
The reproducer gives ~6500 microseconds on my M2 Pro, which is twice slower than reference C++ implementation of Pooling.
Could you please review potential performance issues in
NEPooling3dLayer
?