GalShalif closed this issue 5 years ago
Hi @GalShalif
Could you please share the output of lscpu on both systems?
It would be interesting to compile the library with cppthreads=0 and openmp=0. This disables multithreading and will let us know whether a scheduling problem is causing the performance drop.
CPU information: PX30 (4 x Cortex-A35):
RK3328 (4 x Cortex-A53):
Re-compile the ARM Compute Library without multi-threading with the compilation flags: cppthreads=0 openmp=0
Re-run the benchmark test (multiplication of a FLOAT32 matrix of size 512x512): MM=512 MN=512 MK=512 NUM_REPETITIONS=10000 time neon_sgemm_noveto
| machine | user | system | elapsed | CPU | GFLOPS |
|---|---|---|---|---|---|
| RK3328 | 397.49 | 1.78 | 399.39 | 99% | 6.72 |
| PX30 | 1493.15 | 1.28 | 1494.53 | 99% | 1.79 |
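The GFLOPS column can be reproduced from the elapsed times, assuming each multiplication is counted as 2*M*N*K floating-point operations (one multiply and one add per inner-product term). That convention is an assumption here; the thread does not state how the figure was computed. A minimal sketch in Python, where `gflops` is an illustrative helper name:

```python
def gflops(m: int, n: int, k: int, repetitions: int, elapsed_s: float) -> float:
    """Sustained GFLOPS, assuming 2*m*n*k floating-point operations per multiplication."""
    total_flops = 2 * m * n * k * repetitions
    return total_flops / elapsed_s / 1e9


# Elapsed wall-clock seconds from the single-threaded results above.
single_thread_elapsed = {"RK3328": 399.39, "PX30": 1494.53}

for machine, elapsed in single_thread_elapsed.items():
    # MM=512 MN=512 MK=512 NUM_REPETITIONS=10000, as in the benchmark command.
    print(f"{machine}: {gflops(512, 512, 512, 10000, elapsed):.2f} GFLOPS")
```

Under this assumption the computed values land within 0.01 of the reported 6.72 and 1.79 GFLOPS.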
Hi @GalShalif
The A35 is a lower specification core than the A53, lower performance is expected.
The A35 is a lower-specification core than the A53, but its performance should be about 80% of the A53's, not 31%:
P.S. Thanks for your help, I will contact RockChip (the manufacturer of the PX30 and RK3328) for more information regarding the NEON performance of their CPUs.
Hi @GalShalif
ACL includes some highly optimised A53 kernels and this work has not been done yet for the A35, so I think this combined with the fact that the A35 is a lower spec core than the A53 would explain the results you see.
Thanks for the update - I will forward the information to RockChip (the manufacturer of the PX30 and RK3328).
Guys, is there any further information related to this thread? I'm facing similar issues when using the NEON libraries. Can anything be done to optimise this library for the Cortex-A35? Any optimisation tips would be useful to me.
Output of 'strings libarm_compute.so | grep arm_compute_version': arm_compute_version=v19.05 Build options: {'install_dir': '/home/gals/SOC/PX30/PX30_Linux_SDK_v1.0_20190508/px30/buildroot/output/rockchip_px30_64/target/usr', 'gles_compute': '0', 'toolchain_prefix': '/home/gals/SOC/PX30/PX30_Linux_SDK_v1.0_20190508/px30/buildroot/output/rockchip_px30_64/host/bin/aarch64-buildroot-linux-gnu-', 'os': 'linux', 'opencl': '0', 'neon': '1', 'benchmark_tests': '1', 'validation_tests': '1', 'build': 'cross_compile', 'debug': '0', 'arch': 'arm64-v8a', 'Werror': '1', 'examples': '1'} Git hash=91b48ae406eadf0eae4296f001276ae17222dcea
Platform: RockChip PX30, 4 x Cortex-A35 at 1.248GHz, 2GB RAM, GPU is Mali-G31 MP2 with OpenCL 2.0.
The faster RK3328 is: RockChip RK3328, 4 x Cortex-A53 at 1.296GHz, 4GB RAM, GPU is Mali-450 MP2 without OpenCL, Ubuntu 18.04 with kernel 4.15.0-rockchip-ayufan-191-g29128dea2
Operating System: PX30 (4 x Cortex-A35): RockChip Linux SDK that is based on buildroot from 2018-03, with kernel 4.4.159
strings /lib/libc.so.6 | grep GLIB
GLIBC_2.26
RK3328 (4 x Cortex-A53): Ubuntu 18.04 with kernel 4.15.0-rockchip-ayufan-191-g29128dea2
strings /lib/libc.so.6 | grep GLIB
GNU C Library (Ubuntu GLIBC 2.27-3ubuntu1) stable release version 2.27.
Problem description: Problem summary: The performance of a FLOAT32 matrix multiplication using the NEON instructions of a RockChip PX30 (4 x Cortex-A35) is about 1/3 of the performance of a RockChip RK3328 (4 x Cortex-A53) with a similar CPU clock speed.
Problem details:
| machine | user | system | elapsed | CPU | GFLOPS |
|---|---|---|---|---|---|
| RK3328 | 464.44 | 4.31 | 118.94 | 394% | 22.57 |
| PX30 | 1520.96 | 5.04 | 383.04 | 398% | 7.01 |
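To put the gap in numbers, the elapsed times can be turned into two ratios: PX30 throughput relative to the RK3328, and the speed-up each board gets from four threads. The Python sketch below uses the multi-threaded elapsed times from this table together with the single-threaded elapsed times reported in the comments (399.39 s and 1494.53 s). The function names are illustrative only.

```python
def relative_performance(elapsed_fast_s: float, elapsed_slow_s: float) -> float:
    """Throughput of the slower machine as a fraction of the faster one (same workload)."""
    return elapsed_fast_s / elapsed_slow_s


def speedup(single_thread_s: float, multi_thread_s: float) -> float:
    """How many times faster the multi-threaded run finished than the single-threaded run."""
    return single_thread_s / multi_thread_s


# Elapsed wall-clock seconds: (single-threaded, multi-threaded).
elapsed = {"RK3328": (399.39, 118.94), "PX30": (1494.53, 383.04)}

rk_single, rk_multi = elapsed["RK3328"]
px_single, px_multi = elapsed["PX30"]

print(f"PX30 vs RK3328, 4 threads: {relative_performance(rk_multi, px_multi):.1%}")
print(f"PX30 vs RK3328, 1 thread:  {relative_performance(rk_single, px_single):.1%}")
for machine, (single_s, multi_s) in elapsed.items():
    print(f"{machine} speed-up from threading: {speedup(single_s, multi_s):.2f}x")
```

With these numbers the PX30 reaches roughly 31% of the RK3328 throughput with four threads and roughly 27% with one. Both boards run more than 3x faster with four threads than with one, which suggests the gap comes from per-core throughput rather than from thread scheduling.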
Details about the above test:
Build details: Build - from https://arm-software.github.io/ComputeLibrary/v19.02/index.xhtml#S3_how_to_build
Download sources:
git clone https://github.com/arm-software/ComputeLibrary && cd ComputeLibrary && git checkout v19.05
The test code (see attachment) of a matrix multiplication of FLOAT32 is a (small) modification of https://github.com/ctuning/ck-math/blob/master/program/acl-sgemm-neon-example/sgemm.cpp:
cp -pi neon_sgemm_noveto.cpp examples/.
Cross compile with cross tools version 6.5 of the buildroot:
scons Werror=1 os=linux arch=arm64-v8a debug=0 opencl=0 neon=1 gles_compute=0 examples=1 validation_tests=1 benchmark_tests=1 build_dir=$(pwd)/build_release -j8 \
build=cross_compile toolchain_prefix=/home/gals/SOC/PX30/PX30_Linux_SDK_v1.0_20190508/px30/buildroot/output/rockchip_px30_64/host/bin/aarch64-buildroot-linux-gnu-
Notable compiler flags: -march=armv8-a -O3 -ftree-vectorize -DARCH_ARM -DARM_COMPUTE_CPP_SCHEDULER=1 -DARM_COMPUTE_AARCH64_V8A -DNO_DOT_IN_TOOLCHAIN
neon_sgemm_noveto.cpp.gz