GalShalif commented 5 years ago

Output of 'strings libarm_compute.so | grep arm_compute_version': arm_compute_version=v19.05 Build options: {'install_dir': '/home/gals/SOC/PX30/PX30_Linux_SDK_v1.0_20190508/px30/buildroot/output/rockchip_px30_64/target/usr', 'gles_compute': '0', 'toolchain_prefix': '/home/gals/SOC/PX30/PX30_Linux_SDK_v1.0_20190508/px30/buildroot/output/rockchip_px30_64/host/bin/aarch64-buildroot-linux-gnu-', 'os': 'linux', 'opencl': '0', 'neon': '1', 'benchmark_tests': '1', 'validation_tests': '1', 'build': 'cross_compile', 'debug': '0', 'arch': 'arm64-v8a', 'Werror': '1', 'examples': '1'} Git hash=91b48ae406eadf0eae4296f001276ae17222dcea

Platform: RockChip PX30, 4 x Cortex-A35 at 1.248GHz, 2GB RAM, GPU is Mali G31 MP2 with OpenCL 2.0. The faster platform is the RockChip RK3328, 4 x Cortex-A53 at 1.296GHz, 4GB RAM, GPU is Mali 450 MP2 without OpenCL, Ubuntu 18.04 with kernel 4.15.0-rockchip-ayufan-191-g29128dea2

Operating System: PX30 (4 x Cortex-A35): RockChip Linux SDK that is based on buildroot from 2018-03, with kernel 4.4.159

strings /lib/libc.so.6 | grep GLIB

GLIBC_2.26

RK3328 (4 x Cortex-A53): Ubuntu 18.04 with kernel 4.15.0-rockchip-ayufan-191-g29128dea2

strings /lib/libc.so.6 | grep GLIB

GNU C Library (Ubuntu GLIBC 2.27-3ubuntu1) stable release version 2.27.

Problem description:

Problem summary: The performance of a FLOAT32 matrix multiplication using the NEON instructions of a RockChip PX30 (4 x Cortex-A35) is about 1/3 of the performance of a RockChip RK3328 (4 x Cortex-A53) running at a similar CPU clock speed.

Problem details:

Both test machines use a RockChip SoC:
- PX30: 4 x Cortex-A35 1.248GHz, 2GB RAM, GPU is Mali G31 MP2 with OpenCL 2.0, Buildroot with kernel 4.4.159
- RK3328: 4 x Cortex-A53 1.296GHz, 4GB RAM, GPU is Mali 450 MP2 without OpenCL, Ubuntu 18.04 with kernel 4.15.0-rockchip-ayufan-191-g29128dea2
The test code for the FLOAT32 NEON matrix multiplication, neon_sgemm_noveto, is a (small) modification of https://github.com/ctuning/ck-math/blob/master/program/acl-sgemm-neon-example/sgemm.cpp
- The code was compiled once and the same binaries were installed on both test machines
The multiplication of a FLOAT32 matrix of size 512x512 (10,000 iterations) was run with:
- MM=512 MN=512 MK=512 NUM_REPETITIONS=10000 time neon_sgemm_noveto
Results from the above run:

| machine | user (s) | system (s) | elapsed (s) | CPU | GFLOPS |
| --- | --- | --- | --- | --- | --- |
| RK3328 | 464.44 | 4.31 | 118.94 | 394% | 22.57 |
| PX30 | 1520.96 | 5.04 | 383.04 | 398% | 7.01 |
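The GFLOPS column is consistent with the usual operation count for a matrix multiplication, 2 x M x N x K floating-point operations per GEMM, divided by the elapsed time. The sketch below reproduces the two figures under that assumption; the function name `sgemm_gflops` is illustrative and is not taken from the test program.

```python
def sgemm_gflops(m, n, k, repetitions, elapsed_seconds):
    """Sustained GFLOPS for `repetitions` multiplications of (M x K) by (K x N)."""
    # Assumption: one multiply and one add per inner-product term.
    flops_per_gemm = 2 * m * n * k
    return flops_per_gemm * repetitions / elapsed_seconds / 1e9


# Elapsed times (seconds) taken from the results table above.
for machine, elapsed in (("RK3328", 118.94), ("PX30", 383.04)):
    gflops = sgemm_gflops(512, 512, 512, 10000, elapsed)
    print(f"{machine}: {gflops:.2f} GFLOPS")
```

Using the elapsed (wall-clock) time rather than the user time matters here, because the user time is summed over all four cores.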

Details about the above test:

Tested with both 19.05 and 18.05 versions (from https://github.com/arm-software/ComputeLibrary)
Compiler: cross compile with g++ version aarch64-buildroot-linux-gnu-g++.br_real (Buildroot 2018.02-rc3-01103-ge346aa9215) 6.5.0
The CPU detection of Cortex-A35 was verified to be correct - it is detected as a Cortex-A53
CPU governor was set to "performance"
Temperature is fine and the CPUs are not throttling to a lower frequency
Before running the test: vmstat reported 99% idle for the RK3328 and 100% idle for the PX30
If compiled with debug=1, the performance results are about 15% lower - but the RK3328 is still 3x faster than the PX30
Changing the matrix size to 128x128 or 1024x1024 yields similar results
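For anyone who wants to double-check which core the kernel reports, independently of the library's own detection, the part number can be read from /proc/cpuinfo. The sketch below is a minimal parser, not the Compute Library's detection code; it relies on the Arm part numbers 0xd03 (Cortex-A53) and 0xd04 (Cortex-A35), and the sample text is a hypothetical excerpt rather than output captured from the boards in this issue.

```python
# Arm-assigned part numbers from the MIDR register (implementer 0x41 = Arm).
ARM_PARTS = {0xD03: "Cortex-A53", 0xD04: "Cortex-A35"}


def cpu_models(cpuinfo_text):
    """Return one model name per 'CPU part' line found in /proc/cpuinfo text."""
    models = []
    implementer = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "CPU implementer":
            implementer = int(value, 16)
        elif key == "CPU part":
            part = int(value, 16)
            if implementer == 0x41:
                models.append(ARM_PARTS.get(part, f"unknown Arm part {part:#x}"))
            else:
                models.append(f"non-Arm part {part:#x}")
    return models


# Hypothetical excerpt in the format used by arm64 kernels.
SAMPLE = """processor : 0
CPU implementer : 0x41
CPU part : 0xd04
processor : 1
CPU implementer : 0x41
CPU part : 0xd04
"""
print(cpu_models(SAMPLE))
```

On a target board, pass the contents of /proc/cpuinfo to `cpu_models` instead of `SAMPLE`.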

Build details: Build - from https://arm-software.github.io/ComputeLibrary/v19.02/index.xhtml#S3_how_to_build

Download sources git clone https://github.com/arm-software/ComputeLibrary && cd ComputeLibrary && git checkout v19.05
The test code (see attachment) of a matrix multiplication of FLOAT32 is a (small) modification of https://github.com/ctuning/ck-math/blob/master/program/acl-sgemm-neon-example/sgemm.cpp: cp -pi neon_sgemm_noveto.cpp examples/.
Cross compile with cross tools version 6.5 of the buildroot: scons Werror=1 os=linux arch=arm64-v8a debug=0 opencl=0 neon=1 gles_compute=0 examples=1 validation_tests=1 benchmark_tests=1 build_dir=$(pwd)/build_release -j8 build=cross_compile toolchain_prefix=/home/gals/SOC/PX30/PX30_Linux_SDK_v1.0_20190508/px30/buildroot/output/rockchip_px30_64/host/bin/aarch64-buildroot-linux-gnu-
Notable compiler flags: -march=armv8-a -O3 -ftree-vectorize -DARCH_ARM -DARM_COMPUTE_CPP_SCHEDULER=1 -DARM_COMPUTE_AARCH64_V8A -DNO_DOT_IN_TOOLCHAIN

neon_sgemm_noveto.cpp.gz

morgolock commented 5 years ago

Hi @GalShalif

Could you please share the results of lscpu on both systems?

It would be interesting to compile the library with cppthreads=0 and openmp=0. This will disable multithreading and let us know if there is a problem in the scheduling causing the performance drop.

GalShalif commented 5 years ago

CPU information: PX30 (4 x Cortex-A35):

Governor: cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor returns "performance"
lscpu output:
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Model: 2
CPU max MHz: 1248.0000
CPU min MHz: 408.0000
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32

RK3328 (4 x Cortex-A53):

Governor: cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor returns "performance"
lscpu output:
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: ARM
Model: 4
Model name: Cortex-A53
Stepping: r0p4
CPU max MHz: 1296.0000
CPU min MHz: 408.0000
BogoMIPS: 48.00
L1d cache: unknown size
L1i cache: unknown size
L2 cache: unknown size
NUMA node0 CPU(s): 0-3
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

GalShalif commented 5 years ago

Disabling multi-threading did not help - the performance difference remains after re-compiling the Compute Library without multi-threading:

Re-compile the ARM Compute Library without multi-threading with the compilation flags: cppthreads=0 openmp=0

Re-run the benchmark test (multiplication of a FLOAT32 matrix of size 512x512): MM=512 MN=512 MK=512 NUM_REPETITIONS=10000 time neon_sgemm_noveto

| machine | user (s) | system (s) | elapsed (s) | CPU | GFLOPS |
| --- | --- | --- | --- | --- | --- |
| RK3328 | 397.49 | 1.78 | 399.39 | 99% | 6.72 |
| PX30 | 1493.15 | 1.28 | 1494.53 | 99% | 1.79 |
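Comparing these single-threaded elapsed times with the earlier four-thread run shows how well each board scales across its cores. The sketch below is plain arithmetic on the numbers reported in this thread; the names `scaling` and `ELAPSED` are illustrative only.

```python
CORES = 4

# Elapsed seconds reported in this thread: four-thread run vs. single-thread run.
ELAPSED = {
    "RK3328": {"multi": 118.94, "single": 399.39},
    "PX30": {"multi": 383.04, "single": 1494.53},
}


def scaling(single_seconds, multi_seconds, cores):
    """Return (speedup, parallel efficiency) of the multi-threaded run."""
    speedup = single_seconds / multi_seconds
    return speedup, speedup / cores


for machine, t in ELAPSED.items():
    speedup, efficiency = scaling(t["single"], t["multi"], CORES)
    print(f"{machine}: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Both boards get a speedup of well over 3x from four threads, and the PX30 scales slightly better than the RK3328, which suggests the gap comes from per-core throughput rather than from the scheduler.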

morgolock commented 5 years ago

Hi @GalShalif

The A35 is a lower specification core than the A53, lower performance is expected.

GalShalif commented 5 years ago

The A35 is a lower specification core than the A53, but it should deliver about 80% of the A53's performance, not 31%:

PX30 (4 x A35 @ 1.248GHz) reaches 31% (7.01 GFLOPS) of the RK3328 (4 x A53 @ 1.296GHz) (22.57 GFLOPS) - see above
Cortex-A35 matches 80-100% of the Cortex-A53 performance (depending on use-case) - [https://www.anandtech.com/show/9769/arm-announces-cortex-a35]
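The 31% figure, and the same comparison normalised for the small clock difference, can be reproduced as follows. This is a sketch using only the GFLOPS and clock values quoted in this thread; `flops_per_cycle_per_core` is an illustrative helper and assumes all four cores run at the stated maximum frequency.

```python
def flops_per_cycle_per_core(gflops, clock_ghz, cores=4):
    """Average floating-point operations retired per clock cycle on each core."""
    return gflops / (clock_ghz * cores)


px30_gflops, px30_ghz = 7.01, 1.248
rk3328_gflops, rk3328_ghz = 22.57, 1.296

raw_ratio = px30_gflops / rk3328_gflops
px30_fpc = flops_per_cycle_per_core(px30_gflops, px30_ghz)
rk3328_fpc = flops_per_cycle_per_core(rk3328_gflops, rk3328_ghz)

print(f"PX30 / RK3328 throughput: {raw_ratio:.1%}")
print(f"PX30 FLOPs per cycle per core: {px30_fpc:.2f}")
print(f"RK3328 FLOPs per cycle per core: {rk3328_fpc:.2f}")
print(f"Clock-normalised ratio: {px30_fpc / rk3328_fpc:.1%}")
```

Normalising for clock speed moves the ratio only from about 31% to about 32%, so the 48 MHz frequency difference does not account for the gap.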

GalShalif commented 5 years ago

P.S. Thanks for your help, I will contact RockChip (the manufacturer of the PX30 and RK3328) for more information regarding the NEON performance of their CPUs.

morgolock commented 5 years ago

Hi @GalShalif

ACL includes some highly optimised A53 kernels and this work has not been done yet for the A35, so I think this combined with the fact that the A35 is a lower spec core than the A53 would explain the results you see.

GalShalif commented 5 years ago

Thanks for the update - I will forward the information to RockChip (the manufacturer of the PX30 and RK3328).

mhdfasilwyd commented 5 months ago

Is there any further information related to this thread? I'm facing similar issues when using the NEON libraries. Can anything be done to optimise this library for the Cortex-A35? Any optimisation tips would be useful to me.

ARM-software / ComputeLibrary

NEON performance of PX30 (4 x Cortex-A35) is 1 / 3 of RK3328 (4 x Cortex-A53) #719
