ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

NEON performance of PX30 (4 x Cortex-A35) is 1 / 3 of RK3328 (4 x Cortex-A53) #719

Closed GalShalif closed 5 years ago

GalShalif commented 5 years ago

Output of 'strings libarm_compute.so | grep arm_compute_version':

arm_compute_version=v19.05
Build options: {'install_dir': '/home/gals/SOC/PX30/PX30_Linux_SDK_v1.0_20190508/px30/buildroot/output/rockchip_px30_64/target/usr', 'gles_compute': '0', 'toolchain_prefix': '/home/gals/SOC/PX30/PX30_Linux_SDK_v1.0_20190508/px30/buildroot/output/rockchip_px30_64/host/bin/aarch64-buildroot-linux-gnu-', 'os': 'linux', 'opencl': '0', 'neon': '1', 'benchmark_tests': '1', 'validation_tests': '1', 'build': 'cross_compile', 'debug': '0', 'arch': 'arm64-v8a', 'Werror': '1', 'examples': '1'}
Git hash=91b48ae406eadf0eae4296f001276ae17222dcea

Platform:

    • RockChip PX30: 4 x Cortex-A35 1.248GHz, 2GB RAM, GPU is MALI G31 MP2 with OpenCL 2.0
    • RockChip RK3328 (the faster machine): 4 x Cortex-A53 1.296GHz, 4GB RAM, GPU is MALI 450 MP2 without OpenCL, Ubuntu 18.04 with kernel 4.15.0-rockchip-ayufan-191-g29128dea2

Operating System: PX30 (4 x Cortex-A35): RockChip Linux SDK that is based on buildroot from 2018-03, with kernel 4.4.159

strings /lib/libc.so.6 | grep GLIB

GLIBC_2.26

RK3328 (4 x Cortex-A53): Ubuntu 18.04 with kernel 4.15.0-rockchip-ayufan-191-g29128dea2

strings /lib/libc.so.6 | grep GLIB

GNU C Library (Ubuntu GLIBC 2.27-3ubuntu1) stable release version 2.27.

Problem description:

Problem summary: The performance of a FLOAT32 matrix multiplication using the NEON instructions of a RockChip PX30 (4 x Cortex-A35) is 1/3 of the performance of a RockChip RK3328 (4 x Cortex-A53) running at a similar clock speed.

Problem details:

  1. Both test machines are a RockChip SoC:
    • PX30 4 x Cortex-a35 1.248GHz, 2GB RAM, GPU is MALI G31 MP2 with OpenCL 2.0, Buildroot with kernel 4.4.159
    • RK3328 4 x Cortex-A53 1.296GHz, 4GB RAM, GPU is MALI 450 MP2 without OpenCL, Ubuntu 18.04 with kernel 4.15.0-rockchip-ayufan-191-g29128dea2
  2. The test code for the FLOAT32 NEON matrix multiplication, neon_sgemm_noveto, is a (small) modification of https://github.com/ctuning/ck-math/blob/master/program/acl-sgemm-neon-example/sgemm.cpp
    • The code was compiled once and the same binaries were installed on both test machines
  3. Multiplication of a FLOAT32 matrix of size 512x512 (10,000 iterations) was done with:
    • MM=512 MN=512 MK=512 NUM_REPETITIONS=10000 time neon_sgemm_noveto
  4. Results from the above run:

| machine | user (s) | system (s) | elapsed (s) | CPU | GFLOPS |
|---|---|---|---|---|---|
| RK3328 | 464.44 | 4.31 | 118.94 | 394% | 22.57 |
| PX30 | 1520.96 | 5.04 | 383.04 | 398% | 7.01 |
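The GFLOPS column is consistent with counting 2 x M x N x K floating-point operations per multiplication (one multiply and one add per inner-product term) and dividing by the elapsed wall-clock time. The sketch below reproduces that arithmetic; the function names are hypothetical and are not taken from the attached test program.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations for one M x N x K matrix multiplication:
// each of the M*N outputs needs K multiplies and K adds.
double sgemm_flops(std::uint64_t m, std::uint64_t n, std::uint64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// GFLOPS achieved over `repetitions` multiplications that took
// `elapsed_seconds` of wall-clock time in total.
double sgemm_gflops(std::uint64_t m, std::uint64_t n, std::uint64_t k,
                    std::uint64_t repetitions, double elapsed_seconds) {
    assert(elapsed_seconds > 0.0);
    return sgemm_flops(m, n, k) * static_cast<double>(repetitions) /
           elapsed_seconds / 1e9;
}
```

Called with 512, 512, 512, 10000 and the elapsed times from the table (118.94 s and 383.04 s), this gives roughly 22.57 and 7.01 GFLOPS, matching the reported figures.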

Details about the above test:

  1. Tested with both 19.05 and 18.05 versions (from ​https://github.com/arm-software/ComputeLibrary)
  2. Compiler: cross compile with g++ version aarch64-buildroot-linux-gnu-g++.br_real (Buildroot 2018.02-rc3-01103-ge346aa9215) 6.5.0
  3. The CPU detection of the Cortex-A35 was verified to be correct - it is detected as a Cortex-A53. The CPU governor was set to "performance"
  4. Temperature is fine and CPUs are not throttling to a lower frequency
  5. Before running the test: vmstat reports 99% idle for the RK3328 and 100% idle for the PX30
  6. If compiled with debug=1, then performance results are about 15% lower - but still, the RK3328 is 3x faster than the PX30
  7. Changing the matrix size to 128x128 or 1024x1024 yields similar results
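The governor and frequency checks in the list above can also be made from inside a benchmark program, so the values are recorded with each run. A minimal sketch follows, assuming the standard Linux cpufreq sysfs layout is exposed by the kernel on both boards (not verified in this thread); the helper names are hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Read the first line of a small text file such as a sysfs attribute.
// Returns false if the file cannot be opened or has no content.
bool read_first_line(const std::string& path, std::string& out) {
    std::ifstream in(path);
    std::string line;
    if (!in || !std::getline(in, line)) {
        return false;
    }
    out = line;
    return true;
}

// Build the usual Linux cpufreq sysfs path for one core, for example
// attribute "scaling_governor" or "scaling_cur_freq" (frequency in kHz).
std::string cpufreq_path(unsigned cpu, const std::string& attribute) {
    assert(!attribute.empty());
    return "/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
           "/cpufreq/" + attribute;
}
```

Reading cpufreq_path(0, "scaling_governor") before the run and cpufreq_path(0, "scaling_cur_freq") before and after it would show whether the governor is "performance" and whether the clock dropped during the test.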

Build details: built following https://arm-software.github.io/ComputeLibrary/v19.02/index.xhtml#S3_how_to_build

  1. Download the sources: git clone https://github.com/arm-software/ComputeLibrary && cd ComputeLibrary && git checkout v19.05

  2. Add the test code (see attachment) to the examples. It is a FLOAT32 matrix multiplication and a (small) modification of https://github.com/ctuning/ck-math/blob/master/program/acl-sgemm-neon-example/sgemm.cpp: cp -pi neon_sgemm_noveto.cpp examples/.

  3. Cross-compile with the Buildroot cross toolchain (version 6.5): scons Werror=1 os=linux arch=arm64-v8a debug=0 opencl=0 neon=1 gles_compute=0 examples=1 validation_tests=1 benchmark_tests=1 build_dir=$(pwd)/build_release -j8 build=cross_compile toolchain_prefix=/home/gals/SOC/PX30/PX30_Linux_SDK_v1.0_20190508/px30/buildroot/output/rockchip_px30_64/host/bin/aarch64-buildroot-linux-gnu-

  4. Notable compiler flags: -march=armv8-a -O3 -ftree-vectorize -DARCH_ARM -DARM_COMPUTE_CPP_SCHEDULER=1 -DARM_COMPUTE_AARCH64_V8A -DNO_DOT_IN_TOOLCHAIN

neon_sgemm_noveto.cpp.gz

morgolock commented 5 years ago

Hi @GalShalif

Could you please share the results of lscpu on both systems?

It would be interesting to compile the library with cppthreads=0 and openmp=0. This will disable multithreading and let us know if there is a problem in the scheduling causing the performance drop.

GalShalif commented 5 years ago

CPU information: PX30 (4 x Cortex-A35):

RK3328 (4 x Cortex-A53):

GalShalif commented 5 years ago

Disabling multi-threading did not help - the performance difference remains after re-compiling the Compute Library without multi-threading:

Re-compile the ARM Compute Library without multi-threading with the compilation flags: cppthreads=0 openmp=0

Re-run the benchmark test (multiplication of a FLOAT32 matrix of size 512x512): MM=512 MN=512 MK=512 NUM_REPETITIONS=10000 time neon_sgemm_noveto

| machine | user (s) | system (s) | elapsed (s) | CPU | GFLOPS |
|---|---|---|---|---|---|
| RK3328 | 397.49 | 1.78 | 399.39 | 99% | 6.72 |
| PX30 | 1493.15 | 1.28 | 1494.53 | 99% | 1.79 |
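Comparing these single-threaded figures with the earlier four-thread run shows how well each SoC scales across its cores. The sketch below only illustrates that arithmetic on the reported GFLOPS values; the function names are hypothetical.

```cpp
#include <cassert>

// Speed-up obtained by running on several cores instead of one.
double speedup(double multi_gflops, double single_gflops) {
    assert(single_gflops > 0.0);
    return multi_gflops / single_gflops;
}

// Fraction of ideal linear scaling achieved on `cores` cores
// (1.0 would mean perfect scaling).
double scaling_efficiency(double multi_gflops, double single_gflops,
                          unsigned cores) {
    assert(cores > 0);
    return speedup(multi_gflops, single_gflops) / cores;
}
```

With the reported numbers, the RK3328 speeds up by about 3.36x on four cores (22.57 / 6.72) and the PX30 by about 3.92x (7.01 / 1.79). Both scale close to linearly, which suggests the gap comes from per-core throughput rather than from scheduling.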

morgolock commented 5 years ago

Hi @GalShalif

The A35 is a lower specification core than the A53, lower performance is expected.

GalShalif commented 5 years ago

The A35 is a lower-specification core than the A53, but it should reach about 80% of the A53's performance, not 31%:

GalShalif commented 5 years ago

P.S. Thanks for your help, I will contact RockChip (the manufacturer of the PX30 and RK3328) for more information regarding the NEON performance of their CPUs.

morgolock commented 5 years ago

Hi @GalShalif

ACL includes some highly optimised A53 kernels and this work has not been done yet for the A35, so I think this combined with the fact that the A35 is a lower spec core than the A53 would explain the results you see.
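For readers unfamiliar with what a "kernel" means here: it is the inner loop that performs the multiply-accumulate work, and a core-specific version is tuned to that core's pipeline. The sketch below is a generic, untuned NEON inner loop with a scalar fallback. It is not ACL's A53 kernel, and the function names are hypothetical; it only shows the kind of code that would need per-core tuning.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// c[0..n) += a * b[0..n): the inner update of a row-major SGEMM.
void axpy_f32(float a, const float* b, float* c, std::size_t n) {
    assert(b != nullptr && c != nullptr);
    std::size_t j = 0;
#if defined(__aarch64__)
    const float32x4_t va = vdupq_n_f32(a);       // broadcast a to 4 lanes
    for (; j + 4 <= n; j += 4) {
        float32x4_t vc = vld1q_f32(c + j);       // load 4 accumulators
        const float32x4_t vb = vld1q_f32(b + j); // load 4 elements of b
        vc = vfmaq_f32(vc, va, vb);              // vc += va * vb (fused)
        vst1q_f32(c + j, vc);
    }
#endif
    for (; j < n; ++j) {                         // scalar tail or fallback
        c[j] += a * b[j];
    }
}

// C += A * B for row-major A (M x K), B (K x N) and C (M x N).
// The caller zeroes C first to obtain a plain product.
void sgemm_naive(const float* A, const float* B, float* C,
                 std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t k = 0; k < K; ++k) {
            axpy_f32(A[i * K + k], B + k * N, C + i * N, N);
        }
    }
}
```

A tuned kernel would typically go further, for example by keeping a block of C in registers and ordering loads to suit the target core. How much that would help on the A35 is not established in this thread.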

GalShalif commented 5 years ago

Thanks for the update - I will forward the information to RockChip (the manufacturer of the PX30 and RK3328).

mhdfasilwyd commented 5 months ago

Is there any further information related to this thread? I'm facing similar issues when using the NEON libraries. Can anything be done to optimise this library for the Cortex-A35? Any optimisation tips would be useful to me.