ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

NEGEMMLowpMatrixMultiplyCore: performance issue int8 vs fp16 #1131

Open eshoguli opened 2 months ago

eshoguli commented 2 months ago

Issues:

  1. The low-precision kernel selected by default is not optimal for the platform described below.
  2. We see only a 30% performance gain for the low-precision kernel vs fp16 in multithreaded mode. Can you please confirm whether these are the results you expect? We expected a 2x performance gain.

Platform:

system_profiler SPHardwareDataType
   Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: Mac15,6
      Chip: Apple M3 Pro
      Total Number of Cores: 12 (6 performance and 6 efficiency)
      Memory: 18 GB

Operating System:

ProductName:        macOS
ProductVersion:     14.2.1
BuildVersion:       23C71

Command line

scons arch=arm64-v8.2-a neon=1 opencl=0 openmp=0 cppthreads=0 os=macos data_layout_support=all  build=native asserts=1 --jobs=8 --silent os=macos build=native fixed_format_kernels=True validation_tests=1 examples=1 debug=0

Single thread: cppthreads=0. Multithread: cppthreads=1.

Results fp16, default kernel: a64_hybrid_fp16_mla_6x32

Single thread, shapes: 4096x128 * 128x4096:

fp16 time median time = 17373 microsecs

Multithread, shapes: 4096x128 * 128x4096:

fp16 time median time = 2919 microsecs

Results int8, default selected kernel: a64_hybrid_s8s32_mmla_6x16

Single thread, shapes: 4096x128 * 128x4096:

int8 time median time = 12573 microsecs

Multithread, shapes: 4096x128 * 128x4096:

int8 time median time = 3595 microsecs

Results int8, manually selected kernel: a64_interleaved_s8s32_mmla_8x12

Single thread, shapes: 4096x128 * 128x4096:

int8 time median time = 12598 microsecs

Multithread, shapes: 4096x128 * 128x4096:

int8 time median time = 2113 microsecs
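
For reference, the ratios implied by the numbers above can be recomputed with a short script. This is only a minimal sketch: the dictionary layout and the helper names `speedup` and `gops` are illustrative and not part of ACL, and the operation count assumes the usual 2*M*N*K convention for a GEMM of these shapes.

```python
# Median times in microseconds, copied from the results reported above.
# Shapes: 4096x128 * 128x4096, so M=4096, K=128, N=4096.
M, K, N = 4096, 128, 4096

FP16 = "fp16 a64_hybrid_fp16_mla_6x32"
INT8_DEFAULT = "int8 a64_hybrid_s8s32_mmla_6x16 (default)"
INT8_MANUAL = "int8 a64_interleaved_s8s32_mmla_8x12 (manual)"

times_us = {
    "single thread": {FP16: 17373, INT8_DEFAULT: 12573, INT8_MANUAL: 12598},
    "multithread": {FP16: 2919, INT8_DEFAULT: 3595, INT8_MANUAL: 2113},
}


def speedup(baseline_us, candidate_us):
    """Ratio > 1 means the candidate is faster than the baseline."""
    return baseline_us / candidate_us


def gops(time_us):
    """Throughput in giga-operations per second (assumes 2*M*N*K operations)."""
    return 2 * M * N * K / (time_us * 1e-6) / 1e9


for mode, results in times_us.items():
    baseline = results[FP16]
    print(mode)
    for kernel, t in results.items():
        print(f"  {kernel}: {t} us, "
              f"{speedup(baseline, t):.2f}x vs fp16, {gops(t):.0f} Gop/s")
```

With these figures, int8 comes out at roughly 1.38x over fp16 in single-thread mode for both kernels and in multithreaded mode for the manually selected kernel, while the default int8 kernel is slower than fp16 in multithreaded mode (ratio about 0.81).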

morgolock commented 2 months ago

Hi @eshoguli

What is the data layout used for these workloads when calling into ACL? It would help if you could build ACL with logging=1 so that we can know more details about these workloads.

eshoguli commented 2 months ago

You can use the eshoguli:es/neon_gemm_s8s8s32_perf_default branch to easily reproduce the issue.