ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.

How can I benchmark GEMMs with `arm_compute_benchmark`? #1115

Closed by FabianSchuetze 1 day ago

FabianSchuetze commented 6 days ago

Thanks for the wonderful library. Apologies if this seems to be a silly question:

How can I benchmark GEMMs on an Android target?

Following the documentation for the tests, I ran the following on my target (output included):

gts9:/data/local/tmp $ LD_LIBRARY_PATH=$PWD ./arm_compute_benchmark --mode=precommit                                                                                                                       
Version = arm_compute_version=v24.06 Build options: {'toolchain_prefix': 'aarch64-linux-android33-', 'opencl': '0', 'arch': 'armv8.6-a', 'build': 'cross_compile', 'os': 'android', 'benchmark_tests': '1', 'embed_kernels': '1'} Git hash=b'505adb91d40e05b3f80a075a4467a78a253395e1'
CommandLine = ./arm_compute_benchmark 
Seed = 1148538226
cpu_has_sve = false
cpu_has_sve2 = false
cpu_has_svef32mm = false
cpu_has_svei8mm = false
cpu_has_svebf16 = false
cpu_has_sme = false
cpu_has_sme2 = false
cpu_has_fp16 = true
cpu_has_bf16 = true
cpu_has_dotprod = true
cpu_has_i8mm = true
CPU0 = A510
CPU1 = A510
CPU2 = A510
CPU3 = GENERIC
CPU4 = GENERIC
CPU5 = GENERIC
CPU6 = GENERIC
CPU7 = GENERIC
Iterations = 1
Threads = 1
Dataset mode = PRECOMMIT
Running [0] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1892.0000 us
Running [1] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1894.0000 us
Running [2] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1939.0000 us
Running [3] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=3446.0000 us
Running [4] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=3439.0000 us
Running [5] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=3836.0000 us
Running [6] 'NEON/Scale/RunSmall@Shape=640,480:DataType=S16:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1172.0000 us
Running [7] 'NEON/Scale/RunSmall@Shape=640,480:DataType=S16:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1121.0000 us
Running [8] 'NEON/Scale/RunSmall@Shape=640,480:DataType=S16:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1133.0000 us
Running [9] 'NEON/Scale/RunSmall@Shape=640,480:DataType=S16:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=3530.0000 us
Running [10] 'NEON/Scale/RunSmall@Shape=640,480:DataType=S16:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=3548.0000 us
Running [11] 'NEON/Scale/RunSmall@Shape=640,480:DataType=S16:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=2594.0000 us
Running [12] 'NEON/Scale/RunSmall@Shape=640,480:DataType=F32:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1113.0000 us
Running [13] 'NEON/Scale/RunSmall@Shape=640,480:DataType=F32:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1114.0000 us
Running [14] 'NEON/Scale/RunSmall@Shape=640,480:DataType=F32:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1279.0000 us
Running [15] 'NEON/Scale/RunSmall@Shape=640,480:DataType=F32:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=2387.0000 us
Running [16] 'NEON/Scale/RunSmall@Shape=640,480:DataType=F32:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=2317.0000 us
Running [17] 'NEON/Scale/RunSmall@Shape=640,480:DataType=F32:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1500.0000 us
Running [18] 'NEON/Scale/RunSmall@Shape=800,600:DataType=U8:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1644.0000 us
Running [19] 'NEON/Scale/RunSmall@Shape=800,600:DataType=U8:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1622.0000 us
Running [20] 'NEON/Scale/RunSmall@Shape=800,600:DataType=U8:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1622.0000 us
Running [21] 'NEON/Scale/RunSmall@Shape=800,600:DataType=U8:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=3305.0000 us
Running [22] 'NEON/Scale/RunSmall@Shape=800,600:DataType=U8:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=3277.0000 us
Running [23] 'NEON/Scale/RunSmall@Shape=800,600:DataType=U8:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=3040.0000 us
Running [24] 'NEON/Scale/RunSmall@Shape=800,600:DataType=S16:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1417.0000 us
Running [25] 'NEON/Scale/RunSmall@Shape=800,600:DataType=S16:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1349.0000 us
Running [26] 'NEON/Scale/RunSmall@Shape=800,600:DataType=S16:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1379.0000 us
Running [27] 'NEON/Scale/RunSmall@Shape=800,600:DataType=S16:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=3388.0000 us
Running [28] 'NEON/Scale/RunSmall@Shape=800,600:DataType=S16:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=3346.0000 us
Running [29] 'NEON/Scale/RunSmall@Shape=800,600:DataType=S16:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=2433.0000 us
Running [30] 'NEON/Scale/RunSmall@Shape=800,600:DataType=F32:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1365.0000 us
Running [31] 'NEON/Scale/RunSmall@Shape=800,600:DataType=F32:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1358.0000 us
Running [32] 'NEON/Scale/RunSmall@Shape=800,600:DataType=F32:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=1365.0000 us
Running [33] 'NEON/Scale/RunSmall@Shape=800,600:DataType=F32:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=3183.0000 us
Running [34] 'NEON/Scale/RunSmall@Shape=800,600:DataType=F32:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=3251.0000 us
Running [35] 'NEON/Scale/RunSmall@Shape=800,600:DataType=F32:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
  Wall clock/Wall clock time:    AVG=2341.0000 us
Executed 36 test(s) (36 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 0 second(s)
gts9:/data/local/tmp $    

However, only Scale benchmarks seem to be run.
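
One way to confirm which test suites a run actually covered is to summarise the log by suite name. The following is a minimal sketch, assuming the output keeps the `Running [n] '<name>@<parameters>'` / `AVG=<value> us` layout shown above; the `summarize` helper is hypothetical and not part of the library.

```python
import re
from collections import defaultdict

# Matches e.g.: Running [0] 'NEON/Scale/RunSmall@Shape=640,480:...'
# and captures the suite name up to the '@' (or the closing quote).
RUNNING = re.compile(r"^Running \[\d+\] '([^'@]+)")
# Matches e.g.:   Wall clock/Wall clock time:    AVG=1892.0000 us
AVG = re.compile(r"AVG=([0-9.]+) us")


def summarize(log_text):
    """Return {suite: (test_count, mean_avg_us)} for a benchmark log."""
    times = defaultdict(list)
    suite = None
    for line in log_text.splitlines():
        match = RUNNING.match(line.strip())
        if match:
            suite = match.group(1)
            continue
        match = AVG.search(line)
        if match and suite is not None:
            times[suite].append(float(match.group(1)))
            suite = None  # pair each timing with exactly one "Running" line
    return {s: (len(v), sum(v) / len(v)) for s, v in times.items()}


# First two entries of the log above.
sample = (
    "Running [0] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:DataLayout=NCHW:"
    "InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'\n"
    "  Wall clock/Wall clock time:    AVG=1892.0000 us\n"
    "Running [1] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:DataLayout=NCHW:"
    "InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=CONSTANT:SamplingPolicy=CENTER'\n"
    "  Wall clock/Wall clock time:    AVG=1894.0000 us\n"
)

for suite, (count, mean_us) in summarize(sample).items():
    print(f"{suite}: {count} test(s), mean {mean_us:.1f} us")
```

Run against the full log, a single `NEON/Scale/RunSmall` entry in the summary would confirm that no GEMM suite was executed.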

I am interested in running Int8 GEMMs (with an Int32 accumulator) and measuring the throughput my target supports, in GOPS (the integer counterpart of GFLOPS). I would like to use all cores on my system. Ideally, I would like to exercise the SMMLA (UMMLA) instructions.
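
For reference, a rough upper bound to compare a measurement against can be derived from the instruction itself: one SMMLA (or UMMLA) multiplies a 2x8 int8 matrix by an 8x2 int8 matrix and accumulates into a 2x2 int32 tile, which is 32 multiply-accumulates, or 64 operations. The sketch below is only an estimate under stated assumptions: the core count, clock frequency and instructions issued per cycle are hypothetical placeholders that depend on the actual CPU.

```python
# One SMMLA/UMMLA: (2x8 int8) x (8x2 int8) accumulated into a 2x2 int32 tile.
MACS_PER_SMMLA = 2 * 2 * 8           # 32 multiply-accumulates
OPS_PER_SMMLA = 2 * MACS_PER_SMMLA   # each MAC counts as a multiply plus an add


def peak_int8_gops(cores, freq_ghz, smmla_per_cycle):
    """Theoretical peak in GOPS; all three inputs are assumptions about the CPU."""
    return cores * freq_ghz * smmla_per_cycle * OPS_PER_SMMLA


# Hypothetical example: one core at 2.0 GHz issuing one SMMLA per cycle.
print(peak_int8_gops(cores=1, freq_ghz=2.0, smmla_per_cycle=1))  # 128.0
```

A measured GEMM will land below this bound because of memory traffic, packing, and loop overhead.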

I built the arm_compute_benchmark binary with the following command:

CC=clang CXX=clang++ scons -j8 toolchain_prefix=aarch64-linux-android33- opencl=0 arch=armv8.6-a build=cross_compile os=android benchmark_tests=1 embed_kernels=1 neon=1

I also had to slightly modify the SConstruct file; the patch is below (a bit hacky, but I'm only interested in cross-compilation):


diff --git a/SConstruct b/SConstruct
index bad85e503d..5282a8d537 100644
--- a/SConstruct
+++ b/SConstruct
@@ -418,9 +418,10 @@ if env['os'] == 'windows':
     env['AR'] = "llvm-lib"
     env['RANLIB'] = "llvm-ranlib"
 else:
-    env['AR'] = toolchain_prefix + "ar"
+    # env['AR'] = toolchain_prefix + "clang++"
+    env['AR'] = "llvm-ar"

-env['RANLIB'] = toolchain_prefix + "ranlib"
+env['RANLIB'] = "llvm-ranlib"

 print("Using compilers:")
 print("CC", env['CC'])

FabianSchuetze commented 1 day ago

I finally figured it out.

The library needs to be built with the additional option benchmark_examples (e.g. benchmark_examples=1, alongside the options above), and the benchmark is then run on the device with:

./benchmark_neon_sgemm --iterations=100 --example_args=2048,2048,2048
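
To turn the reported time into a throughput figure, the usual convention is to count 2*M*N*K operations per GEMM (one multiply and one add per inner-product term). The sketch below assumes that the three `--example_args` values are M, N and K, and uses a hypothetical 100 ms average time purely for illustration; it is not a measured result.

```python
def gemm_gflops(m, n, k, seconds):
    """Throughput of one M x K by K x N GEMM, counting 2*M*N*K operations."""
    return 2.0 * m * n * k / seconds / 1e9


# Hypothetical average time per iteration: 100 ms (not a measurement).
print(f"{gemm_gflops(2048, 2048, 2048, 0.100):.1f} GFLOP/s")
```

Note that sgemm operates on FP32 data, so this number reflects floating-point throughput rather than the Int8 path asked about originally.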