ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

How to use the provided SVE GEMM code? #1142

Open yohuna77777 opened 3 weeks ago

yohuna77777 commented 3 weeks ago

Hello, I am considering using the SVE instruction set to optimize GEMM operators. Although the repository contains SVE GEMM code, I could not find an example showing how to call these SVE-based GEMM kernels. Could you provide a usage example? I look forward to your reply!

morgolock commented 3 weeks ago

Hi @yohuna77777

The library takes care of this for you; you do not need to do anything different when calling the operators. At runtime, the library checks whether the CPU supports SVE and, if so, chooses the best SVE kernel for the given workload.

When you build the library, just make sure that you use the option multi_isa=1. The prebuilt binaries we publish for Linux and Android are both multi_isa.

Please see the sgemm example included in the library.

Hope this helps.

yohuna77777 commented 1 week ago

Thanks for your reply. I used the prebuilt version (v24.08) to compile examples/neon_sgemm.cpp. After running the executable, I disassembled the corresponding code and found that it used NEON registers rather than SVE registers. This suggests that NEGEMM did not run an SVE-optimized kernel. Which runtime operator should I call to use the SVE kernels? (This experiment was run on an Arm server that supports sve256.)

morgolock commented 1 week ago

Hi @yohuna77777

You should call NEGEMM; there is no need to do anything else. At runtime, the library will choose an SVE kernel if it can detect support for SVE. Generally speaking, the high-level API in ACL is about operators/functions, and you do not need to care about ISAs because ACL handles this for you. Each function can run one or more kernels, and the library contains code to select the best kernels based on the workload configuration and the CPU features detected at runtime.
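
Because the selection depends on CPU features detected at runtime, one independent sanity check is whether the Linux kernel itself reports SVE. The sketch below is not part of ACL: `has_cpu_feature` is a hypothetical helper, and it assumes an AArch64 Linux system where /proc/cpuinfo lists CPU features on a "Features" line.

```shell
#!/bin/sh
# has_cpu_feature NAME: succeed if NAME appears as a whole word on a
# "Features" line read from stdin (so "sve2" does not count as "sve").
has_cpu_feature() {
    awk -v want="$1" '
        /^Features/ {
            for (i = 1; i <= NF; i++) if ($i == want) found = 1
        }
        END { exit !found }
    '
}

# Usage: reports "not reported" on machines without a Features line.
if [ -r /proc/cpuinfo ] && has_cpu_feature sve < /proc/cpuinfo; then
    echo "kernel reports SVE"
else
    echo "kernel does not report SVE"
fi
```

If the kernel does not report SVE here, the library cannot be expected to detect it either, whatever build options were used.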

I'd first check that the library sees SVE support as expected on the target platform. You can do this by running arm_compute_validation:

  ./arm_compute_validation --filter-id=100 
Version = arm_compute_version=v0.0-unreleased Build options: {'os': 'linux', 'opencl': '0', 'asserts': '1', 'examples': '1', 'neon': '1', 'arch': 'armv8a', 'benchmark_examples': '0', 'multi_isa': '0', 'debug': '0', 'validation_tests': '1'} Git hash=b'ef28fedc2c34d16532aae0f019b217f684453ce3'
CommandLine = ./arm_compute_validation --filter-id=100 
Seed = 3639253183
cpu_has_sve = false
cpu_has_sve2 = false
cpu_has_svef32mm = false
cpu_has_svei8mm = false
cpu_has_svebf16 = false
cpu_has_sme = false
cpu_has_sme2 = false
cpu_has_fp16 = false
cpu_has_bf16 = false
cpu_has_dotprod = false
cpu_has_i8mm = false
CPU0 = A53
CPU1 = A53
CPU2 = A53
CPU3 = A53
CPU4 = A73
CPU5 = A73
CPU6 = A73
CPU7 = A73
Iterations = 1
Threads = 1
Dataset mode = PRECOMMIT
Running [100] 'CPP/DFT/DFT1D/Complex@TensorShape=23,7'
  Wall clock/Wall clock time:    AVG=4772.0000 us
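
The two facts in that banner that matter for this question are the multi_isa build option and the cpu_has_sve line. A minimal sketch for pulling them out of the output follows; `report_sve_readiness` is a hypothetical helper name, and the sample input is abridged from the banner above.

```shell
#!/bin/sh
# report_sve_readiness: read an arm_compute_validation banner on stdin and
# print the multi_isa build option and the detected cpu_has_sve value.
report_sve_readiness() {
    awk -v q="'" '
        BEGIN { re = q "multi_isa" q ": " q "[01]" q }   # matches e.g. q multi_isa q: q 0 q
        /Build options:/ && match($0, re) {
            # The digit is the character just before the closing quote.
            multi_isa = substr($0, RSTART + RLENGTH - 2, 1)
        }
        /^cpu_has_sve = / { sve = $3 }                   # does not match cpu_has_sve2
        END {
            printf "multi_isa=%s cpu_has_sve=%s\n",
                (multi_isa == "" ? "unknown" : multi_isa),
                (sve == "" ? "unknown" : sve)
        }
    '
}

# Sample input; on a real target, pipe the tool output in instead:
#   ./arm_compute_validation --filter-id=100 | report_sve_readiness
report_sve_readiness <<'EOF'
Version = arm_compute_version=v0.0-unreleased Build options: {'os': 'linux', 'multi_isa': '0', 'debug': '0'}
cpu_has_sve = false
cpu_has_sve2 = false
EOF
```

For the sample this prints `multi_isa=0 cpu_has_sve=false`. For SVE kernels to be selected, one would expect to see `multi_isa=1 cpu_has_sve=true` on the target.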

Alternatively, you can use the ACL benchmark examples together with the instrumentation to see which kernels are actually selected. If you build the library with benchmark_examples=1, you can use the instruments to look into the graph example performance, as shown below:

root@acl_hikey_9:~/tmp/acl_mt# LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args='--layout=NHWC,--target=NEON,--fast-math,--type=QASYMM8'
Version = arm_compute_version=v0.0-unreleased Build options: {'standalone': '0', 'test_filter': 'ActivationLayer.cpp', 'opencl': '1', 'neon': '1', 'validation_tests': '1', 'examples': '0', 'debug': '0', 'arch': 'armv8a', 'benchmark_examples': '1'} Git hash=065d46b0042cb974063e915715f3295ca265e078
CommandLine = ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args=--layout=NHWC,--target=NEON,--fast-math,--type=QASYMM8 
CL_DEVICE_VERSION = OpenCL 2.0 not_released.51d50be.1502459415db9c37cbcbea279386cb09
Iterations = 1
Running [0] 'Examples/benchmark_graph_mobilenet_v2'
Threads : 1
Target : Neon
Data type : QASYMM8
Data layout : NHWC
Tuner enabled? : false
Cache enabled? : false
Tuner mode : Normal
Tuner file : 
MLGO file : 
Fast math enabled? : true

  SchedulerTimer/Conv/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #1:    AVG=4.2440 ms
  SchedulerTimer/Conv/CpuIm2ColKernel #0:    AVG=0.9520 ms
  SchedulerTimer/Conv_1/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #62:    AVG=2.9500 ms
  SchedulerTimer/Logits/AvgPool/CpuPool2dAssemblyWrapperKernel #63:    AVG=0.0550 ms
  SchedulerTimer/Logits/Conv2d_1c_1x1/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #64:    AVG=0.6760 ms
  SchedulerTimer/Predictions/Reshape/CpuReshapeKernel #65:    AVG=0.0590 ms
  SchedulerTimer/Predictions/Softmax/CpuLogits1DMaxKernel/neon_qu8_logits_1d_max #66:    AVG=0.0110 ms
  SchedulerTimer/Predictions/Softmax/CpuLogits1DSoftmaxKernel/neon_qu8_softmax_logits_1d #67:    AVG=0.0860 ms
  SchedulerTimer/expanded_conv/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #2:    AVG=1.4960 ms
  SchedulerTimer/expanded_conv/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #3:    AVG=2.6010 ms
  SchedulerTimer/expanded_conv_1/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #4:    AVG=8.4050 ms
  SchedulerTimer/expanded_conv_1/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s2_output2x2_mla_depthfirst #5:    AVG=1.0920 ms
  SchedulerTimer/expanded_conv_1/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #6:    AVG=1.6500 ms
  SchedulerTimer/expanded_conv_10/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #37:    AVG=0.9680 ms
  SchedulerTimer/expanded_conv_10/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #38:    AVG=0.2350 ms
  SchedulerTimer/expanded_conv_10/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #39:    AVG=1.0050 ms
  SchedulerTimer/expanded_conv_11/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #40:    AVG=1.8990 ms
  SchedulerTimer/expanded_conv_11/add/CpuAddKernel/neon_qu8_add #43:    AVG=0.0560 ms
  SchedulerTimer/expanded_conv_11/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #41:    AVG=0.3510 ms
  SchedulerTimer/expanded_conv_11/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #42:    AVG=1.4570 ms
  SchedulerTimer/expanded_conv_12/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #44:    AVG=1.8920 ms
  SchedulerTimer/expanded_conv_12/add/CpuAddKernel/neon_qu8_add #47:    AVG=0.0550 ms
  SchedulerTimer/expanded_conv_12/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #45:    AVG=0.3530 ms
  SchedulerTimer/expanded_conv_12/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #46:    AVG=1.4610 ms
  SchedulerTimer/expanded_conv_13/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #48:    AVG=1.9520 ms
  SchedulerTimer/expanded_conv_13/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s2_output2x2_mla_depthfirst #49:    AVG=0.1260 ms
  SchedulerTimer/expanded_conv_13/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #50:    AVG=0.6480 ms
  SchedulerTimer/expanded_conv_14/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #51:    AVG=1.2200 ms
  SchedulerTimer/expanded_conv_14/add/CpuAddKernel/neon_qu8_add #54:    AVG=0.0250 ms
  SchedulerTimer/expanded_conv_14/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #52:    AVG=0.1890 ms
  SchedulerTimer/expanded_conv_14/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #53:    AVG=1.0470 ms
  SchedulerTimer/expanded_conv_15/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #55:    AVG=1.2330 ms
  SchedulerTimer/expanded_conv_15/add/CpuAddKernel/neon_qu8_add #58:    AVG=0.0250 ms
  SchedulerTimer/expanded_conv_15/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #56:    AVG=0.1890 ms
  SchedulerTimer/expanded_conv_15/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #57:    AVG=1.0430 ms
  SchedulerTimer/expanded_conv_16/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #59:    AVG=1.2170 ms
  SchedulerTimer/expanded_conv_16/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #60:    AVG=0.2090 ms
  SchedulerTimer/expanded_conv_16/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #61:    AVG=2.0660 ms
  SchedulerTimer/expanded_conv_2/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #7:    AVG=4.1830 ms
  SchedulerTimer/expanded_conv_2/add/CpuAddKernel/neon_qu8_add #10:    AVG=0.2540 ms
  SchedulerTimer/expanded_conv_2/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #8:    AVG=1.4350 ms
  SchedulerTimer/expanded_conv_2/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #9:    AVG=2.0210 ms
  SchedulerTimer/expanded_conv_3/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #11:    AVG=4.0670 ms
  SchedulerTimer/expanded_conv_3/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s2_output2x2_mla_depthfirst #12:    AVG=0.4420 ms
  SchedulerTimer/expanded_conv_3/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #13:    AVG=0.6780 ms
  SchedulerTimer/expanded_conv_4/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #14:    AVG=1.3340 ms
  SchedulerTimer/expanded_conv_4/add/CpuAddKernel/neon_qu8_add #17:    AVG=0.0780 ms
  SchedulerTimer/expanded_conv_4/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #15:    AVG=0.4780 ms
  SchedulerTimer/expanded_conv_4/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #16:    AVG=0.7980 ms
  SchedulerTimer/expanded_conv_5/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #18:    AVG=1.3290 ms
  SchedulerTimer/expanded_conv_5/add/CpuAddKernel/neon_qu8_add #21:    AVG=0.0760 ms
  SchedulerTimer/expanded_conv_5/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #19:    AVG=0.4720 ms
  SchedulerTimer/expanded_conv_5/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #20:    AVG=0.7620 ms
  SchedulerTimer/expanded_conv_6/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #22:    AVG=1.3390 ms
  SchedulerTimer/expanded_conv_6/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s2_output2x2_mla_depthfirst #23:    AVG=0.1360 ms
  SchedulerTimer/expanded_conv_6/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #24:    AVG=0.3890 ms
  SchedulerTimer/expanded_conv_7/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #25:    AVG=0.9680 ms
  SchedulerTimer/expanded_conv_7/add/CpuAddKernel/neon_qu8_add #28:    AVG=0.0390 ms
  SchedulerTimer/expanded_conv_7/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #26:    AVG=0.2390 ms
  SchedulerTimer/expanded_conv_7/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #27:    AVG=0.6820 ms
  SchedulerTimer/expanded_conv_8/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #29:    AVG=0.9730 ms
  SchedulerTimer/expanded_conv_8/add/CpuAddKernel/neon_qu8_add #32:    AVG=0.0380 ms
  SchedulerTimer/expanded_conv_8/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #30:    AVG=0.2350 ms
  SchedulerTimer/expanded_conv_8/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #31:    AVG=0.6750 ms
  SchedulerTimer/expanded_conv_9/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #33:    AVG=0.9570 ms
  SchedulerTimer/expanded_conv_9/add/CpuAddKernel/neon_qu8_add #36:    AVG=0.0380 ms
  SchedulerTimer/expanded_conv_9/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #34:    AVG=0.2380 ms
  SchedulerTimer/expanded_conv_9/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #35:    AVG=0.6730 ms
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 0 second(s)
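
To see at a glance whether any SVE kernels were selected, the SchedulerTimer lines can be grouped by the prefix of the kernel name. This is only a sketch: `count_kernels_by_isa` is a hypothetical helper, and it assumes that SVE kernels carry a name starting with `sve`, in the same way the a64_ and neon_ prefixes appear in the log above.

```shell
#!/bin/sh
# count_kernels_by_isa: read SCHEDULER_TIMER_MS output on stdin and count
# kernels by the prefix of the last path component of each timer name.
count_kernels_by_isa() {
    awk '
        $1 ~ "^SchedulerTimer/" {
            n = split($1, part, "/")
            name = part[n]                      # e.g. a64_gemm_u8_4x4
            if (name ~ /^sve/)        isa = "sve"   # assumed SVE prefix
            else if (name ~ /^a64_/)  isa = "a64"
            else if (name ~ /^neon_/) isa = "neon"
            else                      isa = "other" # no ISA prefix in the name
            count[isa]++
        }
        END {
            printf "sve=%d a64=%d neon=%d other=%d\n",
                count["sve"], count["a64"], count["neon"], count["other"]
        }
    '
}

# Sample input: four lines taken from the log above.
count_kernels_by_isa <<'EOF'
  SchedulerTimer/Conv/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #1:    AVG=4.2440 ms
  SchedulerTimer/Conv/CpuIm2ColKernel #0:    AVG=0.9520 ms
  SchedulerTimer/Predictions/Softmax/CpuLogits1DMaxKernel/neon_qu8_logits_1d_max #66:    AVG=0.0110 ms
  SchedulerTimer/expanded_conv_11/add/CpuAddKernel/neon_qu8_add #43:    AVG=0.0560 ms
EOF
```

For the sample this prints `sve=0 a64=1 neon=2 other=1`. A count of zero for `sve` on a machine that reports SVE would be a reason to double-check the build options.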

Hope this helps