Closed: yohuna77777 closed this issue 1 month ago
Hi @yohuna77777
The library will take care of this for you; there is no need to do anything different when calling the operators. At runtime, the library checks whether the CPU supports SVE and, if it does, chooses the best SVE kernel for the given workload.
When you build the library, just make sure you use the option multi_isa=1. The prebuilt binaries we publish for Linux and Android are both multi_isa.
Please see the sgemm example included in the library.
Hope this helps.
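For reference, here is a minimal sketch of what the call site looks like, modelled on examples/neon_sgemm.cpp (the matrix sizes and the data-filling step are placeholders you would replace with your own):

// Minimal SGEMM sketch modelled on examples/neon_sgemm.cpp.
// The same NEGEMM call is used regardless of ISA; with a multi_isa build
// the library picks an SVE kernel at runtime if the CPU supports it.
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

int main()
{
    const size_t M = 256, N = 256, K = 256; // placeholder problem size

    Tensor a{}, b{}, dst{};
    // ACL tensor shapes are (width, height), i.e. (columns, rows).
    a.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::F32));
    b.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::F32));

    // Configure the operator before allocating the tensors.
    NEGEMM sgemm{};
    sgemm.configure(&a, &b, nullptr, &dst, 1.0f /* alpha */, 0.0f /* beta */);

    a.allocator()->allocate();
    b.allocator()->allocate();
    dst.allocator()->allocate();

    // ... fill a and b with your data here ...

    sgemm.run(); // kernel selection (NEON vs SVE) happens inside the library
    return 0;
}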
Thanks for your reply. I used the prebuilt version (v24.08) to compile examples/neon_sgemm.cpp. After running the executable, I disassembled it and found that no SVE registers were used, only NEON registers. In other words, NEGEMM does not appear to contain the SVE-optimized operator. Which runtime operator should I call to use the SVE kernel? (The above experiment was conducted on an Arm server that supports 256-bit SVE.)
Hi @yohuna77777
You should call NEGEMM; there is no need to do anything else. At runtime, the library will choose an SVE kernel if it can detect support for SVE. Generally speaking, the high-level API in ACL is about operators/functions and you do not need to care about ISAs, as this is handled for you by ACL. Each function can run one or more kernels, and there is code in the library to select the best kernels based on the workload configuration and the CPU features detected at runtime.
I'd say first check that the library is seeing SVE support as expected on the target platform; you can do this by running arm_compute_validation (a programmatic alternative is sketched after the example output below):
./arm_compute_validation --filter-id=100
Version = arm_compute_version=v0.0-unreleased Build options: {'os': 'linux', 'opencl': '0', 'asserts': '1', 'examples': '1', 'neon': '1', 'arch': 'armv8a', 'benchmark_examples': '0', 'multi_isa': '0', 'debug': '0', 'validation_tests': '1'} Git hash=b'ef28fedc2c34d16532aae0f019b217f684453ce3'
CommandLine = ./arm_compute_validation --filter-id=100
Seed = 3639253183
cpu_has_sve = false
cpu_has_sve2 = false
cpu_has_svef32mm = false
cpu_has_svei8mm = false
cpu_has_svebf16 = false
cpu_has_sme = false
cpu_has_sme2 = false
cpu_has_fp16 = false
cpu_has_bf16 = false
cpu_has_dotprod = false
cpu_has_i8mm = false
CPU0 = A53
CPU1 = A53
CPU2 = A53
CPU3 = A53
CPU4 = A73
CPU5 = A73
CPU6 = A73
CPU7 = A73
Iterations = 1
Threads = 1
Dataset mode = PRECOMMIT
Running [100] 'CPP/DFT/DFT1D/Complex@TensorShape=23,7'
Wall clock/Wall clock time: AVG=4772.0000 us
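If you want to confirm from your own application what the runtime detected, you can also query the CPU feature flags directly. This is only a minimal sketch; it assumes Scheduler::get().cpu_info() and the has_sve()/has_sve2() accessors are present in your version of ACL (check arm_compute/core/CPP/CPPTypes.h if they are not):

// Sketch: print the CPU features ACL detected at runtime.
// Assumes Scheduler::get().cpu_info() and has_sve()/has_sve2() exist
// in this ACL version; adjust to your headers if they differ.
#include <iostream>

#include "arm_compute/core/CPP/CPPTypes.h"
#include "arm_compute/runtime/Scheduler.h"

int main()
{
    const arm_compute::CPUInfo &ci = arm_compute::Scheduler::get().cpu_info();
    std::cout << std::boolalpha
              << "cpu_has_sve  = " << ci.has_sve() << "\n"
              << "cpu_has_sve2 = " << ci.has_sve2() << "\n";
    return 0;
}

On your 256-bit SVE server with a multi_isa build these should report true; if they do not, runtime feature detection is the first thing to investigate.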
An alternative would be to use the ACL benchmark examples along with the instrumentation to analyze which kernels are actually selected. If you build the library with benchmark_examples=1, you can use the instruments to look into the graph example performance as shown below:
root@acl_hikey_9:~/tmp/acl_mt# LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args='--layout=NHWC,--target=NEON,--fast-math,--type=QASYMM8'
Version = arm_compute_version=v0.0-unreleased Build options: {'standalone': '0', 'test_filter': 'ActivationLayer.cpp', 'opencl': '1', 'neon': '1', 'validation_tests': '1', 'examples': '0', 'debug': '0', 'arch': 'armv8a', 'benchmark_examples': '1'} Git hash=065d46b0042cb974063e915715f3295ca265e078
CommandLine = ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args=--layout=NHWC,--target=NEON,--fast-math,--type=QASYMM8
CL_DEVICE_VERSION = OpenCL 2.0 not_released.51d50be.1502459415db9c37cbcbea279386cb09
Iterations = 1
Running [0] 'Examples/benchmark_graph_mobilenet_v2'
Threads : 1
Target : Neon
Data type : QASYMM8
Data layout : NHWC
Tuner enabled? : false
Cache enabled? : false
Tuner mode : Normal
Tuner file :
MLGO file :
Fast math enabled? : true
SchedulerTimer/Conv/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #1: AVG=4.2440 ms
SchedulerTimer/Conv/CpuIm2ColKernel #0: AVG=0.9520 ms
SchedulerTimer/Conv_1/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #62: AVG=2.9500 ms
SchedulerTimer/Logits/AvgPool/CpuPool2dAssemblyWrapperKernel #63: AVG=0.0550 ms
SchedulerTimer/Logits/Conv2d_1c_1x1/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #64: AVG=0.6760 ms
SchedulerTimer/Predictions/Reshape/CpuReshapeKernel #65: AVG=0.0590 ms
SchedulerTimer/Predictions/Softmax/CpuLogits1DMaxKernel/neon_qu8_logits_1d_max #66: AVG=0.0110 ms
SchedulerTimer/Predictions/Softmax/CpuLogits1DSoftmaxKernel/neon_qu8_softmax_logits_1d #67: AVG=0.0860 ms
SchedulerTimer/expanded_conv/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #2: AVG=1.4960 ms
SchedulerTimer/expanded_conv/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #3: AVG=2.6010 ms
SchedulerTimer/expanded_conv_1/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #4: AVG=8.4050 ms
SchedulerTimer/expanded_conv_1/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s2_output2x2_mla_depthfirst #5: AVG=1.0920 ms
SchedulerTimer/expanded_conv_1/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #6: AVG=1.6500 ms
SchedulerTimer/expanded_conv_10/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #37: AVG=0.9680 ms
SchedulerTimer/expanded_conv_10/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #38: AVG=0.2350 ms
SchedulerTimer/expanded_conv_10/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #39: AVG=1.0050 ms
SchedulerTimer/expanded_conv_11/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #40: AVG=1.8990 ms
SchedulerTimer/expanded_conv_11/add/CpuAddKernel/neon_qu8_add #43: AVG=0.0560 ms
SchedulerTimer/expanded_conv_11/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #41: AVG=0.3510 ms
SchedulerTimer/expanded_conv_11/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #42: AVG=1.4570 ms
SchedulerTimer/expanded_conv_12/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #44: AVG=1.8920 ms
SchedulerTimer/expanded_conv_12/add/CpuAddKernel/neon_qu8_add #47: AVG=0.0550 ms
SchedulerTimer/expanded_conv_12/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #45: AVG=0.3530 ms
SchedulerTimer/expanded_conv_12/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #46: AVG=1.4610 ms
SchedulerTimer/expanded_conv_13/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #48: AVG=1.9520 ms
SchedulerTimer/expanded_conv_13/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s2_output2x2_mla_depthfirst #49: AVG=0.1260 ms
SchedulerTimer/expanded_conv_13/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #50: AVG=0.6480 ms
SchedulerTimer/expanded_conv_14/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #51: AVG=1.2200 ms
SchedulerTimer/expanded_conv_14/add/CpuAddKernel/neon_qu8_add #54: AVG=0.0250 ms
SchedulerTimer/expanded_conv_14/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #52: AVG=0.1890 ms
SchedulerTimer/expanded_conv_14/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #53: AVG=1.0470 ms
SchedulerTimer/expanded_conv_15/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #55: AVG=1.2330 ms
SchedulerTimer/expanded_conv_15/add/CpuAddKernel/neon_qu8_add #58: AVG=0.0250 ms
SchedulerTimer/expanded_conv_15/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #56: AVG=0.1890 ms
SchedulerTimer/expanded_conv_15/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #57: AVG=1.0430 ms
SchedulerTimer/expanded_conv_16/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #59: AVG=1.2170 ms
SchedulerTimer/expanded_conv_16/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #60: AVG=0.2090 ms
SchedulerTimer/expanded_conv_16/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #61: AVG=2.0660 ms
SchedulerTimer/expanded_conv_2/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #7: AVG=4.1830 ms
SchedulerTimer/expanded_conv_2/add/CpuAddKernel/neon_qu8_add #10: AVG=0.2540 ms
SchedulerTimer/expanded_conv_2/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #8: AVG=1.4350 ms
SchedulerTimer/expanded_conv_2/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #9: AVG=2.0210 ms
SchedulerTimer/expanded_conv_3/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #11: AVG=4.0670 ms
SchedulerTimer/expanded_conv_3/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s2_output2x2_mla_depthfirst #12: AVG=0.4420 ms
SchedulerTimer/expanded_conv_3/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #13: AVG=0.6780 ms
SchedulerTimer/expanded_conv_4/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #14: AVG=1.3340 ms
SchedulerTimer/expanded_conv_4/add/CpuAddKernel/neon_qu8_add #17: AVG=0.0780 ms
SchedulerTimer/expanded_conv_4/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #15: AVG=0.4780 ms
SchedulerTimer/expanded_conv_4/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #16: AVG=0.7980 ms
SchedulerTimer/expanded_conv_5/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #18: AVG=1.3290 ms
SchedulerTimer/expanded_conv_5/add/CpuAddKernel/neon_qu8_add #21: AVG=0.0760 ms
SchedulerTimer/expanded_conv_5/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #19: AVG=0.4720 ms
SchedulerTimer/expanded_conv_5/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #20: AVG=0.7620 ms
SchedulerTimer/expanded_conv_6/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #22: AVG=1.3390 ms
SchedulerTimer/expanded_conv_6/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s2_output2x2_mla_depthfirst #23: AVG=0.1360 ms
SchedulerTimer/expanded_conv_6/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #24: AVG=0.3890 ms
SchedulerTimer/expanded_conv_7/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #25: AVG=0.9680 ms
SchedulerTimer/expanded_conv_7/add/CpuAddKernel/neon_qu8_add #28: AVG=0.0390 ms
SchedulerTimer/expanded_conv_7/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #26: AVG=0.2390 ms
SchedulerTimer/expanded_conv_7/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #27: AVG=0.6820 ms
SchedulerTimer/expanded_conv_8/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #29: AVG=0.9730 ms
SchedulerTimer/expanded_conv_8/add/CpuAddKernel/neon_qu8_add #32: AVG=0.0380 ms
SchedulerTimer/expanded_conv_8/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #30: AVG=0.2350 ms
SchedulerTimer/expanded_conv_8/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #31: AVG=0.6750 ms
SchedulerTimer/expanded_conv_9/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #33: AVG=0.9570 ms
SchedulerTimer/expanded_conv_9/add/CpuAddKernel/neon_qu8_add #36: AVG=0.0380 ms
SchedulerTimer/expanded_conv_9/depthwise/depthwise/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_u8q_nhwc_3x3_s1_output2x2_mla_depthfirst #34: AVG=0.2380 ms
SchedulerTimer/expanded_conv_9/project/Conv2D/CpuGemmAssemblyWrapperKernel/a64_gemm_u8_4x4 #35: AVG=0.6730 ms
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 0 second(s)
Hope this helps.
Hello, I am considering using the SVE instruction set to optimize GEMM operators. I found that although the repository has the relevant code, there is no example showing how to call these GEMM kernels based on the SVE instruction set. Can you provide a relevant usage example? I am looking forward to your reply!