ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.

Benchmark for CPU with one core and one thread #1062

Closed wenhyan closed 1 year ago

wenhyan commented 1 year ago

Hi, do you have any benchmark data for a single CPU core with one thread? Or if I want to run some benchmarks myself, how should I do that?

morgolock commented 1 year ago

Hi @wenhyan

If you build the library with benchmark_examples=1 then you can use the instruments to look into the performance of the graph examples.
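
A typical build line enabling the benchmark examples looks something like the sketch below (adjust os and arch for your target; the exact options used for a given binary are echoed in the Version line of the benchmark output):

scons os=linux arch=armv8a neon=1 opencl=0 debug=0 asserts=1 benchmark_examples=1 -j8

With that build in place you can run one of the graph benchmark examples with the scheduler timer instrument, as shown below: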

~/tmp/acl_mt# LD_LIBRARY_PATH=./main_release-logging/:$LD_LIBRARY_PATH ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args=--target=NEON,--fast-math,--threads=1 
Version = e6209e1df1094b582cd427c81fc289a42c495ad6
CommandLine = ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args=--target=NEON,--fast-math,--threads=1 
Iterations = 1
Running [0] 'Examples/benchmark_graph_mobilenet_v2'
Threads : 1
Target : Neon
Data type : F32
Data layout : NHWC
Tuner enabled? : false
Cache enabled? : false
Tuner mode : Normal
Tuner file : 
MLGO file : 
Fast math enabled? : true

  SchedulerTimer/Conv+Conv/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #2:    AVG=2.5130 ms
  SchedulerTimer/Conv+Conv/BatchNorm/CpuIm2ColKernel #1:    AVG=3.5220 ms
  SchedulerTimer/Conv+Conv/BatchNorm/CpuWeightsReshapeKernel #0:    AVG=0.0410 ms
  SchedulerTimer/Conv_1+Conv_1/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #97:    AVG=4.4310 ms
  SchedulerTimer/Conv_1+Conv_1/BatchNorm/CpuWeightsReshapeKernel #96:    AVG=8.6160 ms
  SchedulerTimer/Logits/AvgPool/CpuPool2dAssemblyWrapperKernel #98:    AVG=0.3090 ms
  SchedulerTimer/Logits/Conv2d_1c_1x1/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #99:    AVG=1.1030 ms
  SchedulerTimer/Predictions/Reshape/CpuReshapeKernel #100:    AVG=0.1070 ms
  SchedulerTimer/Predictions/Softmax/CpuLogits1DMaxKernel/neon_fp32_logits_1d_max #101:    AVG=0.2070 ms
  SchedulerTimer/Predictions/Softmax/CpuLogits1DSoftmaxKernel/neon_fp32_softmax_logits_1d #102:    AVG=0.0890 ms
  SchedulerTimer/expanded_conv/depthwise/depthwise+expanded_conv/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #3:    AVG=3.0900 ms
  SchedulerTimer/expanded_conv/project/Conv2D+expanded_conv/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #5:    AVG=1.3380 ms
  SchedulerTimer/expanded_conv/project/Conv2D+expanded_conv/project/BatchNorm/CpuWeightsReshapeKernel #4:    AVG=0.0330 ms
  SchedulerTimer/expanded_conv_1/depthwise/depthwise+expanded_conv_1/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #8:    AVG=3.4450 ms
  SchedulerTimer/expanded_conv_1/expand/Conv2D+expanded_conv_1/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_smallK_hybrid_fp32_mla_6x4 #7:    AVG=5.2310 ms
  SchedulerTimer/expanded_conv_1/expand/Conv2D+expanded_conv_1/expand/BatchNorm/CpuWeightsReshapeKernel #6:    AVG=0.0510 ms
  SchedulerTimer/expanded_conv_1/project/Conv2D+expanded_conv_1/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #10:    AVG=1.6070 ms
  SchedulerTimer/expanded_conv_1/project/Conv2D+expanded_conv_1/project/BatchNorm/CpuWeightsReshapeKernel #9:    AVG=0.0610 ms
  SchedulerTimer/expanded_conv_10/depthwise/depthwise+expanded_conv_10/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #59:    AVG=0.2540 ms
  SchedulerTimer/expanded_conv_10/expand/Conv2D+expanded_conv_10/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #58:    AVG=1.0320 ms
  SchedulerTimer/expanded_conv_10/expand/Conv2D+expanded_conv_10/expand/BatchNorm/CpuWeightsReshapeKernel #57:    AVG=0.5140 ms
  SchedulerTimer/expanded_conv_10/project/Conv2D+expanded_conv_10/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #61:    AVG=1.4140 ms
  SchedulerTimer/expanded_conv_10/project/Conv2D+expanded_conv_10/project/BatchNorm/CpuWeightsReshapeKernel #60:    AVG=0.7220 ms
  SchedulerTimer/expanded_conv_11/add/CpuAddKernel/neon_fp32_add #67:    AVG=0.0310 ms
  SchedulerTimer/expanded_conv_11/depthwise/depthwise+expanded_conv_11/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #64:    AVG=0.5550 ms
  SchedulerTimer/expanded_conv_11/expand/Conv2D+expanded_conv_11/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #63:    AVG=2.3590 ms
  SchedulerTimer/expanded_conv_11/expand/Conv2D+expanded_conv_11/expand/BatchNorm/CpuWeightsReshapeKernel #62:    AVG=1.1460 ms
  SchedulerTimer/expanded_conv_11/project/Conv2D+expanded_conv_11/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #66:    AVG=2.2200 ms
  SchedulerTimer/expanded_conv_11/project/Conv2D+expanded_conv_11/project/BatchNorm/CpuWeightsReshapeKernel #65:    AVG=1.0830 ms
  SchedulerTimer/expanded_conv_12/add/CpuAddKernel/neon_fp32_add #73:    AVG=0.0280 ms
  SchedulerTimer/expanded_conv_12/depthwise/depthwise+expanded_conv_12/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #70:    AVG=0.4280 ms
  SchedulerTimer/expanded_conv_12/expand/Conv2D+expanded_conv_12/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #69:    AVG=2.5380 ms
  SchedulerTimer/expanded_conv_12/expand/Conv2D+expanded_conv_12/expand/BatchNorm/CpuWeightsReshapeKernel #68:    AVG=1.1480 ms
  SchedulerTimer/expanded_conv_12/project/Conv2D+expanded_conv_12/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #72:    AVG=2.1850 ms
  SchedulerTimer/expanded_conv_12/project/Conv2D+expanded_conv_12/project/BatchNorm/CpuWeightsReshapeKernel #71:    AVG=1.0840 ms
  SchedulerTimer/expanded_conv_13/depthwise/depthwise+expanded_conv_13/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #76:    AVG=0.2700 ms
  SchedulerTimer/expanded_conv_13/expand/Conv2D+expanded_conv_13/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #75:    AVG=2.4880 ms
  SchedulerTimer/expanded_conv_13/expand/Conv2D+expanded_conv_13/expand/BatchNorm/CpuWeightsReshapeKernel #74:    AVG=1.1470 ms
  SchedulerTimer/expanded_conv_13/project/Conv2D+expanded_conv_13/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #78:    AVG=1.0430 ms
  SchedulerTimer/expanded_conv_13/project/Conv2D+expanded_conv_13/project/BatchNorm/CpuWeightsReshapeKernel #77:    AVG=1.8020 ms
  SchedulerTimer/expanded_conv_14/add/CpuAddKernel/neon_fp32_add #84:    AVG=0.0230 ms
  SchedulerTimer/expanded_conv_14/depthwise/depthwise+expanded_conv_14/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #81:    AVG=0.2520 ms
  SchedulerTimer/expanded_conv_14/expand/Conv2D+expanded_conv_14/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #80:    AVG=1.6480 ms
  SchedulerTimer/expanded_conv_14/expand/Conv2D+expanded_conv_14/expand/BatchNorm/CpuWeightsReshapeKernel #79:    AVG=3.1190 ms
  SchedulerTimer/expanded_conv_14/project/Conv2D+expanded_conv_14/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #83:    AVG=1.6420 ms
  SchedulerTimer/expanded_conv_14/project/Conv2D+expanded_conv_14/project/BatchNorm/CpuWeightsReshapeKernel #82:    AVG=3.0550 ms
  SchedulerTimer/expanded_conv_15/add/CpuAddKernel/neon_fp32_add #90:    AVG=0.0220 ms
  SchedulerTimer/expanded_conv_15/depthwise/depthwise+expanded_conv_15/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #87:    AVG=0.2840 ms
  SchedulerTimer/expanded_conv_15/expand/Conv2D+expanded_conv_15/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #86:    AVG=1.6900 ms
  SchedulerTimer/expanded_conv_15/expand/Conv2D+expanded_conv_15/expand/BatchNorm/CpuWeightsReshapeKernel #85:    AVG=3.1830 ms
  SchedulerTimer/expanded_conv_15/project/Conv2D+expanded_conv_15/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #89:    AVG=1.6320 ms
  SchedulerTimer/expanded_conv_15/project/Conv2D+expanded_conv_15/project/BatchNorm/CpuWeightsReshapeKernel #88:    AVG=2.9910 ms
  SchedulerTimer/expanded_conv_16/depthwise/depthwise+expanded_conv_16/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #93:    AVG=0.2840 ms
  SchedulerTimer/expanded_conv_16/expand/Conv2D+expanded_conv_16/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #92:    AVG=1.6430 ms
  SchedulerTimer/expanded_conv_16/expand/Conv2D+expanded_conv_16/expand/BatchNorm/CpuWeightsReshapeKernel #91:    AVG=3.1740 ms
  SchedulerTimer/expanded_conv_16/project/Conv2D+expanded_conv_16/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #95:    AVG=3.3840 ms
  SchedulerTimer/expanded_conv_16/project/Conv2D+expanded_conv_16/project/BatchNorm/CpuWeightsReshapeKernel #94:    AVG=6.1190 ms
  SchedulerTimer/expanded_conv_2/add/CpuAddKernel/neon_fp32_add #16:    AVG=0.1270 ms
  SchedulerTimer/expanded_conv_2/depthwise/depthwise+expanded_conv_2/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #13:    AVG=2.8090 ms
  SchedulerTimer/expanded_conv_2/expand/Conv2D+expanded_conv_2/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #12:    AVG=2.4810 ms
  SchedulerTimer/expanded_conv_2/expand/Conv2D+expanded_conv_2/expand/BatchNorm/CpuWeightsReshapeKernel #11:    AVG=0.0850 ms
  SchedulerTimer/expanded_conv_2/project/Conv2D+expanded_conv_2/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #15:    AVG=2.4890 ms
  SchedulerTimer/expanded_conv_2/project/Conv2D+expanded_conv_2/project/BatchNorm/CpuWeightsReshapeKernel #14:    AVG=0.0850 ms
  SchedulerTimer/expanded_conv_3/depthwise/depthwise+expanded_conv_3/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #19:    AVG=0.9390 ms
  SchedulerTimer/expanded_conv_3/expand/Conv2D+expanded_conv_3/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #18:    AVG=2.4380 ms
  SchedulerTimer/expanded_conv_3/expand/Conv2D+expanded_conv_3/expand/BatchNorm/CpuWeightsReshapeKernel #17:    AVG=0.0880 ms
  SchedulerTimer/expanded_conv_3/project/Conv2D+expanded_conv_3/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #21:    AVG=0.7390 ms
  SchedulerTimer/expanded_conv_3/project/Conv2D+expanded_conv_3/project/BatchNorm/CpuWeightsReshapeKernel #20:    AVG=0.1070 ms
  SchedulerTimer/expanded_conv_4/add/CpuAddKernel/neon_fp32_add #27:    AVG=0.0320 ms
  SchedulerTimer/expanded_conv_4/depthwise/depthwise+expanded_conv_4/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #24:    AVG=0.5840 ms
  SchedulerTimer/expanded_conv_4/expand/Conv2D+expanded_conv_4/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #23:    AVG=1.0290 ms
  SchedulerTimer/expanded_conv_4/expand/Conv2D+expanded_conv_4/expand/BatchNorm/CpuWeightsReshapeKernel #22:    AVG=0.2110 ms
  SchedulerTimer/expanded_conv_4/project/Conv2D+expanded_conv_4/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #26:    AVG=0.9900 ms
  SchedulerTimer/expanded_conv_4/project/Conv2D+expanded_conv_4/project/BatchNorm/CpuWeightsReshapeKernel #25:    AVG=0.1250 ms
  SchedulerTimer/expanded_conv_5/add/CpuAddKernel/neon_fp32_add #33:    AVG=0.0300 ms
  SchedulerTimer/expanded_conv_5/depthwise/depthwise+expanded_conv_5/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #30:    AVG=0.5200 ms
  SchedulerTimer/expanded_conv_5/expand/Conv2D+expanded_conv_5/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #29:    AVG=1.1130 ms
  SchedulerTimer/expanded_conv_5/expand/Conv2D+expanded_conv_5/expand/BatchNorm/CpuWeightsReshapeKernel #28:    AVG=0.1400 ms
  SchedulerTimer/expanded_conv_5/project/Conv2D+expanded_conv_5/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #32:    AVG=0.9270 ms
  SchedulerTimer/expanded_conv_5/project/Conv2D+expanded_conv_5/project/BatchNorm/CpuWeightsReshapeKernel #31:    AVG=0.1260 ms
  SchedulerTimer/expanded_conv_6/depthwise/depthwise+expanded_conv_6/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #36:    AVG=0.2770 ms
  SchedulerTimer/expanded_conv_6/expand/Conv2D+expanded_conv_6/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #35:    AVG=1.0730 ms
  SchedulerTimer/expanded_conv_6/expand/Conv2D+expanded_conv_6/expand/BatchNorm/CpuWeightsReshapeKernel #34:    AVG=0.1390 ms
  SchedulerTimer/expanded_conv_6/project/Conv2D+expanded_conv_6/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #38:    AVG=0.4550 ms
  SchedulerTimer/expanded_conv_6/project/Conv2D+expanded_conv_6/project/BatchNorm/CpuWeightsReshapeKernel #37:    AVG=0.2470 ms
  SchedulerTimer/expanded_conv_7/add/CpuAddKernel/neon_fp32_add #44:    AVG=0.0200 ms
  SchedulerTimer/expanded_conv_7/depthwise/depthwise+expanded_conv_7/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #41:    AVG=0.3410 ms
  SchedulerTimer/expanded_conv_7/expand/Conv2D+expanded_conv_7/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #40:    AVG=1.0210 ms
  SchedulerTimer/expanded_conv_7/expand/Conv2D+expanded_conv_7/expand/BatchNorm/CpuWeightsReshapeKernel #39:    AVG=0.5130 ms
  SchedulerTimer/expanded_conv_7/project/Conv2D+expanded_conv_7/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #43:    AVG=1.0330 ms
  SchedulerTimer/expanded_conv_7/project/Conv2D+expanded_conv_7/project/BatchNorm/CpuWeightsReshapeKernel #42:    AVG=0.4840 ms
  SchedulerTimer/expanded_conv_8/add/CpuAddKernel/neon_fp32_add #50:    AVG=0.0160 ms
  SchedulerTimer/expanded_conv_8/depthwise/depthwise+expanded_conv_8/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #47:    AVG=0.2520 ms
  SchedulerTimer/expanded_conv_8/expand/Conv2D+expanded_conv_8/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #46:    AVG=1.0650 ms
  SchedulerTimer/expanded_conv_8/expand/Conv2D+expanded_conv_8/expand/BatchNorm/CpuWeightsReshapeKernel #45:    AVG=0.5130 ms
  SchedulerTimer/expanded_conv_8/project/Conv2D+expanded_conv_8/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #49:    AVG=0.9930 ms
  SchedulerTimer/expanded_conv_8/project/Conv2D+expanded_conv_8/project/BatchNorm/CpuWeightsReshapeKernel #48:    AVG=0.4850 ms
  SchedulerTimer/expanded_conv_9/add/CpuAddKernel/neon_fp32_add #56:    AVG=0.0180 ms
  SchedulerTimer/expanded_conv_9/depthwise/depthwise+expanded_conv_9/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #53:    AVG=0.2520 ms
  SchedulerTimer/expanded_conv_9/expand/Conv2D+expanded_conv_9/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #52:    AVG=1.0380 ms
  SchedulerTimer/expanded_conv_9/expand/Conv2D+expanded_conv_9/expand/BatchNorm/CpuWeightsReshapeKernel #51:    AVG=0.5130 ms
  SchedulerTimer/expanded_conv_9/project/Conv2D+expanded_conv_9/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #55:    AVG=0.9770 ms
  SchedulerTimer/expanded_conv_9/project/Conv2D+expanded_conv_9/project/BatchNorm/CpuWeightsReshapeKernel #54:    AVG=0.4840 ms
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 0 second(s)
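
Since you are specifically after single-core numbers, note that --threads=1 only limits the number of worker threads used by the scheduler. If you also want to pin the process to one particular core (for instance to avoid migrations between big and LITTLE cores), you could additionally prefix the command with the standard Linux taskset utility, e.g.:

LD_LIBRARY_PATH=./main_release-logging/:$LD_LIBRARY_PATH taskset -c 0 ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args=--target=NEON,--fast-math,--threads=1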

Alternatively, you could use the TensorFlow Lite benchmark tool with the ArmNN delegate, as shown below:

$ LD_LIBRARY_PATH=./armnn/main/:$LD_LIBRARY_PATH ./linux_aarch64_benchmark_model --graph=./wdsr_960.tflite --num_threads=4 --num_runs=120 --warmup_runs=1 --external_delegate_path="armnn/main/libarmnnDelegate.so" --external_delegate_options="backends:CpuAcc"
STARTING!
Log parameter values verbosely: [0]
Min num runs: [120]
Num threads: [4]
Min warmup runs: [1]
Graph: [./wdsr_960.tflite]
#threads used for CPU inference: [4]
External delegate path: [armnn/main/libarmnnDelegate.so]
External delegate options: [backends:CpuAcc]
Loaded model ./wdsr_960.tflite
INFO: TfLiteArmnnDelegate: Created TfLite ArmNN delegate.
EXTERNAL delegate created.
Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 0.011828
Initialized session in 31.93ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=23 first=40037 curr=26812 min=18118 max=40037 avg=22013.8 std=4449

Running benchmark for at least 120 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=120 first=25867 curr=37586 min=24979 max=46407 avg=33329.7 std=2946

Inference timings in us: Init: 31930, First inference: 40037, Warmup (avg): 22013.8, Inference (avg): 33329.7
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=74.5625 overall=404.5
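
For the single-core, single-thread case you asked about, the same command can be run with --num_threads=1. If I remember correctly the ArmNN delegate also has its own number-of-threads option that can be passed through --external_delegate_options (please double-check the delegate documentation for the exact syntax), for example:

$ LD_LIBRARY_PATH=./armnn/main/:$LD_LIBRARY_PATH ./linux_aarch64_benchmark_model --graph=./wdsr_960.tflite --num_threads=1 --num_runs=120 --warmup_runs=1 --external_delegate_path="armnn/main/libarmnnDelegate.so" --external_delegate_options="backends:CpuAcc;number-of-threads:1"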

You will need the prebuilt binary for this tool, which can be downloaded from https://www.tensorflow.org/lite/performance/measurement, as well as the ArmNN TFLite delegate binary. For more information about ArmNN please see https://github.com/ARM-software/armnn

Hope this helps

wenhyan commented 1 year ago

@morgolock Hi, thanks. What hardware did you use for your benchmark? An A53? Does the data layout (NHWC) have a big impact on performance?

morgolock commented 1 year ago

Hi @wenhyan

Does the data layout (NHWC) have a big impact on performance?

Yes, for the best performance use NHWC. The layout NCHW is no longer being optimized.

What hardware did you use for your benchmark?

I've used a HiKey960 board.

Hope this helps.

wenhyan commented 1 year ago

@morgolock thank you very much!

wenhyan commented 1 year ago

@morgolock Hi, I want to implement a CNN with the Neon API and set the data layout to NHWC.

class NEONCNNExample : public Example
{
    public:
    bool do_setup(int argc, char **argv) override
    {
        ARM_COMPUTE_UNUSED(argc);
        ARM_COMPUTE_UNUSED(argv);

        // Create the convolution layer function
        conv0 = std::make_unique<NEDirectConvolutionLayer>();
        /* [Initialize tensors] */

        const TensorShape src_shape(320, 320, 3, 1);
        src.allocator()->init(TensorInfo(src_shape, 1, DataType::F32, DataLayout::NHWC));

        const TensorShape weights_shape_conv0(3, 3, 3, 24);
        const TensorShape biases_shape_conv0(24);
        const TensorShape out_shape_conv0(160, 160, 24, 1);

        weights0.allocator()->init(TensorInfo(weights_shape_conv0, 1, DataType::F32));
        biases0.allocator()->init(TensorInfo(biases_shape_conv0, 1, DataType::F32));
        out_conv0.allocator()->init(TensorInfo(out_shape_conv0, 1, DataType::F32));

        /* [Configure functions] */

        // in: 320x320x3: 3x3 convolution, 24 output feature maps (OFM), stride 2, pad 1
        conv0->configure(&src, &weights0, &biases0, &out_conv0, PadStrideInfo(2 /* stride_x */, 2 /* stride_y */, 1 /* pad_x */, 1 /* pad_y */));

        out_conv0.allocator()->allocate();

        /* [Allocate tensors] */

        // Now that the padding requirements are known we can allocate all tensors
        src.allocator()->allocate();
        weights0.allocator()->allocate();
        biases0.allocator()->allocate();

        printf("layout : %d\n", (int)(src.allocator()->info().data_layout()));
        return true;
    }
    void do_run() override
    {
        conv0->run();
    }

    private:
    // The src tensor should contain the input image
    Tensor src{};

    // Intermediate tensors used
    Tensor weights0{};
    Tensor biases0{};
    Tensor out_conv0{};

    // Layers
    std::unique_ptr<NEDirectConvolutionLayer> conv0{};
}; 

Is there any problem with this code? Here is the result:

./benchmark_neon_psd --instruments=NONE --instruments=WALL_CLOCK_TIMER --color-output --iterations=100 --example_args=--layout=NHWC
Version = arm_compute_version=v23.05.1 Build options: {'Werror': '1', 'debug': '0', 'asserts': '1', 'neon': '1', 'opencl': '0', 'os': 'linux', 'arch': 'armv8a', 'benchmark_examples': '1', 'benchmark_tests': '1', 'build_dir': '/home/aey1wx/workspace/RB-PLD/build_acl', 'install_dir': '/home/aey1wx/workspace/RB-PLD/install_acl'} Git hash=b'2b2ffe758dfff7255cf459a7eab26cb8aeff3061'
CommandLine = ./benchmark_neon_psd --instruments=NONE --instruments=WALL_CLOCK_TIMER --color-output --iterations=100 --example_args=--layout=NHWC 
Iterations = 100
Running [0] 'Examples/benchmark_neon_psd'
ERROR: in validate_arguments src/cpu/kernels/CpuDirectConv2dKernel.cpp:73: weights->dimension(channel_idx) != src->dimension(channel_idx)
Executed 1 test(s) (0 passed, 0 expected failures, 0 failed, 1 crashed, 0 disabled) in 0 second(s)

If I set the layout to NCHW, there is no error.

morgolock commented 1 year ago

Hi @wenhyan

Just setting the layout to NHWC is not enough; you also need to specify the shape in NHWC order. In Compute Library a TensorShape lists the dimensions starting with the fastest-changing one, so for NHWC the channel dimension comes first.

Instead of const TensorShape src_shape(320, 320, 3, 1); use const TensorShape src_shape(3, 320, 320, 1);

The output tensor needs the same change.

You can try the code below in your example:

        const TensorShape src_shape(3, 320, 320, 1);
        const TensorShape weights_shape_conv0(3, 3, 3, 24);
        const TensorShape biases_shape_conv0(24);
        const TensorShape out_shape_conv0(24, 160, 160, 1);

        src.allocator()->init(TensorInfo(src_shape, 1, DataType::F32, DataLayout::NHWC));
        weights0.allocator()->init(TensorInfo(weights_shape_conv0, 1, DataType::F32));
        biases0.allocator()->init(TensorInfo(biases_shape_conv0, 1, DataType::F32));
        out_conv0.allocator()->init(TensorInfo(out_shape_conv0, 1, DataType::F32, DataLayout::NHWC));

Hope this helps.

wenhyan commented 1 year ago

@morgolock Thank you, got it. Another question: what is the difference between GEMM and GEMM_CONV2D in the ConvolutionLayer kernels? Thanks.

morgolock commented 1 year ago

Hi @wenhyan

These two functions have different purposes. Please see the documentation for more details:

https://github.com/ARM-software/ComputeLibrary/blob/main/arm_compute/runtime/NEON/functions/NEGEMM.h
https://github.com/ARM-software/ComputeLibrary/blob/main/arm_compute/runtime/NEON/functions/NEGEMMConv2d.h

Hope this helps

wenhyan commented 1 year ago

@morgolock Sorry! I meant CpuGemmConv2d and CpuGemmDirectConv2d in the convolution layer.

case ConvolutionMethod::GEMM:
{
    auto f = std::make_unique<CpuGemmConv2d>();
    f->configure(input, weights, biases, output, conv_info, weights_info, dilation, act_info, enable_fast_math);
    _function = std::move(f);
    break;
}
case ConvolutionMethod::GEMM_CONV2D:
{
    auto f = std::make_unique<CpuGemmDirectConv2d>();
    f->configure(input, weights, biases, output, info);
    _function = std::move(f);
    break;
}

Which one has better performance, and under what conditions is CpuGemmDirectConv2d chosen?

morgolock commented 1 year ago

Hi @wenhyan

You don't have to worry about which method to choose: the operator CpuConv2d will pick the best method for the given workload. The method selected depends on the shapes and other details of the tensors passed in.

For more details about the heuristic used to select the convolution method please see https://github.com/ARM-software/ComputeLibrary/blob/main/src/cpu/operators/CpuConv2d.cpp#L140
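
If you are curious which method the heuristic would pick for your particular tensors, a minimal sketch along the lines below should work; it uses the public NEConvolutionLayer::get_convolution_method() helper (which queries the same heuristic) with the NHWC shapes from your example:

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/functions/NEConvolutionLayer.h"

#include <iostream>

using namespace arm_compute;

int main()
{
    // NHWC tensor infos: 3x320x320 input, 3x3x3x24 weights, 24x160x160 output
    TensorInfo src(TensorShape(3U, 320U, 320U, 1U), 1, DataType::F32, DataLayout::NHWC);
    TensorInfo weights(TensorShape(3U, 3U, 3U, 24U), 1, DataType::F32, DataLayout::NHWC);
    TensorInfo dst(TensorShape(24U, 160U, 160U, 1U), 1, DataType::F32, DataLayout::NHWC);

    const PadStrideInfo conv_info(2, 2, 1, 1);

    // Ask the heuristic which convolution method it would select for this workload
    const ConvolutionMethod method = NEConvolutionLayer::get_convolution_method(&src, &weights, &dst, conv_info);

    std::cout << "Selected convolution method: " << static_cast<int>(method) << std::endl;
    return 0;
}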

Hope this helps.