Hi @wenhyan
If you build the library with benchmark_examples=1, you can use the instruments option to inspect the performance of the graph examples.
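For example, a build along these lines (the options here mirror the arm_compute_version string shown later in this thread):

$ scons Werror=1 debug=0 asserts=1 neon=1 opencl=0 os=linux arch=armv8a benchmark_examples=1

Then run the benchmark wrapper for the example with an instrument selected, as shown below: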
~/tmp/acl_mt# LD_LIBRARY_PATH=./main_release-logging/:$LD_LIBRARY_PATH ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args=--target=NEON,--fast-math,--threads=1
Version = e6209e1df1094b582cd427c81fc289a42c495ad6
CommandLine = ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args=--target=NEON,--fast-math,--threads=1
Iterations = 1
Running [0] 'Examples/benchmark_graph_mobilenet_v2'
Threads : 1
Target : Neon
Data type : F32
Data layout : NHWC
Tuner enabled? : false
Cache enabled? : false
Tuner mode : Normal
Tuner file :
MLGO file :
Fast math enabled? : true
SchedulerTimer/Conv+Conv/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #2: AVG=2.5130 ms
SchedulerTimer/Conv+Conv/BatchNorm/CpuIm2ColKernel #1: AVG=3.5220 ms
SchedulerTimer/Conv+Conv/BatchNorm/CpuWeightsReshapeKernel #0: AVG=0.0410 ms
SchedulerTimer/Conv_1+Conv_1/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #97: AVG=4.4310 ms
SchedulerTimer/Conv_1+Conv_1/BatchNorm/CpuWeightsReshapeKernel #96: AVG=8.6160 ms
SchedulerTimer/Logits/AvgPool/CpuPool2dAssemblyWrapperKernel #98: AVG=0.3090 ms
SchedulerTimer/Logits/Conv2d_1c_1x1/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #99: AVG=1.1030 ms
SchedulerTimer/Predictions/Reshape/CpuReshapeKernel #100: AVG=0.1070 ms
SchedulerTimer/Predictions/Softmax/CpuLogits1DMaxKernel/neon_fp32_logits_1d_max #101: AVG=0.2070 ms
SchedulerTimer/Predictions/Softmax/CpuLogits1DSoftmaxKernel/neon_fp32_softmax_logits_1d #102: AVG=0.0890 ms
SchedulerTimer/expanded_conv/depthwise/depthwise+expanded_conv/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #3: AVG=3.0900 ms
SchedulerTimer/expanded_conv/project/Conv2D+expanded_conv/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #5: AVG=1.3380 ms
SchedulerTimer/expanded_conv/project/Conv2D+expanded_conv/project/BatchNorm/CpuWeightsReshapeKernel #4: AVG=0.0330 ms
SchedulerTimer/expanded_conv_1/depthwise/depthwise+expanded_conv_1/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #8: AVG=3.4450 ms
SchedulerTimer/expanded_conv_1/expand/Conv2D+expanded_conv_1/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_smallK_hybrid_fp32_mla_6x4 #7: AVG=5.2310 ms
SchedulerTimer/expanded_conv_1/expand/Conv2D+expanded_conv_1/expand/BatchNorm/CpuWeightsReshapeKernel #6: AVG=0.0510 ms
SchedulerTimer/expanded_conv_1/project/Conv2D+expanded_conv_1/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #10: AVG=1.6070 ms
SchedulerTimer/expanded_conv_1/project/Conv2D+expanded_conv_1/project/BatchNorm/CpuWeightsReshapeKernel #9: AVG=0.0610 ms
SchedulerTimer/expanded_conv_10/depthwise/depthwise+expanded_conv_10/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #59: AVG=0.2540 ms
SchedulerTimer/expanded_conv_10/expand/Conv2D+expanded_conv_10/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #58: AVG=1.0320 ms
SchedulerTimer/expanded_conv_10/expand/Conv2D+expanded_conv_10/expand/BatchNorm/CpuWeightsReshapeKernel #57: AVG=0.5140 ms
SchedulerTimer/expanded_conv_10/project/Conv2D+expanded_conv_10/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #61: AVG=1.4140 ms
SchedulerTimer/expanded_conv_10/project/Conv2D+expanded_conv_10/project/BatchNorm/CpuWeightsReshapeKernel #60: AVG=0.7220 ms
SchedulerTimer/expanded_conv_11/add/CpuAddKernel/neon_fp32_add #67: AVG=0.0310 ms
SchedulerTimer/expanded_conv_11/depthwise/depthwise+expanded_conv_11/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #64: AVG=0.5550 ms
SchedulerTimer/expanded_conv_11/expand/Conv2D+expanded_conv_11/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #63: AVG=2.3590 ms
SchedulerTimer/expanded_conv_11/expand/Conv2D+expanded_conv_11/expand/BatchNorm/CpuWeightsReshapeKernel #62: AVG=1.1460 ms
SchedulerTimer/expanded_conv_11/project/Conv2D+expanded_conv_11/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #66: AVG=2.2200 ms
SchedulerTimer/expanded_conv_11/project/Conv2D+expanded_conv_11/project/BatchNorm/CpuWeightsReshapeKernel #65: AVG=1.0830 ms
SchedulerTimer/expanded_conv_12/add/CpuAddKernel/neon_fp32_add #73: AVG=0.0280 ms
SchedulerTimer/expanded_conv_12/depthwise/depthwise+expanded_conv_12/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #70: AVG=0.4280 ms
SchedulerTimer/expanded_conv_12/expand/Conv2D+expanded_conv_12/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #69: AVG=2.5380 ms
SchedulerTimer/expanded_conv_12/expand/Conv2D+expanded_conv_12/expand/BatchNorm/CpuWeightsReshapeKernel #68: AVG=1.1480 ms
SchedulerTimer/expanded_conv_12/project/Conv2D+expanded_conv_12/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #72: AVG=2.1850 ms
SchedulerTimer/expanded_conv_12/project/Conv2D+expanded_conv_12/project/BatchNorm/CpuWeightsReshapeKernel #71: AVG=1.0840 ms
SchedulerTimer/expanded_conv_13/depthwise/depthwise+expanded_conv_13/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #76: AVG=0.2700 ms
SchedulerTimer/expanded_conv_13/expand/Conv2D+expanded_conv_13/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #75: AVG=2.4880 ms
SchedulerTimer/expanded_conv_13/expand/Conv2D+expanded_conv_13/expand/BatchNorm/CpuWeightsReshapeKernel #74: AVG=1.1470 ms
SchedulerTimer/expanded_conv_13/project/Conv2D+expanded_conv_13/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #78: AVG=1.0430 ms
SchedulerTimer/expanded_conv_13/project/Conv2D+expanded_conv_13/project/BatchNorm/CpuWeightsReshapeKernel #77: AVG=1.8020 ms
SchedulerTimer/expanded_conv_14/add/CpuAddKernel/neon_fp32_add #84: AVG=0.0230 ms
SchedulerTimer/expanded_conv_14/depthwise/depthwise+expanded_conv_14/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #81: AVG=0.2520 ms
SchedulerTimer/expanded_conv_14/expand/Conv2D+expanded_conv_14/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #80: AVG=1.6480 ms
SchedulerTimer/expanded_conv_14/expand/Conv2D+expanded_conv_14/expand/BatchNorm/CpuWeightsReshapeKernel #79: AVG=3.1190 ms
SchedulerTimer/expanded_conv_14/project/Conv2D+expanded_conv_14/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #83: AVG=1.6420 ms
SchedulerTimer/expanded_conv_14/project/Conv2D+expanded_conv_14/project/BatchNorm/CpuWeightsReshapeKernel #82: AVG=3.0550 ms
SchedulerTimer/expanded_conv_15/add/CpuAddKernel/neon_fp32_add #90: AVG=0.0220 ms
SchedulerTimer/expanded_conv_15/depthwise/depthwise+expanded_conv_15/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #87: AVG=0.2840 ms
SchedulerTimer/expanded_conv_15/expand/Conv2D+expanded_conv_15/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #86: AVG=1.6900 ms
SchedulerTimer/expanded_conv_15/expand/Conv2D+expanded_conv_15/expand/BatchNorm/CpuWeightsReshapeKernel #85: AVG=3.1830 ms
SchedulerTimer/expanded_conv_15/project/Conv2D+expanded_conv_15/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #89: AVG=1.6320 ms
SchedulerTimer/expanded_conv_15/project/Conv2D+expanded_conv_15/project/BatchNorm/CpuWeightsReshapeKernel #88: AVG=2.9910 ms
SchedulerTimer/expanded_conv_16/depthwise/depthwise+expanded_conv_16/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #93: AVG=0.2840 ms
SchedulerTimer/expanded_conv_16/expand/Conv2D+expanded_conv_16/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #92: AVG=1.6430 ms
SchedulerTimer/expanded_conv_16/expand/Conv2D+expanded_conv_16/expand/BatchNorm/CpuWeightsReshapeKernel #91: AVG=3.1740 ms
SchedulerTimer/expanded_conv_16/project/Conv2D+expanded_conv_16/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #95: AVG=3.3840 ms
SchedulerTimer/expanded_conv_16/project/Conv2D+expanded_conv_16/project/BatchNorm/CpuWeightsReshapeKernel #94: AVG=6.1190 ms
SchedulerTimer/expanded_conv_2/add/CpuAddKernel/neon_fp32_add #16: AVG=0.1270 ms
SchedulerTimer/expanded_conv_2/depthwise/depthwise+expanded_conv_2/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #13: AVG=2.8090 ms
SchedulerTimer/expanded_conv_2/expand/Conv2D+expanded_conv_2/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #12: AVG=2.4810 ms
SchedulerTimer/expanded_conv_2/expand/Conv2D+expanded_conv_2/expand/BatchNorm/CpuWeightsReshapeKernel #11: AVG=0.0850 ms
SchedulerTimer/expanded_conv_2/project/Conv2D+expanded_conv_2/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #15: AVG=2.4890 ms
SchedulerTimer/expanded_conv_2/project/Conv2D+expanded_conv_2/project/BatchNorm/CpuWeightsReshapeKernel #14: AVG=0.0850 ms
SchedulerTimer/expanded_conv_3/depthwise/depthwise+expanded_conv_3/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #19: AVG=0.9390 ms
SchedulerTimer/expanded_conv_3/expand/Conv2D+expanded_conv_3/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #18: AVG=2.4380 ms
SchedulerTimer/expanded_conv_3/expand/Conv2D+expanded_conv_3/expand/BatchNorm/CpuWeightsReshapeKernel #17: AVG=0.0880 ms
SchedulerTimer/expanded_conv_3/project/Conv2D+expanded_conv_3/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #21: AVG=0.7390 ms
SchedulerTimer/expanded_conv_3/project/Conv2D+expanded_conv_3/project/BatchNorm/CpuWeightsReshapeKernel #20: AVG=0.1070 ms
SchedulerTimer/expanded_conv_4/add/CpuAddKernel/neon_fp32_add #27: AVG=0.0320 ms
SchedulerTimer/expanded_conv_4/depthwise/depthwise+expanded_conv_4/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #24: AVG=0.5840 ms
SchedulerTimer/expanded_conv_4/expand/Conv2D+expanded_conv_4/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #23: AVG=1.0290 ms
SchedulerTimer/expanded_conv_4/expand/Conv2D+expanded_conv_4/expand/BatchNorm/CpuWeightsReshapeKernel #22: AVG=0.2110 ms
SchedulerTimer/expanded_conv_4/project/Conv2D+expanded_conv_4/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #26: AVG=0.9900 ms
SchedulerTimer/expanded_conv_4/project/Conv2D+expanded_conv_4/project/BatchNorm/CpuWeightsReshapeKernel #25: AVG=0.1250 ms
SchedulerTimer/expanded_conv_5/add/CpuAddKernel/neon_fp32_add #33: AVG=0.0300 ms
SchedulerTimer/expanded_conv_5/depthwise/depthwise+expanded_conv_5/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #30: AVG=0.5200 ms
SchedulerTimer/expanded_conv_5/expand/Conv2D+expanded_conv_5/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #29: AVG=1.1130 ms
SchedulerTimer/expanded_conv_5/expand/Conv2D+expanded_conv_5/expand/BatchNorm/CpuWeightsReshapeKernel #28: AVG=0.1400 ms
SchedulerTimer/expanded_conv_5/project/Conv2D+expanded_conv_5/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #32: AVG=0.9270 ms
SchedulerTimer/expanded_conv_5/project/Conv2D+expanded_conv_5/project/BatchNorm/CpuWeightsReshapeKernel #31: AVG=0.1260 ms
SchedulerTimer/expanded_conv_6/depthwise/depthwise+expanded_conv_6/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #36: AVG=0.2770 ms
SchedulerTimer/expanded_conv_6/expand/Conv2D+expanded_conv_6/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #35: AVG=1.0730 ms
SchedulerTimer/expanded_conv_6/expand/Conv2D+expanded_conv_6/expand/BatchNorm/CpuWeightsReshapeKernel #34: AVG=0.1390 ms
SchedulerTimer/expanded_conv_6/project/Conv2D+expanded_conv_6/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #38: AVG=0.4550 ms
SchedulerTimer/expanded_conv_6/project/Conv2D+expanded_conv_6/project/BatchNorm/CpuWeightsReshapeKernel #37: AVG=0.2470 ms
SchedulerTimer/expanded_conv_7/add/CpuAddKernel/neon_fp32_add #44: AVG=0.0200 ms
SchedulerTimer/expanded_conv_7/depthwise/depthwise+expanded_conv_7/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #41: AVG=0.3410 ms
SchedulerTimer/expanded_conv_7/expand/Conv2D+expanded_conv_7/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #40: AVG=1.0210 ms
SchedulerTimer/expanded_conv_7/expand/Conv2D+expanded_conv_7/expand/BatchNorm/CpuWeightsReshapeKernel #39: AVG=0.5130 ms
SchedulerTimer/expanded_conv_7/project/Conv2D+expanded_conv_7/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #43: AVG=1.0330 ms
SchedulerTimer/expanded_conv_7/project/Conv2D+expanded_conv_7/project/BatchNorm/CpuWeightsReshapeKernel #42: AVG=0.4840 ms
SchedulerTimer/expanded_conv_8/add/CpuAddKernel/neon_fp32_add #50: AVG=0.0160 ms
SchedulerTimer/expanded_conv_8/depthwise/depthwise+expanded_conv_8/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #47: AVG=0.2520 ms
SchedulerTimer/expanded_conv_8/expand/Conv2D+expanded_conv_8/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #46: AVG=1.0650 ms
SchedulerTimer/expanded_conv_8/expand/Conv2D+expanded_conv_8/expand/BatchNorm/CpuWeightsReshapeKernel #45: AVG=0.5130 ms
SchedulerTimer/expanded_conv_8/project/Conv2D+expanded_conv_8/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #49: AVG=0.9930 ms
SchedulerTimer/expanded_conv_8/project/Conv2D+expanded_conv_8/project/BatchNorm/CpuWeightsReshapeKernel #48: AVG=0.4850 ms
SchedulerTimer/expanded_conv_9/add/CpuAddKernel/neon_fp32_add #56: AVG=0.0180 ms
SchedulerTimer/expanded_conv_9/depthwise/depthwise+expanded_conv_9/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #53: AVG=0.2520 ms
SchedulerTimer/expanded_conv_9/expand/Conv2D+expanded_conv_9/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #52: AVG=1.0380 ms
SchedulerTimer/expanded_conv_9/expand/Conv2D+expanded_conv_9/expand/BatchNorm/CpuWeightsReshapeKernel #51: AVG=0.5130 ms
SchedulerTimer/expanded_conv_9/project/Conv2D+expanded_conv_9/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #55: AVG=0.9770 ms
SchedulerTimer/expanded_conv_9/project/Conv2D+expanded_conv_9/project/BatchNorm/CpuWeightsReshapeKernel #54: AVG=0.4840 ms
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 0 second(s)
Alternatively, you could use the TensorFlow Lite benchmark tool with the ArmNN delegate as shown below:
$ LD_LIBRARY_PATH=./armnn/main/:$LD_LIBRARY_PATH ./linux_aarch64_benchmark_model --graph=./wdsr_960.tflite --num_threads=4 --num_runs=120 --warmup_runs=1 --external_delegate_path="armnn/main/libarmnnDelegate.so" --external_delegate_options="backends:CpuAcc"
STARTING!
Log parameter values verbosely: [0]
Min num runs: [120]
Num threads: [4]
Min warmup runs: [1]
Graph: [./wdsr_960.tflite]
#threads used for CPU inference: [4]
External delegate path: [armnn/main/libarmnnDelegate.so]
External delegate options: [backends:CpuAcc]
Loaded model ./wdsr_960.tflite
INFO: TfLiteArmnnDelegate: Created TfLite ArmNN delegate.
EXTERNAL delegate created.
Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 0.011828
Initialized session in 31.93ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=23 first=40037 curr=26812 min=18118 max=40037 avg=22013.8 std=4449
Running benchmark for at least 120 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=120 first=25867 curr=37586 min=24979 max=46407 avg=33329.7 std=2946
Inference timings in us: Init: 31930, First inference: 40037, Warmup (avg): 22013.8, Inference (avg): 33329.7
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=74.5625 overall=404.5
You will need the prebuilt binary for this tool, which can be downloaded from https://www.tensorflow.org/lite/performance/measurement, as well as the ArmNN TFLite delegate binary. For more information about ArmNN please see https://github.com/ARM-software/armnn
Hope this helps
@morgolock Hi, thanks! What hardware did you use for your benchmark? An A53? Does the data layout (NHWC) have a big impact on performance?
Hi @wenhyan
Does the data layout (NHWC) have a big impact on performance?
Yes, for the best performance use NHWC. The layout NCHW is no longer being optimized.
What hardware did you use for your benchmark?
I've used a HiKey960 board.
Hope this helps.
@morgolock thank you very much!
@morgolock Hi, I want to implement a CNN with the Neon API and set the data layout to NHWC.
class NEONCNNExample : public Example
{
public:
    bool do_setup(int argc, char **argv) override
    {
        ARM_COMPUTE_UNUSED(argc);
        ARM_COMPUTE_UNUSED(argv);

        conv0 = std::make_unique<NEDirectConvolutionLayer>();

        /* [Initialize tensors] */
        const TensorShape src_shape(320, 320, 3, 1);
        src.allocator()->init(TensorInfo(src_shape, 1, DataType::F32, DataLayout::NHWC));

        const TensorShape weights_shape_conv0(3, 3, 3, 24);
        const TensorShape biases_shape_conv0(24);
        const TensorShape out_shape_conv0(160, 160, 24, 1);
        weights0.allocator()->init(TensorInfo(weights_shape_conv0, 1, DataType::F32));
        biases0.allocator()->init(TensorInfo(biases_shape_conv0, 1, DataType::F32));
        out_conv0.allocator()->init(TensorInfo(out_shape_conv0, 1, DataType::F32));

        /* [Configure functions] */
        // in: 320x320x3, 3x3 convolution with stride 2 and pad 1, 24 output feature maps (OFM)
        conv0->configure(&src, &weights0, &biases0, &out_conv0, PadStrideInfo(2 /* stride_x */, 2 /* stride_y */, 1 /* pad_x */, 1 /* pad_y */));
        out_conv0.allocator()->allocate();

        /* [Allocate tensors] */
        // Now that the padding requirements are known we can allocate all tensors
        src.allocator()->allocate();
        weights0.allocator()->allocate();
        biases0.allocator()->allocate();

        printf("layout : %d\n", (int)(src.allocator()->info().data_layout()));
        return true;
    }

    void do_run() override
    {
        conv0->run();
    }

private:
    // The src tensor should contain the input image
    Tensor src{};
    // Intermediate tensors used
    Tensor weights0{};
    Tensor biases0{};
    Tensor out_conv0{};
    // Layers
    std::unique_ptr<NEDirectConvolutionLayer> conv0{};
};
Is there any problem? Here is the result:
./benchmark_neon_psd --instruments=NONE --instruments=WALL_CLOCK_TIMER --color-output --iterations=100 --example_args=--layout=NHWC
Version = arm_compute_version=v23.05.1 Build options: {'Werror': '1', 'debug': '0', 'asserts': '1', 'neon': '1', 'opencl': '0', 'os': 'linux', 'arch': 'armv8a', 'benchmark_examples': '1', 'benchmark_tests': '1', 'build_dir': '/home/aey1wx/workspace/RB-PLD/build_acl', 'install_dir': '/home/aey1wx/workspace/RB-PLD/install_acl'} Git hash=b'2b2ffe758dfff7255cf459a7eab26cb8aeff3061'
CommandLine = ./benchmark_neon_psd --instruments=NONE --instruments=WALL_CLOCK_TIMER --color-output --iterations=100 --example_args=--layout=NHWC
Iterations = 100
Running [0] 'Examples/benchmark_neon_psd'
ERROR: in validate_arguments src/cpu/kernels/CpuDirectConv2dKernel.cpp:73: weights->dimension(channel_idx) != src->dimension(channel_idx)
Executed 1 test(s) (0 passed, 0 expected failures, 0 failed, 1 crashed, 0 disabled) in 0 second(s)
If I set the layout to NCHW, there is no error.
Hi @wenhyan
Just setting the layout to NHWC is not enough; you need to specify the shape in NHWC order too. TensorShape lists the dimensions fastest-changing first, so for NHWC the order is (C, W, H, N).
Instead of const TensorShape src_shape(320, 320, 3, 1);
use const TensorShape src_shape(3, 320, 320, 1);
The output tensor needs the same change.
You can try the code below in your example:
const TensorShape src_shape(3, 320, 320, 1);
const TensorShape weights_shape_conv0(3, 3, 3, 24);
const TensorShape biases_shape_conv0(24);
const TensorShape out_shape_conv0(24, 160, 160, 1);

src.allocator()->init(TensorInfo(src_shape, 1, DataType::F32, DataLayout::NHWC));
weights0.allocator()->init(TensorInfo(weights_shape_conv0, 1, DataType::F32));
biases0.allocator()->init(TensorInfo(biases_shape_conv0, 1, DataType::F32));
out_conv0.allocator()->init(TensorInfo(out_shape_conv0, 1, DataType::F32, DataLayout::NHWC));
Hope this helps.
@morgolock Thank you, got it. Another question: what is the difference between GEMM and GEMM_CONV2D in the ConvolutionLayer kernels? Thanks.
Hi @wenhyan
These two functions have different purposes:
NEGEMM computes a general matrix multiplication.
NEGEMMConv2d computes the convolution layer using GEMM as the convolution method.
Please see the documentation for more details:
https://github.com/ARM-software/ComputeLibrary/blob/main/arm_compute/runtime/NEON/functions/NEGEMM.h
https://github.com/ARM-software/ComputeLibrary/blob/main/arm_compute/runtime/NEON/functions/NEGEMMConv2d.h
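To make the contrast concrete, below is a rough sketch of how each function is configured. The shapes are made up for illustration, and the Conv2dInfo argument order is from my reading of the headers, so please treat the linked documentation as authoritative.

#include "arm_compute/runtime/NEON/functions/NEGEMM.h"
#include "arm_compute/runtime/NEON/functions/NEGEMMConv2d.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

int main()
{
    // NEGEMM: plain matrix multiplication, d = alpha * a * b + beta * c
    Tensor a{}, b{}, d{};
    a.allocator()->init(TensorInfo(TensorShape(64U, 32U), 1, DataType::F32)); // 32x64 matrix
    b.allocator()->init(TensorInfo(TensorShape(16U, 64U), 1, DataType::F32)); // 64x16 matrix
    d.allocator()->init(TensorInfo(TensorShape(16U, 32U), 1, DataType::F32)); // 32x16 result
    NEGEMM gemm;
    gemm.configure(&a, &b, nullptr, &d, 1.0f /* alpha */, 0.0f /* beta */);

    // NEGEMMConv2d: a convolution layer lowered to GEMM, so it takes
    // image/weights/bias tensors plus convolution metadata instead.
    Tensor src{}, weights{}, biases{}, dst{};
    src.allocator()->init(TensorInfo(TensorShape(3U, 224U, 224U, 1U), 1, DataType::F32, DataLayout::NHWC));
    weights.allocator()->init(TensorInfo(TensorShape(3U, 3U, 3U, 8U), 1, DataType::F32, DataLayout::NHWC));
    biases.allocator()->init(TensorInfo(TensorShape(8U), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(8U, 224U, 224U, 1U), 1, DataType::F32, DataLayout::NHWC));
    NEGEMMConv2d conv;
    conv.configure(&src, &weights, &biases, &dst,
                   Conv2dInfo(PadStrideInfo(1, 1, 1, 1), Size2D(1U, 1U), ActivationLayerInfo(), false /* fast math */, 1 /* groups */));

    // Tensor allocation and run() calls omitted for brevity.
    return 0;
}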
Hope this helps
@morgolock Sorry! I meant CpuGemmConv2d and CpuGemmDirectConv2d in the convolution layer.
case ConvolutionMethod::GEMM:
{
    auto f = std::make_unique<CpuGemmConv2d>();
    f->configure(input, weights, biases, output, conv_info, weights_info, dilation, act_info, enable_fast_math);
    _function = std::move(f);
    break;
}
case ConvolutionMethod::GEMM_CONV2D:
{
    auto f = std::make_unique<CpuGemmDirectConv2d>();
    f->configure(input, weights, biases, output, info);
    _function = std::move(f);
    break;
}
Which one has better performance, and under what conditions is CpuGemmDirectConv2d chosen?
Hi @wenhyan
You don't have to worry about which method to choose: the operator CpuConv2d selects the best method for the given workload. The method selected depends on the shapes and other details of the tensors passed in.
For more details about the heuristic used to select the convolution method, please see https://github.com/ARM-software/ComputeLibrary/blob/main/src/cpu/operators/CpuConv2d.cpp#L140
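If you are curious which method the heuristic would pick for a given workload, you can query it up front. A minimal sketch using the public NEConvolutionLayer::get_convolution_method entry point (which routes into the CpuConv2d heuristic linked above), fed with the NHWC shapes from your example; the optional weights_info/dilation/act_info arguments are left at their defaults:

#include <cstdio>
#include "arm_compute/runtime/NEON/functions/NEConvolutionLayer.h"

using namespace arm_compute;

int main()
{
    // Shapes from the NHWC example earlier in this thread.
    const TensorInfo src(TensorShape(3U, 320U, 320U, 1U), 1, DataType::F32, DataLayout::NHWC);
    const TensorInfo weights(TensorShape(3U, 3U, 3U, 24U), 1, DataType::F32, DataLayout::NHWC);
    const TensorInfo dst(TensorShape(24U, 160U, 160U, 1U), 1, DataType::F32, DataLayout::NHWC);
    const PadStrideInfo conv_info(2 /* stride_x */, 2 /* stride_y */, 1 /* pad_x */, 1 /* pad_y */);

    // Returns one of ConvolutionMethod::{GEMM, GEMM_CONV2D, DIRECT, WINOGRAD}.
    const ConvolutionMethod method = NEConvolutionLayer::get_convolution_method(&src, &weights, &dst, conv_info);
    printf("selected convolution method: %d\n", static_cast<int>(method));
    return 0;
}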
Hope this helps.
Hi, do you have any benchmark data for a single CPU core and a single thread? Or, if I want to run some benchmarks myself, how should I go about it?