ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

Huge Performance Differences between Graph API and direct function call #1009

Closed GGGGxxxxxxxxr closed 1 year ago

GGGGxxxxxxxxr commented 1 year ago

Hi!

I have been implementing neural networks on Android with the Arm Compute Library. When I build a network such as AlexNet using direct function calls, the inference time is much longer than that of examples/graph_alexnet.cpp.

Here is part of the code showing how I implement AlexNet with direct function calls:

// Tensor Definition
constexpr unsigned int input_width  = 227;
constexpr unsigned int input_height = 227;
constexpr unsigned int input_fm     = 3;

const TensorShape input_shape(input_width, input_height, input_fm);
input.allocator()->init(TensorInfo(input_shape, 1, DataType::F32));

//init_conv_1
constexpr unsigned int conv_1_kernel_x = 11;
constexpr unsigned int conv_1_kernel_y = 11;
constexpr unsigned int conv_1_fm       = 96;
constexpr unsigned int conv_1_out      = 55;

const TensorShape conv_1_weights_shape(conv_1_kernel_x, conv_1_kernel_y, input_shape.z(), conv_1_fm);
const TensorShape conv_1_biases_shape(conv_1_weights_shape[3]);
const TensorShape conv_1_out_shape(conv_1_out, conv_1_out, conv_1_weights_shape[3]);

weights_1.allocator()->init(TensorInfo(conv_1_weights_shape, 1, DataType::F32));
biases_1.allocator()->init(TensorInfo(conv_1_biases_shape, 1, DataType::F32));
out_1.allocator()->init(TensorInfo(conv_1_out_shape, 1, DataType::F32));

act_1.allocator()->init(TensorInfo(conv_1_out_shape, 1, DataType::F32));

....

//Layer Definition
//in: 227 * 227 * 3, kernel: 11 * 11 * 3 * 96, out: 55 * 55 * 96
conv_1.configure(&input, &weights_1, &biases_1, &out_1, PadStrideInfo(4, 4, 0, 0));
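
(Sanity check on the shapes: with an 11x11 kernel, stride 4 and no padding, the output width is (227 - 11) / 4 + 1 = 55, matching conv_1_out above.)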

....

//Run the model
gettimeofday(&start, NULL);
conv_1.run();
Nact_1.run();
Npool_1.run();
LRN_1.run();

    //conv_2
    // conv_2.run();
    // Nact_2.run();
    conv_2.run();
    Nact_2.run();
    Npool_2.run();
    LRN_2.run();

    //conv_3
    conv_3.run();
    Nact_3.run();

    //conv_4
    conv_4.run();
    Nact_4.run();

    //conv_5
    // conv_5.run();
    // Nact_5.run();
    conv_5.run();
    Nact_5.run();
    Npool_5.run();

    //fc_8
    fc_8.run();
    softmax.run();
    gettimeofday(&end, NULL);

I have run this on my Realme X5 Pro: with my direct function call implementation, one inference takes 70 ms on average, but the graph_alexnet benchmark is much faster.

I have also added comments in the sources under src/cpu/operators and confirmed that both methods call into the same backend kernels, so why is there such a huge difference in performance?

Thanks!

morgolock commented 1 year ago

Hi @GGGGxxxxxxxxr

Normally the first iteration is much slower than the other iterations. Make sure you run multiple iterations and measure the time for each one.

for (int j=0; j < num_iterations; ++j)
{
    gettimeofday(&start, NULL);
    conv_1.run();
    Nact_1.run();
    Npool_1.run();
    LRN_1.run();

    //conv_2
    // conv_2.run();
    // Nact_2.run();
    conv_2.run();
    Nact_2.run();
    Npool_2.run();
    LRN_2.run();

    //conv_3
    conv_3.run();
    Nact_3.run();

    //conv_4
    conv_4.run();
    Nact_4.run();

    //conv_5
    // conv_5.run();
    // Nact_5.run();
    conv_5.run();
    Nact_5.run();
    Npool_5.run();

    //fc_8
    fc_8.run();
    softmax.run();
    gettimeofday(&end, NULL);
   std::cout << "Iteration " << j << compute_elapsed(start,end) << std::endl;
}
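
Note that compute_elapsed above is just a placeholder name, not an ACL utility; a minimal sketch of such a helper, assuming it should return milliseconds, could be:

#include <sys/time.h>

// Hypothetical helper matching the call above: elapsed time in milliseconds
// between two gettimeofday() samples.
double compute_elapsed(const timeval &start, const timeval &end)
{
    return (end.tv_sec - start.tv_sec) * 1000.0 + (end.tv_usec - start.tv_usec) / 1000.0;
}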

Hope this helps.

GGGGxxxxxxxxr commented 1 year ago

Yes, I have tried 50 iterations, with the first 5 serving as warm-up. The average computation time of the direct function calls is still much longer than the Graph API's. I would like to know whether the Graph API applies some special optimizations when calling into the src backend.

morgolock commented 1 year ago

Hi @GGGGxxxxxxxxr

No, there should be no significant difference in execution time between the functions and the Graph API. The Graph API is just a thin layer on top of the functions; it is more expressive and lets you create a network with less code than using the functions directly. All the optimizations are at the kernel level; there are no optimizations at the graph level.

I think the workloads you are comparing may not be exactly the same. Why don't you try comparing a single layer?
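
On the function-call side, a minimal single-layer timing sketch (not taken from the library's examples; it reuses the conv_1 shapes from the first post, leaves tensor contents uninitialised since only timing matters, and excludes the first warm-up run) might look like this:

#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/Tensor.h"
#include <chrono>
#include <iostream>

using namespace arm_compute;

int main()
{
    // Same shapes as conv_1 in the first post: 227x227x3 input, 11x11x3x96 weights.
    Tensor src, weights, biases, dst;
    src.allocator()->init(TensorInfo(TensorShape(227U, 227U, 3U), 1, DataType::F32));
    weights.allocator()->init(TensorInfo(TensorShape(11U, 11U, 3U, 96U), 1, DataType::F32));
    biases.allocator()->init(TensorInfo(TensorShape(96U), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(55U, 55U, 96U), 1, DataType::F32));

    // Configure before allocating, as in the library's runtime examples.
    NEConvolutionLayer conv;
    conv.configure(&src, &weights, &biases, &dst, PadStrideInfo(4, 4, 0, 0));

    src.allocator()->allocate();
    weights.allocator()->allocate();
    biases.allocator()->allocate();
    dst.allocator()->allocate();

    conv.run(); // warm-up: the first run pays one-off setup costs

    for(int i = 0; i < 10; ++i)
    {
        const auto t0 = std::chrono::steady_clock::now();
        conv.run();
        const auto t1 = std::chrono::steady_clock::now();
        std::cout << "run " << i << ": "
                  << std::chrono::duration<double, std::milli>(t1 - t0).count() << " ms\n";
    }
    return 0;
}

Timing the equivalent single-layer graph the same way should make it clear whether the gap comes from the layer itself or from how the full networks are measured.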

GGGGxxxxxxxxr commented 1 year ago

Hi! Thanks for your reply! I have just made a very simple comparison using the provided examples: the first is examples/neon_cnn.cpp, and I have reimplemented the same simple CNN via the Graph API, based on the graph_vgg16.cpp file.

Here is my implementation for this simple CNN via Graph API:

#include "arm_compute/graph.h"
#include "support/ToolchainSupport.h"
#include "utils/CommonGraphOptions.h"
#include "utils/GraphUtils.h"
#include "utils/Utils.h"

using namespace arm_compute::utils;
using namespace arm_compute::graph::frontend;
using namespace arm_compute::graph_utils;

/** Example implementing the simple CNN from neon_cnn.cpp using the Compute Library's graph API (adapted from graph_vgg16.cpp) */
class GraphVGG16Example : public Example
{
public:
    GraphVGG16Example()
        : cmd_parser(), common_opts(cmd_parser), common_params(), graph(0, "VGG16")
    {
    }

    std::unique_ptr<arm_compute::graph::ITensorAccessor> dummy() 
    {
        return std::make_unique<DummyAccessor>(1);
    }

    bool do_setup(int argc, char **argv) override
    {
        // Parse arguments
        cmd_parser.parse(argc, argv);
        cmd_parser.validate();

        // Consume common parameters
        common_params = consume_common_graph_parameters(common_opts);

        // Return when help menu is requested
        if(common_params.help)
        {
            cmd_parser.print_help(argv[0]);
            return false;
        }

        // Print parameter values
        std::cout << common_params << std::endl;

        // Get trainable parameters data path
        std::string data_path = common_params.data_path;

        // Create a preprocessor object
        const std::array<float, 3> mean_rgb{ { 123.68f, 116.779f, 103.939f } };
        std::unique_ptr<IPreprocessor> preprocessor = std::make_unique<CaffePreproccessor>(mean_rgb);

        // Create input descriptor
        const auto        operation_layout = DataLayout::NCHW;
        const TensorShape tensor_shape     = permute_shape(TensorShape(32U, 32U, 1U, common_params.batches), DataLayout::NCHW, operation_layout);
        TensorDescriptor  input_descriptor = TensorDescriptor(tensor_shape, common_params.data_type).set_layout(operation_layout);

        // Set weights trained layout
        const DataLayout weights_layout = DataLayout::NCHW;

        // Create graph
        graph << common_params.target
              << common_params.fast_math_hint
              << InputLayer(input_descriptor, dummy())
              // Layer 1
              << ConvolutionLayer(
                  5U, 5U, 8U,
                  dummy(),
                  dummy(),
                  PadStrideInfo(1, 1, 2, 2))
              .set_name("conv1_1")

              << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU)).set_name("conv1_1/Relu")
              << PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, operation_layout, PadStrideInfo(2, 2, 0, 0))).set_name("pool1")
              // Layer 2
              << ConvolutionLayer(
                  3U, 3U, 16U,
                  dummy(),
                  dummy(),
                  PadStrideInfo(1, 1, 1, 1))
              .set_name("conv1_2")
              << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU)).set_name("conv1_2/Relu")
              << PoolingLayer(PoolingLayerInfo(PoolingType::AVG, 2, operation_layout, PadStrideInfo(2, 2, 0, 0))).set_name("pool2")
            //Fully Connected
              << FullyConnectedLayer(
                  128U,
                  dummy(),
                  dummy())
              .set_name("fc")
              << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU)).set_name("conv3_1/Relu")
              << SoftmaxLayer().set_name("prob")
              << OutputLayer(get_output_accessor(common_params, 5));

        // Finalize graph
        GraphConfig config;
        config.num_threads        = 1;
        config.use_tuner          = common_params.enable_tuner;
        config.tuner_mode         = common_params.tuner_mode;
        config.tuner_file         = common_params.tuner_file;
        config.mlgo_file          = common_params.mlgo_file;
        config.use_synthetic_type = arm_compute::is_data_type_quantized(common_params.data_type);
        config.synthetic_type     = common_params.data_type;

        graph.finalize(common_params.target, config);

        return true;
    }

    void do_run() override
    {
        std::cout << "run cnn model...\n";

        // Run graph
        struct timeval start, end;
        gettimeofday(&start, NULL);
        graph.run();
        gettimeofday(&end, NULL);

        std::cout << std::endl << std::endl << std::endl;
        const long timeuse = 1000000L * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);

        printf("time: %ld us\n", timeuse);
    }

private:
    CommandLineParser  cmd_parser;
    CommonGraphOptions common_opts;
    CommonGraphParams  common_params;
    Stream             graph;
};

int main(int argc, char **argv)
{
    return arm_compute::utils::run_example<GraphVGG16Example>(argc, argv);
}

I have noticed there is still a huge gap between these two results. I am not sure whether I am using the Graph API correctly. I have set all the tensor accessors to dummy accessors.

I really appreciate the help!

GGGGxxxxxxxxr commented 1 year ago

The speed benchmark results are shown in the attached screenshot.

I have changed operation_layout in the Graph API example to NCHW to make sure both the direct function calls and the Graph API operate on the same tensor layout.

morgolock commented 1 year ago

Hi @GGGGxxxxxxxxr

It's difficult to speculate about this without seeing the full source code, build options and execution commands. I suspect there is a problem in the way the elapsed time is measured. The Graph API just calls into the functions to execute the workloads, so there should be no significant performance difference between the two approaches.

In order to understand what is going on you will need to measure individual layers and identify where the difference is coming from.

If you build the library with benchmark_examples=1, you can then use the instruments to look into the graph example's performance:

acl/main# LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args='--target=NEON,--fast-math'
Version = arm_compute_version=v0.0-unreleased Build options: {'standalone': '1', 'test_filter': 'ActivationLayer.cpp', 'opencl': '0', 'neon': '1', 'validation_tests': '1', 'examples': '0', 'debug': '1', 'arch': 'armv8a', 'benchmark_examples': '1'} Git hash=e112ef1cc70bcdc52ded44350e61eb16d74559b3
CommandLine = ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args=--target=NEON,--fast-math 
Iterations = 1
Running [0] 'Examples/benchmark_graph_mobilenet_v2'
Threads : 1
Target : Neon
Data type : F32
Data layout : NHWC
Tuner enabled? : false
Cache enabled? : false
Tuner mode : Normal
Tuner file : 
MLGO file : 
Fast math enabled? : true

  SchedulerTimer/Conv+Conv/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #2:    AVG=2.3420 ms
  SchedulerTimer/Conv+Conv/BatchNorm/CpuIm2ColKernel #1:    AVG=12.4220 ms
  SchedulerTimer/Conv+Conv/BatchNorm/CpuWeightsReshapeKernel #0:    AVG=0.1020 ms
  SchedulerTimer/Conv_1+Conv_1/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #97:    AVG=3.6140 ms
  SchedulerTimer/Conv_1+Conv_1/BatchNorm/CpuWeightsReshapeKernel #96:    AVG=15.0640 ms
  SchedulerTimer/Logits/AvgPool/CpuPool2dAssemblyWrapperKernel #98:    AVG=0.1920 ms
  SchedulerTimer/Logits/Conv2d_1c_1x1/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #99:    AVG=0.7020 ms
  SchedulerTimer/Predictions/Reshape/CpuReshapeKernel #100:    AVG=1.1760 ms
  SchedulerTimer/Predictions/Softmax/CpuLogits1DMaxKernel/neon_fp32_logits_1d_max #101:    AVG=0.0270 ms
  SchedulerTimer/Predictions/Softmax/CpuLogits1DSoftmaxKernel/neon_fp32_softmax_logits_1d #102:    AVG=0.1710 ms
  SchedulerTimer/expanded_conv/depthwise/depthwise+expanded_conv/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #3:    AVG=1.8800 ms
  SchedulerTimer/expanded_conv/project/Conv2D+expanded_conv/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #5:    AVG=1.1850 ms
  SchedulerTimer/expanded_conv/project/Conv2D+expanded_conv/project/BatchNorm/CpuWeightsReshapeKernel #4:    AVG=0.0640 ms
  SchedulerTimer/expanded_conv_1/depthwise/depthwise+expanded_conv_1/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #8:    AVG=2.4930 ms
  SchedulerTimer/expanded_conv_1/expand/Conv2D+expanded_conv_1/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_smallK_hybrid_fp32_mla_6x4 #7:    AVG=5.1230 ms
  SchedulerTimer/expanded_conv_1/expand/Conv2D+expanded_conv_1/expand/BatchNorm/CpuWeightsReshapeKernel #6:    AVG=0.2100 ms
  SchedulerTimer/expanded_conv_1/project/Conv2D+expanded_conv_1/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #10:    AVG=1.3540 ms
  SchedulerTimer/expanded_conv_1/project/Conv2D+expanded_conv_1/project/BatchNorm/CpuWeightsReshapeKernel #9:    AVG=0.1300 ms
  SchedulerTimer/expanded_conv_10/depthwise/depthwise+expanded_conv_10/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #59:    AVG=0.3390 ms
  SchedulerTimer/expanded_conv_10/expand/Conv2D+expanded_conv_10/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #58:    AVG=0.9590 ms
  SchedulerTimer/expanded_conv_10/expand/Conv2D+expanded_conv_10/expand/BatchNorm/CpuWeightsReshapeKernel #57:    AVG=1.3400 ms
  SchedulerTimer/expanded_conv_10/project/Conv2D+expanded_conv_10/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #61:    AVG=1.3400 ms
  SchedulerTimer/expanded_conv_10/project/Conv2D+expanded_conv_10/project/BatchNorm/CpuWeightsReshapeKernel #60:    AVG=1.3500 ms
  SchedulerTimer/expanded_conv_11/add/CpuAddKernel/neon_fp32_add #67:    AVG=0.2580 ms
  SchedulerTimer/expanded_conv_11/depthwise/depthwise+expanded_conv_11/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #64:    AVG=0.6260 ms
  SchedulerTimer/expanded_conv_11/expand/Conv2D+expanded_conv_11/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #63:    AVG=2.0200 ms
  SchedulerTimer/expanded_conv_11/expand/Conv2D+expanded_conv_11/expand/BatchNorm/CpuWeightsReshapeKernel #62:    AVG=2.6450 ms
  SchedulerTimer/expanded_conv_11/project/Conv2D+expanded_conv_11/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #66:    AVG=1.8870 ms
  SchedulerTimer/expanded_conv_11/project/Conv2D+expanded_conv_11/project/BatchNorm/CpuWeightsReshapeKernel #65:    AVG=1.9040 ms
  SchedulerTimer/expanded_conv_12/add/CpuAddKernel/neon_fp32_add #73:    AVG=0.2820 ms
  SchedulerTimer/expanded_conv_12/depthwise/depthwise+expanded_conv_12/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #70:    AVG=0.5730 ms
  SchedulerTimer/expanded_conv_12/expand/Conv2D+expanded_conv_12/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #69:    AVG=2.0760 ms
  SchedulerTimer/expanded_conv_12/expand/Conv2D+expanded_conv_12/expand/BatchNorm/CpuWeightsReshapeKernel #68:    AVG=2.6250 ms
  SchedulerTimer/expanded_conv_12/project/Conv2D+expanded_conv_12/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #72:    AVG=1.8460 ms
  SchedulerTimer/expanded_conv_12/project/Conv2D+expanded_conv_12/project/BatchNorm/CpuWeightsReshapeKernel #71:    AVG=1.9330 ms
  SchedulerTimer/expanded_conv_13/depthwise/depthwise+expanded_conv_13/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #76:    AVG=0.2680 ms
  SchedulerTimer/expanded_conv_13/expand/Conv2D+expanded_conv_13/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #75:    AVG=2.0940 ms
  SchedulerTimer/expanded_conv_13/expand/Conv2D+expanded_conv_13/expand/BatchNorm/CpuWeightsReshapeKernel #74:    AVG=2.6290 ms
  SchedulerTimer/expanded_conv_13/project/Conv2D+expanded_conv_13/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #78:    AVG=0.8540 ms
  SchedulerTimer/expanded_conv_13/project/Conv2D+expanded_conv_13/project/BatchNorm/CpuWeightsReshapeKernel #77:    AVG=3.1880 ms
  SchedulerTimer/expanded_conv_14/add/CpuAddKernel/neon_fp32_add #84:    AVG=0.1270 ms
  SchedulerTimer/expanded_conv_14/depthwise/depthwise+expanded_conv_14/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #81:    AVG=0.2770 ms
  SchedulerTimer/expanded_conv_14/expand/Conv2D+expanded_conv_14/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #80:    AVG=1.4400 ms
  SchedulerTimer/expanded_conv_14/expand/Conv2D+expanded_conv_14/expand/BatchNorm/CpuWeightsReshapeKernel #79:    AVG=6.3610 ms
  SchedulerTimer/expanded_conv_14/project/Conv2D+expanded_conv_14/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #83:    AVG=1.4390 ms
  SchedulerTimer/expanded_conv_14/project/Conv2D+expanded_conv_14/project/BatchNorm/CpuWeightsReshapeKernel #82:    AVG=5.1470 ms
  SchedulerTimer/expanded_conv_15/add/CpuAddKernel/neon_fp32_add #90:    AVG=0.1260 ms
  SchedulerTimer/expanded_conv_15/depthwise/depthwise+expanded_conv_15/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #87:    AVG=0.2780 ms
  SchedulerTimer/expanded_conv_15/expand/Conv2D+expanded_conv_15/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #86:    AVG=1.4400 ms
  SchedulerTimer/expanded_conv_15/expand/Conv2D+expanded_conv_15/expand/BatchNorm/CpuWeightsReshapeKernel #85:    AVG=6.3720 ms
  SchedulerTimer/expanded_conv_15/project/Conv2D+expanded_conv_15/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #89:    AVG=1.4230 ms
  SchedulerTimer/expanded_conv_15/project/Conv2D+expanded_conv_15/project/BatchNorm/CpuWeightsReshapeKernel #88:    AVG=5.1430 ms
  SchedulerTimer/expanded_conv_16/depthwise/depthwise+expanded_conv_16/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #93:    AVG=0.2750 ms
  SchedulerTimer/expanded_conv_16/expand/Conv2D+expanded_conv_16/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #92:    AVG=1.4690 ms
  SchedulerTimer/expanded_conv_16/expand/Conv2D+expanded_conv_16/expand/BatchNorm/CpuWeightsReshapeKernel #91:    AVG=6.3670 ms
  SchedulerTimer/expanded_conv_16/project/Conv2D+expanded_conv_16/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #95:    AVG=2.7300 ms
  SchedulerTimer/expanded_conv_16/project/Conv2D+expanded_conv_16/project/BatchNorm/CpuWeightsReshapeKernel #94:    AVG=10.2600 ms
  SchedulerTimer/expanded_conv_2/add/CpuAddKernel/neon_fp32_add #16:    AVG=0.9310 ms
  SchedulerTimer/expanded_conv_2/depthwise/depthwise+expanded_conv_2/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #13:    AVG=2.4990 ms
  SchedulerTimer/expanded_conv_2/expand/Conv2D+expanded_conv_2/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #12:    AVG=2.4790 ms
  SchedulerTimer/expanded_conv_2/expand/Conv2D+expanded_conv_2/expand/BatchNorm/CpuWeightsReshapeKernel #11:    AVG=0.3390 ms
  SchedulerTimer/expanded_conv_2/project/Conv2D+expanded_conv_2/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #15:    AVG=2.1000 ms
  SchedulerTimer/expanded_conv_2/project/Conv2D+expanded_conv_2/project/BatchNorm/CpuWeightsReshapeKernel #14:    AVG=0.1660 ms
  SchedulerTimer/expanded_conv_3/depthwise/depthwise+expanded_conv_3/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #19:    AVG=1.0300 ms
  SchedulerTimer/expanded_conv_3/expand/Conv2D+expanded_conv_3/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #18:    AVG=2.4800 ms
  SchedulerTimer/expanded_conv_3/expand/Conv2D+expanded_conv_3/expand/BatchNorm/CpuWeightsReshapeKernel #17:    AVG=0.3360 ms
  SchedulerTimer/expanded_conv_3/project/Conv2D+expanded_conv_3/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #21:    AVG=0.6930 ms
  SchedulerTimer/expanded_conv_3/project/Conv2D+expanded_conv_3/project/BatchNorm/CpuWeightsReshapeKernel #20:    AVG=0.2140 ms
  SchedulerTimer/expanded_conv_4/add/CpuAddKernel/neon_fp32_add #27:    AVG=0.3260 ms
  SchedulerTimer/expanded_conv_4/depthwise/depthwise+expanded_conv_4/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #24:    AVG=0.7980 ms
  SchedulerTimer/expanded_conv_4/expand/Conv2D+expanded_conv_4/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #23:    AVG=1.0150 ms
  SchedulerTimer/expanded_conv_4/expand/Conv2D+expanded_conv_4/expand/BatchNorm/CpuWeightsReshapeKernel #22:    AVG=0.4820 ms
  SchedulerTimer/expanded_conv_4/project/Conv2D+expanded_conv_4/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #26:    AVG=0.8960 ms
  SchedulerTimer/expanded_conv_4/project/Conv2D+expanded_conv_4/project/BatchNorm/CpuWeightsReshapeKernel #25:    AVG=0.2610 ms
  SchedulerTimer/expanded_conv_5/add/CpuAddKernel/neon_fp32_add #33:    AVG=0.3260 ms
  SchedulerTimer/expanded_conv_5/depthwise/depthwise+expanded_conv_5/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #30:    AVG=0.7950 ms
  SchedulerTimer/expanded_conv_5/expand/Conv2D+expanded_conv_5/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #29:    AVG=1.0570 ms
  SchedulerTimer/expanded_conv_5/expand/Conv2D+expanded_conv_5/expand/BatchNorm/CpuWeightsReshapeKernel #28:    AVG=0.5120 ms
  SchedulerTimer/expanded_conv_5/project/Conv2D+expanded_conv_5/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #32:    AVG=0.9380 ms
  SchedulerTimer/expanded_conv_5/project/Conv2D+expanded_conv_5/project/BatchNorm/CpuWeightsReshapeKernel #31:    AVG=0.2580 ms
  SchedulerTimer/expanded_conv_6/depthwise/depthwise+expanded_conv_6/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #36:    AVG=0.2630 ms
  SchedulerTimer/expanded_conv_6/expand/Conv2D+expanded_conv_6/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #35:    AVG=1.0330 ms
  SchedulerTimer/expanded_conv_6/expand/Conv2D+expanded_conv_6/expand/BatchNorm/CpuWeightsReshapeKernel #34:    AVG=0.4840 ms
  SchedulerTimer/expanded_conv_6/project/Conv2D+expanded_conv_6/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #38:    AVG=0.4170 ms
  SchedulerTimer/expanded_conv_6/project/Conv2D+expanded_conv_6/project/BatchNorm/CpuWeightsReshapeKernel #37:    AVG=0.4970 ms
  SchedulerTimer/expanded_conv_7/add/CpuAddKernel/neon_fp32_add #44:    AVG=0.1770 ms
  SchedulerTimer/expanded_conv_7/depthwise/depthwise+expanded_conv_7/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #41:    AVG=0.3420 ms
  SchedulerTimer/expanded_conv_7/expand/Conv2D+expanded_conv_7/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #40:    AVG=0.9150 ms
  SchedulerTimer/expanded_conv_7/expand/Conv2D+expanded_conv_7/expand/BatchNorm/CpuWeightsReshapeKernel #39:    AVG=1.3650 ms
  SchedulerTimer/expanded_conv_7/project/Conv2D+expanded_conv_7/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #43:    AVG=0.9360 ms
  SchedulerTimer/expanded_conv_7/project/Conv2D+expanded_conv_7/project/BatchNorm/CpuWeightsReshapeKernel #42:    AVG=0.9110 ms
  SchedulerTimer/expanded_conv_8/add/CpuAddKernel/neon_fp32_add #50:    AVG=0.1770 ms
  SchedulerTimer/expanded_conv_8/depthwise/depthwise+expanded_conv_8/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #47:    AVG=0.3850 ms
  SchedulerTimer/expanded_conv_8/expand/Conv2D+expanded_conv_8/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #46:    AVG=0.9580 ms
  SchedulerTimer/expanded_conv_8/expand/Conv2D+expanded_conv_8/expand/BatchNorm/CpuWeightsReshapeKernel #45:    AVG=1.3400 ms
  SchedulerTimer/expanded_conv_8/project/Conv2D+expanded_conv_8/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #49:    AVG=0.9050 ms
  SchedulerTimer/expanded_conv_8/project/Conv2D+expanded_conv_8/project/BatchNorm/CpuWeightsReshapeKernel #48:    AVG=0.8890 ms
  SchedulerTimer/expanded_conv_9/add/CpuAddKernel/neon_fp32_add #56:    AVG=0.2040 ms
  SchedulerTimer/expanded_conv_9/depthwise/depthwise+expanded_conv_9/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #53:    AVG=0.3510 ms
  SchedulerTimer/expanded_conv_9/expand/Conv2D+expanded_conv_9/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #52:    AVG=0.9690 ms
  SchedulerTimer/expanded_conv_9/expand/Conv2D+expanded_conv_9/expand/BatchNorm/CpuWeightsReshapeKernel #51:    AVG=1.3690 ms
  SchedulerTimer/expanded_conv_9/project/Conv2D+expanded_conv_9/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #55:    AVG=0.9100 ms
  SchedulerTimer/expanded_conv_9/project/Conv2D+expanded_conv_9/project/BatchNorm/CpuWeightsReshapeKernel #54:    AVG=0.8850 ms
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 0 second(s)

Please note that NHWC performance is better than NCHW.
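
In the Graph API code you posted, switching is a one-line change to operation_layout (the permute_shape call then rearranges the NCHW-specified shape into the chosen layout):

// In do_setup() above: prefer NHWC over NCHW for better CPU performance.
const auto        operation_layout = DataLayout::NHWC;
const TensorShape tensor_shape     = permute_shape(TensorShape(32U, 32U, 1U, common_params.batches),
                                                   DataLayout::NCHW, operation_layout);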

Hope this helps.

GGGGxxxxxxxxr commented 1 year ago

That really helps! This benchmarking tool is very useful for performance debugging. Thanks!