Hi @GGGGxxxxxxxxr
Normally the first iteration is much slower than the other iterations. Make sure you run multiple iterations and measure the time for each one.
for(int j = 0; j < num_iterations; ++j)
{
    gettimeofday(&start, NULL);

    // conv_1 block
    conv_1.run();
    Nact_1.run();
    Npool_1.run();
    LRN_1.run();

    // conv_2 block
    conv_2.run();
    Nact_2.run();
    Npool_2.run();
    LRN_2.run();

    // conv_3 block
    conv_3.run();
    Nact_3.run();

    // conv_4 block
    conv_4.run();
    Nact_4.run();

    // conv_5 block
    conv_5.run();
    Nact_5.run();
    Npool_5.run();

    // fc_8 block
    fc_8.run();
    softmax.run();

    gettimeofday(&end, NULL);
    std::cout << "Iteration " << j << compute_elapsed(start, end) << std::endl;
}
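The loop above calls a compute_elapsed helper that is not shown and is not part of the library; a minimal sketch of such a helper could be:

#include <string>
#include <sys/time.h>

// Hypothetical helper: formats the elapsed time between two timeval samples.
static std::string compute_elapsed(const timeval &start, const timeval &end)
{
    const long us = 1000000L * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
    return ": " + std::to_string(us / 1000.0) + " ms";
}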
Hope this helps.
Yes, I have tried 50 iterations, with the first 5 serving as warm-up. The average computation time of the direct function calls is still much longer than that of the Graph API. I would like to know whether the Graph API applies some special optimizations when calling into the backend sources.
Hi @GGGGxxxxxxxxr
No, there should be no significant difference in execution time between the functions and the Graph API. The Graph API is just a thin layer on top of the functions; it is more expressive and lets you build a network with less code than using the functions directly. All the optimizations are at the kernel level; there are no optimizations at the graph level.
I think the workloads you are comparing may not be exactly the same. Why don't you try comparing a single layer?
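For example, each run() call in the loop above can be bracketed with its own timestamps (a sketch reusing the names from the previous snippet):

gettimeofday(&start, NULL);
conv_1.run();
gettimeofday(&end, NULL);
std::cout << "conv_1" << compute_elapsed(start, end) << std::endl;

gettimeofday(&start, NULL);
Nact_1.run();
gettimeofday(&end, NULL);
std::cout << "Nact_1" << compute_elapsed(start, end) << std::endl;

// ...and so on for the remaining layers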
Hi! Thanks for your reply! I have just made a very simple comparison using the provided examples: the first is examples/neon_cnn.cpp, and I have implemented the same simple CNN via the Graph API, based on graph_vgg16.cpp.
Here is my Graph API implementation of this simple CNN:
#include "arm_compute/graph.h"
#include "support/ToolchainSupport.h"
#include "utils/CommonGraphOptions.h"
#include "utils/GraphUtils.h"
#include "utils/Utils.h"
using namespace arm_compute::utils;
using namespace arm_compute::graph::frontend;
using namespace arm_compute::graph_utils;
/** Example demonstrating how to implement VGG16's network using the Compute Library's graph API */
class GraphVGG16Example : public Example
{
public:
GraphVGG16Example()
: cmd_parser(), common_opts(cmd_parser), common_params(), graph(0, "VGG16")
{
}
std::unique_ptr<arm_compute::graph::ITensorAccessor> dummy()
{
return std::make_unique<DummyAccessor>(1);
}
bool do_setup(int argc, char **argv) override
{
// Parse arguments
cmd_parser.parse(argc, argv);
cmd_parser.validate();
// Consume common parameters
common_params = consume_common_graph_parameters(common_opts);
// Return when help menu is requested
if(common_params.help)
{
cmd_parser.print_help(argv[0]);
return false;
}
// Print parameter values
std::cout << common_params << std::endl;
// Get trainable parameters data path
std::string data_path = common_params.data_path;
// Create a preprocessor object
const std::array<float, 3> mean_rgb{ { 123.68f, 116.779f, 103.939f } };
std::unique_ptr<IPreprocessor> preprocessor = std::make_unique<CaffePreproccessor>(mean_rgb);
// Create input descriptor
const auto operation_layout = DataLayout::NCHW;
const TensorShape tensor_shape = permute_shape(TensorShape(32U, 32U, 1U, common_params.batches), DataLayout::NCHW, operation_layout);
TensorDescriptor input_descriptor = TensorDescriptor(tensor_shape, common_params.data_type).set_layout(operation_layout);
// Set weights trained layout
const DataLayout weights_layout = DataLayout::NCHW;
// Create graph
graph << common_params.target
<< common_params.fast_math_hint
<< InputLayer(input_descriptor, dummy())
// Layer 1
<< ConvolutionLayer(
5U, 5U, 8U,
dummy(),
dummy(),
PadStrideInfo(1, 1, 2, 2))
.set_name("conv1_1")
<< ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU)).set_name("conv1_1/Relu")
<< PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, operation_layout, PadStrideInfo(2, 2, 0, 0))).set_name("pool1")
// Layer 2
<< ConvolutionLayer(
3U, 3U, 16U,
dummy(),
dummy(),
PadStrideInfo(1, 1, 1, 1))
.set_name("conv1_2")
<< ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU)).set_name("conv1_2/Relu")
<< PoolingLayer(PoolingLayerInfo(PoolingType::AVG, 2, operation_layout, PadStrideInfo(2, 2, 0, 0))).set_name("pool2")
//Fully Connected
<< FullyConnectedLayer(
128U,
dummy(),
dummy())
.set_name("fc")
<< ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU)).set_name("conv3_1/Relu")
<< SoftmaxLayer().set_name("prob")
<< OutputLayer(get_output_accessor(common_params, 5));
// Finalize graph
GraphConfig config;
config.num_threads = 1;
config.use_tuner = common_params.enable_tuner;
config.tuner_mode = common_params.tuner_mode;
config.tuner_file = common_params.tuner_file;
config.mlgo_file = common_params.mlgo_file;
config.use_synthetic_type = arm_compute::is_data_type_quantized(common_params.data_type);
config.synthetic_type = common_params.data_type;
graph.finalize(common_params.target, config);
return true;
}
void do_run() override
{
std::cout<<"run cnn model...\n";
// Run graph
struct timeval start, end;
gettimeofday(&start, NULL);
graph.run();
gettimeofday(&end, NULL);
std::cout << std::endl << std::endl << std::endl;
int timeuse = 1000000 * ( end.tv_sec - start.tv_sec ) + end.tv_usec -start.tv_usec;
printf("time: %d us\n", timeuse);
}
private:
CommandLineParser cmd_parser;
CommonGraphOptions common_opts;
CommonGraphParams common_params;
Stream graph;
};
int main(int argc, char **argv)
{
return arm_compute::utils::run_example<GraphVGG16Example>(argc, argv);
}
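A multi-iteration variant of do_run() with warm-up, matching the function-call measurement above, could look like this (a minimal sketch, not the code that produced the numbers below):

void do_run() override
{
    constexpr int warmup = 5;
    constexpr int iters  = 50;
    struct timeval start, end;
    long total_us = 0;
    for(int i = 0; i < warmup + iters; ++i)
    {
        gettimeofday(&start, NULL);
        graph.run();
        gettimeofday(&end, NULL);
        if(i >= warmup) // discard warm-up iterations
        {
            total_us += 1000000L * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
        }
    }
    printf("average time: %ld us\n", total_us / iters);
}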
I have noticed there is still a huge gap between the two results. I am not sure whether I have used the Graph API correctly. I have set all the tensor accessors to dummy accessors.
I really appreciate the help!
The speed benchmark is shown in the attached screenshot.
I have set the operation_layout in the Graph API to NCHW to make sure that both the direct function calls and the Graph API operate on the same tensor layout.
Hi @GGGGxxxxxxxxr
It's difficult to speculate about this without seeing the full source code, build options, and execution commands. I suspect there is a problem in the way the elapsed time is measured. The graph API just calls into the functions to execute the workloads, so there should be no significant performance difference between the two approaches.
To understand what is going on, you will need to measure individual layers and identify where the difference comes from.
If you build the library with benchmark_examples=1, you can use the instruments to look into the performance of the graph examples.
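A build along these lines should work (a sketch; the option names match the build options echoed in the log below, adjust arch and the other flags for your target):

scons neon=1 opencl=0 arch=armv8a debug=1 examples=0 benchmark_examples=1 -j8

The instrumented run below then reports per-kernel timings: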
acl/main# LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args='--target=NEON,--fast-math'
Version = arm_compute_version=v0.0-unreleased Build options: {'standalone': '1', 'test_filter': 'ActivationLayer.cpp', 'opencl': '0', 'neon': '1', 'validation_tests': '1', 'examples': '0', 'debug': '1', 'arch': 'armv8a', 'benchmark_examples': '1'} Git hash=e112ef1cc70bcdc52ded44350e61eb16d74559b3
CommandLine = ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args=--target=NEON,--fast-math
Iterations = 1
Running [0] 'Examples/benchmark_graph_mobilenet_v2'
Threads : 1
Target : Neon
Data type : F32
Data layout : NHWC
Tuner enabled? : false
Cache enabled? : false
Tuner mode : Normal
Tuner file :
MLGO file :
Fast math enabled? : true
SchedulerTimer/Conv+Conv/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #2: AVG=2.3420 ms
SchedulerTimer/Conv+Conv/BatchNorm/CpuIm2ColKernel #1: AVG=12.4220 ms
SchedulerTimer/Conv+Conv/BatchNorm/CpuWeightsReshapeKernel #0: AVG=0.1020 ms
SchedulerTimer/Conv_1+Conv_1/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #97: AVG=3.6140 ms
SchedulerTimer/Conv_1+Conv_1/BatchNorm/CpuWeightsReshapeKernel #96: AVG=15.0640 ms
SchedulerTimer/Logits/AvgPool/CpuPool2dAssemblyWrapperKernel #98: AVG=0.1920 ms
SchedulerTimer/Logits/Conv2d_1c_1x1/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #99: AVG=0.7020 ms
SchedulerTimer/Predictions/Reshape/CpuReshapeKernel #100: AVG=1.1760 ms
SchedulerTimer/Predictions/Softmax/CpuLogits1DMaxKernel/neon_fp32_logits_1d_max #101: AVG=0.0270 ms
SchedulerTimer/Predictions/Softmax/CpuLogits1DSoftmaxKernel/neon_fp32_softmax_logits_1d #102: AVG=0.1710 ms
SchedulerTimer/expanded_conv/depthwise/depthwise+expanded_conv/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #3: AVG=1.8800 ms
SchedulerTimer/expanded_conv/project/Conv2D+expanded_conv/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #5: AVG=1.1850 ms
SchedulerTimer/expanded_conv/project/Conv2D+expanded_conv/project/BatchNorm/CpuWeightsReshapeKernel #4: AVG=0.0640 ms
SchedulerTimer/expanded_conv_1/depthwise/depthwise+expanded_conv_1/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #8: AVG=2.4930 ms
SchedulerTimer/expanded_conv_1/expand/Conv2D+expanded_conv_1/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_smallK_hybrid_fp32_mla_6x4 #7: AVG=5.1230 ms
SchedulerTimer/expanded_conv_1/expand/Conv2D+expanded_conv_1/expand/BatchNorm/CpuWeightsReshapeKernel #6: AVG=0.2100 ms
SchedulerTimer/expanded_conv_1/project/Conv2D+expanded_conv_1/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #10: AVG=1.3540 ms
SchedulerTimer/expanded_conv_1/project/Conv2D+expanded_conv_1/project/BatchNorm/CpuWeightsReshapeKernel #9: AVG=0.1300 ms
SchedulerTimer/expanded_conv_10/depthwise/depthwise+expanded_conv_10/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #59: AVG=0.3390 ms
SchedulerTimer/expanded_conv_10/expand/Conv2D+expanded_conv_10/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #58: AVG=0.9590 ms
SchedulerTimer/expanded_conv_10/expand/Conv2D+expanded_conv_10/expand/BatchNorm/CpuWeightsReshapeKernel #57: AVG=1.3400 ms
SchedulerTimer/expanded_conv_10/project/Conv2D+expanded_conv_10/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #61: AVG=1.3400 ms
SchedulerTimer/expanded_conv_10/project/Conv2D+expanded_conv_10/project/BatchNorm/CpuWeightsReshapeKernel #60: AVG=1.3500 ms
SchedulerTimer/expanded_conv_11/add/CpuAddKernel/neon_fp32_add #67: AVG=0.2580 ms
SchedulerTimer/expanded_conv_11/depthwise/depthwise+expanded_conv_11/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #64: AVG=0.6260 ms
SchedulerTimer/expanded_conv_11/expand/Conv2D+expanded_conv_11/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #63: AVG=2.0200 ms
SchedulerTimer/expanded_conv_11/expand/Conv2D+expanded_conv_11/expand/BatchNorm/CpuWeightsReshapeKernel #62: AVG=2.6450 ms
SchedulerTimer/expanded_conv_11/project/Conv2D+expanded_conv_11/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #66: AVG=1.8870 ms
SchedulerTimer/expanded_conv_11/project/Conv2D+expanded_conv_11/project/BatchNorm/CpuWeightsReshapeKernel #65: AVG=1.9040 ms
SchedulerTimer/expanded_conv_12/add/CpuAddKernel/neon_fp32_add #73: AVG=0.2820 ms
SchedulerTimer/expanded_conv_12/depthwise/depthwise+expanded_conv_12/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #70: AVG=0.5730 ms
SchedulerTimer/expanded_conv_12/expand/Conv2D+expanded_conv_12/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #69: AVG=2.0760 ms
SchedulerTimer/expanded_conv_12/expand/Conv2D+expanded_conv_12/expand/BatchNorm/CpuWeightsReshapeKernel #68: AVG=2.6250 ms
SchedulerTimer/expanded_conv_12/project/Conv2D+expanded_conv_12/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #72: AVG=1.8460 ms
SchedulerTimer/expanded_conv_12/project/Conv2D+expanded_conv_12/project/BatchNorm/CpuWeightsReshapeKernel #71: AVG=1.9330 ms
SchedulerTimer/expanded_conv_13/depthwise/depthwise+expanded_conv_13/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #76: AVG=0.2680 ms
SchedulerTimer/expanded_conv_13/expand/Conv2D+expanded_conv_13/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #75: AVG=2.0940 ms
SchedulerTimer/expanded_conv_13/expand/Conv2D+expanded_conv_13/expand/BatchNorm/CpuWeightsReshapeKernel #74: AVG=2.6290 ms
SchedulerTimer/expanded_conv_13/project/Conv2D+expanded_conv_13/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #78: AVG=0.8540 ms
SchedulerTimer/expanded_conv_13/project/Conv2D+expanded_conv_13/project/BatchNorm/CpuWeightsReshapeKernel #77: AVG=3.1880 ms
SchedulerTimer/expanded_conv_14/add/CpuAddKernel/neon_fp32_add #84: AVG=0.1270 ms
SchedulerTimer/expanded_conv_14/depthwise/depthwise+expanded_conv_14/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #81: AVG=0.2770 ms
SchedulerTimer/expanded_conv_14/expand/Conv2D+expanded_conv_14/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #80: AVG=1.4400 ms
SchedulerTimer/expanded_conv_14/expand/Conv2D+expanded_conv_14/expand/BatchNorm/CpuWeightsReshapeKernel #79: AVG=6.3610 ms
SchedulerTimer/expanded_conv_14/project/Conv2D+expanded_conv_14/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #83: AVG=1.4390 ms
SchedulerTimer/expanded_conv_14/project/Conv2D+expanded_conv_14/project/BatchNorm/CpuWeightsReshapeKernel #82: AVG=5.1470 ms
SchedulerTimer/expanded_conv_15/add/CpuAddKernel/neon_fp32_add #90: AVG=0.1260 ms
SchedulerTimer/expanded_conv_15/depthwise/depthwise+expanded_conv_15/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #87: AVG=0.2780 ms
SchedulerTimer/expanded_conv_15/expand/Conv2D+expanded_conv_15/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #86: AVG=1.4400 ms
SchedulerTimer/expanded_conv_15/expand/Conv2D+expanded_conv_15/expand/BatchNorm/CpuWeightsReshapeKernel #85: AVG=6.3720 ms
SchedulerTimer/expanded_conv_15/project/Conv2D+expanded_conv_15/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #89: AVG=1.4230 ms
SchedulerTimer/expanded_conv_15/project/Conv2D+expanded_conv_15/project/BatchNorm/CpuWeightsReshapeKernel #88: AVG=5.1430 ms
SchedulerTimer/expanded_conv_16/depthwise/depthwise+expanded_conv_16/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #93: AVG=0.2750 ms
SchedulerTimer/expanded_conv_16/expand/Conv2D+expanded_conv_16/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #92: AVG=1.4690 ms
SchedulerTimer/expanded_conv_16/expand/Conv2D+expanded_conv_16/expand/BatchNorm/CpuWeightsReshapeKernel #91: AVG=6.3670 ms
SchedulerTimer/expanded_conv_16/project/Conv2D+expanded_conv_16/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #95: AVG=2.7300 ms
SchedulerTimer/expanded_conv_16/project/Conv2D+expanded_conv_16/project/BatchNorm/CpuWeightsReshapeKernel #94: AVG=10.2600 ms
SchedulerTimer/expanded_conv_2/add/CpuAddKernel/neon_fp32_add #16: AVG=0.9310 ms
SchedulerTimer/expanded_conv_2/depthwise/depthwise+expanded_conv_2/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #13: AVG=2.4990 ms
SchedulerTimer/expanded_conv_2/expand/Conv2D+expanded_conv_2/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #12: AVG=2.4790 ms
SchedulerTimer/expanded_conv_2/expand/Conv2D+expanded_conv_2/expand/BatchNorm/CpuWeightsReshapeKernel #11: AVG=0.3390 ms
SchedulerTimer/expanded_conv_2/project/Conv2D+expanded_conv_2/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #15: AVG=2.1000 ms
SchedulerTimer/expanded_conv_2/project/Conv2D+expanded_conv_2/project/BatchNorm/CpuWeightsReshapeKernel #14: AVG=0.1660 ms
SchedulerTimer/expanded_conv_3/depthwise/depthwise+expanded_conv_3/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #19: AVG=1.0300 ms
SchedulerTimer/expanded_conv_3/expand/Conv2D+expanded_conv_3/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #18: AVG=2.4800 ms
SchedulerTimer/expanded_conv_3/expand/Conv2D+expanded_conv_3/expand/BatchNorm/CpuWeightsReshapeKernel #17: AVG=0.3360 ms
SchedulerTimer/expanded_conv_3/project/Conv2D+expanded_conv_3/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #21: AVG=0.6930 ms
SchedulerTimer/expanded_conv_3/project/Conv2D+expanded_conv_3/project/BatchNorm/CpuWeightsReshapeKernel #20: AVG=0.2140 ms
SchedulerTimer/expanded_conv_4/add/CpuAddKernel/neon_fp32_add #27: AVG=0.3260 ms
SchedulerTimer/expanded_conv_4/depthwise/depthwise+expanded_conv_4/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #24: AVG=0.7980 ms
SchedulerTimer/expanded_conv_4/expand/Conv2D+expanded_conv_4/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #23: AVG=1.0150 ms
SchedulerTimer/expanded_conv_4/expand/Conv2D+expanded_conv_4/expand/BatchNorm/CpuWeightsReshapeKernel #22: AVG=0.4820 ms
SchedulerTimer/expanded_conv_4/project/Conv2D+expanded_conv_4/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #26: AVG=0.8960 ms
SchedulerTimer/expanded_conv_4/project/Conv2D+expanded_conv_4/project/BatchNorm/CpuWeightsReshapeKernel #25: AVG=0.2610 ms
SchedulerTimer/expanded_conv_5/add/CpuAddKernel/neon_fp32_add #33: AVG=0.3260 ms
SchedulerTimer/expanded_conv_5/depthwise/depthwise+expanded_conv_5/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #30: AVG=0.7950 ms
SchedulerTimer/expanded_conv_5/expand/Conv2D+expanded_conv_5/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #29: AVG=1.0570 ms
SchedulerTimer/expanded_conv_5/expand/Conv2D+expanded_conv_5/expand/BatchNorm/CpuWeightsReshapeKernel #28: AVG=0.5120 ms
SchedulerTimer/expanded_conv_5/project/Conv2D+expanded_conv_5/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #32: AVG=0.9380 ms
SchedulerTimer/expanded_conv_5/project/Conv2D+expanded_conv_5/project/BatchNorm/CpuWeightsReshapeKernel #31: AVG=0.2580 ms
SchedulerTimer/expanded_conv_6/depthwise/depthwise+expanded_conv_6/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #36: AVG=0.2630 ms
SchedulerTimer/expanded_conv_6/expand/Conv2D+expanded_conv_6/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #35: AVG=1.0330 ms
SchedulerTimer/expanded_conv_6/expand/Conv2D+expanded_conv_6/expand/BatchNorm/CpuWeightsReshapeKernel #34: AVG=0.4840 ms
SchedulerTimer/expanded_conv_6/project/Conv2D+expanded_conv_6/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #38: AVG=0.4170 ms
SchedulerTimer/expanded_conv_6/project/Conv2D+expanded_conv_6/project/BatchNorm/CpuWeightsReshapeKernel #37: AVG=0.4970 ms
SchedulerTimer/expanded_conv_7/add/CpuAddKernel/neon_fp32_add #44: AVG=0.1770 ms
SchedulerTimer/expanded_conv_7/depthwise/depthwise+expanded_conv_7/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #41: AVG=0.3420 ms
SchedulerTimer/expanded_conv_7/expand/Conv2D+expanded_conv_7/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #40: AVG=0.9150 ms
SchedulerTimer/expanded_conv_7/expand/Conv2D+expanded_conv_7/expand/BatchNorm/CpuWeightsReshapeKernel #39: AVG=1.3650 ms
SchedulerTimer/expanded_conv_7/project/Conv2D+expanded_conv_7/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #43: AVG=0.9360 ms
SchedulerTimer/expanded_conv_7/project/Conv2D+expanded_conv_7/project/BatchNorm/CpuWeightsReshapeKernel #42: AVG=0.9110 ms
SchedulerTimer/expanded_conv_8/add/CpuAddKernel/neon_fp32_add #50: AVG=0.1770 ms
SchedulerTimer/expanded_conv_8/depthwise/depthwise+expanded_conv_8/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #47: AVG=0.3850 ms
SchedulerTimer/expanded_conv_8/expand/Conv2D+expanded_conv_8/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #46: AVG=0.9580 ms
SchedulerTimer/expanded_conv_8/expand/Conv2D+expanded_conv_8/expand/BatchNorm/CpuWeightsReshapeKernel #45: AVG=1.3400 ms
SchedulerTimer/expanded_conv_8/project/Conv2D+expanded_conv_8/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #49: AVG=0.9050 ms
SchedulerTimer/expanded_conv_8/project/Conv2D+expanded_conv_8/project/BatchNorm/CpuWeightsReshapeKernel #48: AVG=0.8890 ms
SchedulerTimer/expanded_conv_9/add/CpuAddKernel/neon_fp32_add #56: AVG=0.2040 ms
SchedulerTimer/expanded_conv_9/depthwise/depthwise+expanded_conv_9/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #53: AVG=0.3510 ms
SchedulerTimer/expanded_conv_9/expand/Conv2D+expanded_conv_9/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #52: AVG=0.9690 ms
SchedulerTimer/expanded_conv_9/expand/Conv2D+expanded_conv_9/expand/BatchNorm/CpuWeightsReshapeKernel #51: AVG=1.3690 ms
SchedulerTimer/expanded_conv_9/project/Conv2D+expanded_conv_9/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #55: AVG=0.9100 ms
SchedulerTimer/expanded_conv_9/project/Conv2D+expanded_conv_9/project/BatchNorm/CpuWeightsReshapeKernel #54: AVG=0.8850 ms
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 0 second(s)
Please note that NHWC performance is better than NCHW.
Hope this helps.
That really helps me! This benchmarking tool is very useful for performance debugging! Thanks!
Hi!
I have been trying to implement neural networks on Android using the Arm Compute Library. The issue is: when I build a network such as AlexNet with direct function calls, the inference time is much longer than that of examples/graph_alexnet.cpp.
Here is part of the code showing how I implement AlexNet with direct function calls:
// Tensor definition
constexpr unsigned int input_width  = 227;
constexpr unsigned int input_height = 227;
constexpr unsigned int input_fm     = 3;
....
....
// Run the model
gettimeofday(&start, NULL);
conv_1.run();
Nact_1.run();
Npool_1.run();
LRN_1.run();
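For reference, the configure-once/run-many pattern I follow for the first AlexNet convolution looks roughly like this (a sketch in the style of examples/neon_cnn.cpp, not my exact code; AlexNet conv1 uses 96 11x11 filters with stride 4):

#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/Tensor.h"
using namespace arm_compute;

Tensor src, conv1_w, conv1_b, conv1_out;
src.allocator()->init(TensorInfo(TensorShape(input_width, input_height, input_fm), 1, DataType::F32));
conv1_w.allocator()->init(TensorInfo(TensorShape(11U, 11U, 3U, 96U), 1, DataType::F32));
conv1_b.allocator()->init(TensorInfo(TensorShape(96U), 1, DataType::F32));
conv1_out.allocator()->init(TensorInfo(TensorShape(55U, 55U, 96U), 1, DataType::F32));

NEConvolutionLayer conv_1;
conv_1.configure(&src, &conv1_w, &conv1_b, &conv1_out, PadStrideInfo(4, 4, 0, 0));

src.allocator()->allocate();
conv1_w.allocator()->allocate();
conv1_b.allocator()->allocate();
conv1_out.allocator()->allocate();

conv_1.run(); // configure() runs once; only run() belongs in the timed loop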
I ran this on my Realme X5 Pro; with my direct function call implementation, one inference takes about 70 ms on average, while the graph_alexnet benchmark is much faster.
Looking through the sources in src/cpu/operators, I noticed that both methods call the same backend code, so why is there such a huge difference in speed?
Thanks!