Closed: poltomo closed this issue 1 month ago.
Hi @poltomo
The first iteration is costly because ACL performs various transformations on the input and the weights so that the subsequent computation can be done faster.
I'd suggest you try the following:
- Call conv.run() once as a warm-up and don't time it. Then do the actual computation by calling run() again and assess the performance; it will be faster.
- Build with cppthreads=0 openmp=1. This will make ACL use only the openmp scheduler, which scales better as the number of threads increases.
- Use NHWC. Avoid NCHW, as it's been deprecated.
Hope this helps
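A minimal sketch of the warm-up pattern (assuming conv is an already-configured NEDirectConvolutionLayer, as in the benchmark program later in this thread):

conv.run(); // first call is slow: the one-off input/weight transformations happen here

auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 100; ++i) {
    conv.run(); // steady-state iterations: time only these
}
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = end - start;
std::cout << "avg per run: " << (elapsed.count() / 100) << " s\n";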
Hi, @morgolock
Tensor conv_input;
Tensor conv_weight;
Tensor conv_bias;
Tensor conv_output;
const int N = 1;
const int Hi = 1;
const int Wi = 1<<20;
const int Ci = 1;
const int Hf = 1;
const int Wf = 3;
const int Ho = Hi - Hf + 1;
const int Wo = Wi - Wf + 1;
const int Co = 1;
cout << "f_n = " << Wi << "\ng_n = " << Wf << "\nh_n = " << Wo << "\n";
conv_input.allocator()->init(TensorInfo(TensorShape(Ci, Wi, Hi), 1, DataType::F32, DataLayout::NHWC));
conv_weight.allocator()->init(TensorInfo(TensorShape(Hf, Wf, Ci), 1, DataType::F32, DataLayout::NHWC));
// conv_bias.allocator()->init(TensorInfo(TensorShape(Co), 1, DataType::F32));
conv_output.allocator()->init(TensorInfo(TensorShape(Co, Wo, Ho), 1, DataType::F32, DataLayout::NHWC));
Ci is input channels, Wi is input width, Wf is filter width, and so on.
Hi, @morgolock
I have 7 questions (see below).
I found the implementation I want to benchmark in ComputeLibrary/src/cpu/kernels/directconv2d/nhwc/neon/impl.cpp. How do I benchmark this alone? And what is the window reference argument?
I built ARM compute library just for neon support.
I fixed the build archiving issue with alias aarch64-linux-android26-ar=/root/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/llvm-ar
Here's how I built the library.
CC=aarch64-linux-android26-clang CXX=aarch64-linux-android26-clang++ scons build_dir=build_neon/ toolchain_prefix="" Werror=1 -j4 debug=0 asserts=0 neon=1 cppthreads=0 openmp=0 opencl=0 embed_kernels=1 os=android arch=arm64-v8a
no openmp, opencl or cppthreads.
$ strings build/libarm_compute.so | grep arm_compute_version
arm_compute_version=v24.06 Build options: {'build_dir': 'build_neon/', 'toolchain_prefix': '', 'Werror': '1', 'debug': '0', 'asserts': '0', 'neon': '1', 'cppthreads': '0', 'openmp': '0', 'opencl': '0', 'embed_kernels': '1', 'os': 'android', 'arch': 'arm64-v8a'} Git hash=b'93e6401a3bf2da5ed0b19b50625eb3f9edb2b50e'
Important questions:
- Is conv.run() doing things that do not relate to convolution, e.g. runtime/scheduler things?
- How do I benchmark just the convolution implementation?
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "utils/Utils.h"
#include "arm_compute/runtime/NEON/functions/NEDeconvolutionLayer.h"
#include <chrono>
#include <iostream>

using namespace std;
using namespace arm_compute;

// RAII timer: accumulates the elapsed time into *time when it goes out of scope.
struct Timer {
    std::chrono::time_point<std::chrono::high_resolution_clock> start;
    std::chrono::duration<double>* time;
    Timer(std::chrono::duration<double>* time) : start{std::chrono::high_resolution_clock::now()}, time{time} {}
    ~Timer() {
        auto end = std::chrono::high_resolution_clock::now();
        *time += (end - start);
    }
};

int main()
{
    Tensor conv_input;
    Tensor conv_weight;
    Tensor conv_bias;
    Tensor conv_output;

    const int N  = 1;
    const int Hi = 1;
    const int Wi = 1 << 20;
    const int Ci = 1;
    const int Hf = 1;
    const int Wf = 3;
    const int Ho = Hi - Hf + 1;
    const int Wo = Wi - Wf + 1;
    const int Co = 1;
    cout << "f_n = " << Wi << "\ng_n = " << Wf << "\nh_n = " << Wo << "\n";

    conv_input.allocator()->init(TensorInfo(TensorShape(Ci, Wi, Hi), 1, DataType::F32, DataLayout::NHWC));
    conv_weight.allocator()->init(TensorInfo(TensorShape(Hf, Wf, Ci), 1, DataType::F32, DataLayout::NHWC));
    // conv_bias.allocator()->init(TensorInfo(TensorShape(Co), 1, DataType::F32));
    conv_output.allocator()->init(TensorInfo(TensorShape(Co, Wo, Ho), 1, DataType::F32, DataLayout::NHWC));

    conv_input.allocator()->allocate();
    conv_weight.allocator()->allocate();
    // conv_bias.allocator()->allocate();
    conv_output.allocator()->allocate();

    arm_compute::NEDirectConvolutionLayer conv;
    conv.configure(&conv_input,
                   &conv_weight,
                   nullptr, // no bias
                   &conv_output,
                   PadStrideInfo(1, 1, 0, 0),
                   ActivationLayerInfo());

    conv.run(); // warm-up run: ACL's one-off transformations happen here and are not timed
    memset(conv_output.buffer(), 0, conv_output.info()->tensor_shape().total_size() * sizeof(float));

    double n = 100; // run 100 times
    std::chrono::duration<double> total_time(0);
    for (int i = 0; i < n; ++i) {
        // Note: this writes raw rand() ints into the F32 buffers, so the floats
        // hold reinterpreted bit patterns rather than meaningful values.
        for (int j = 0; j < conv_input.info()->tensor_shape().total_size(); ++j) {
            ((int*)conv_input.buffer())[j] = rand();
        }
        for (int j = 0; j < conv_weight.info()->tensor_shape().total_size(); ++j) {
            ((int*)conv_weight.buffer())[j] = rand();
        }
        memset(conv_output.buffer(), 0, conv_output.info()->tensor_shape().total_size() * sizeof(float));
        {
            Timer timer(&total_time); // when the timer goes out of scope, it adds to total_time
            conv.run();
        }
    }
    std::cout << (total_time.count() / n) << "\n";

    for (int i = 0; i < conv_output.info()->tensor_shape().total_size() && i < 5; ++i) {
        cout << ((float*)conv_output.buffer())[i] << ' ';
    }
    cout << endl;

    float sum = 0;
    for (int i = 0; i < conv_output.info()->tensor_shape().total_size(); ++i) {
        sum += ((float*)conv_output.buffer())[i];
    }
    cout << sum << endl;
    return 0;
}
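For reference, one plausible way to compile this benchmark against the build above (the paths and the exact library set are assumptions; adjust to your tree):

aarch64-linux-android26-clang++ -O3 -std=c++14 benchmark_acl.cpp \
    -I/path/to/ComputeLibrary -I/path/to/ComputeLibrary/include \
    -L/path/to/ComputeLibrary/build_neon -larm_compute \
    -o benchmark_acl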
The average running time for ARM Compute Library's fp32 direct convolution across 100 conv.run() calls was ~0.04 s.
My 1d direct convolution implementation (no openmp, no opencl, just neon) achieved ~0.004 s. Why is it 10x faster than ARM Compute Library?
Hi @poltomo
Thanks, we'll have a look at the performance for this specific configuration.
How do I get the exact inference time of ARM Compute Library's convolution implementations, minus any runtime/scheduler overhead?
There is no easy way to do this; I would suggest having a look at our benchmark graph examples.
If you build the library with benchmark_examples=1, then you can use the instruments to look into the graph example performance and the time consumed by each individual kernel. Potentially you could modify one of these graph examples to use only the configuration you are interested in and then look at the performance using the instruments as shown below.
main# LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args='--target=NEON,--fast-math'
Version = arm_compute_version=v0.0-unreleased Build options: {'standalone': '1', 'test_filter': 'ActivationLayer.cpp', 'opencl': '0', 'neon': '1', 'validation_tests': '1', 'examples': '0', 'debug': '1', 'arch': 'armv8a', 'benchmark_examples': '1'} Git hash=e112ef1cc70bcdc52ded44350e61eb16d74559b3
CommandLine = ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args=--target=NEON,--fast-math
Iterations = 1
Running [0] 'Examples/benchmark_graph_mobilenet_v2'
Threads : 1
Target : Neon
Data type : F32
Data layout : NHWC
Tuner enabled? : false
Cache enabled? : false
Tuner mode : Normal
Tuner file :
MLGO file :
Fast math enabled? : true
SchedulerTimer/Conv+Conv/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #2: AVG=2.3420 ms
SchedulerTimer/Conv+Conv/BatchNorm/CpuIm2ColKernel #1: AVG=12.4220 ms
SchedulerTimer/Conv+Conv/BatchNorm/CpuWeightsReshapeKernel #0: AVG=0.1020 ms
SchedulerTimer/Conv_1+Conv_1/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #97: AVG=3.6140 ms
SchedulerTimer/Conv_1+Conv_1/BatchNorm/CpuWeightsReshapeKernel #96: AVG=15.0640 ms
SchedulerTimer/Logits/AvgPool/CpuPool2dAssemblyWrapperKernel #98: AVG=0.1920 ms
SchedulerTimer/Logits/Conv2d_1c_1x1/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #99: AVG=0.7020 ms
SchedulerTimer/Predictions/Reshape/CpuReshapeKernel #100: AVG=1.1760 ms
SchedulerTimer/Predictions/Softmax/CpuLogits1DMaxKernel/neon_fp32_logits_1d_max #101: AVG=0.0270 ms
SchedulerTimer/Predictions/Softmax/CpuLogits1DSoftmaxKernel/neon_fp32_softmax_logits_1d #102: AVG=0.1710 ms
SchedulerTimer/expanded_conv/depthwise/depthwise+expanded_conv/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #3: AVG=1.8800 ms
SchedulerTimer/expanded_conv/project/Conv2D+expanded_conv/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #5: AVG=1.1850 ms
SchedulerTimer/expanded_conv/project/Conv2D+expanded_conv/project/BatchNorm/CpuWeightsReshapeKernel #4: AVG=0.0640 ms
SchedulerTimer/expanded_conv_1/depthwise/depthwise+expanded_conv_1/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #8: AVG=2.4930 ms
SchedulerTimer/expanded_conv_1/expand/Conv2D+expanded_conv_1/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_smallK_hybrid_fp32_mla_6x4 #7: AVG=5.1230 ms
SchedulerTimer/expanded_conv_1/expand/Conv2D+expanded_conv_1/expand/BatchNorm/CpuWeightsReshapeKernel #6: AVG=0.2100 ms
SchedulerTimer/expanded_conv_1/project/Conv2D+expanded_conv_1/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #10: AVG=1.3540 ms
SchedulerTimer/expanded_conv_1/project/Conv2D+expanded_conv_1/project/BatchNorm/CpuWeightsReshapeKernel #9: AVG=0.1300 ms
SchedulerTimer/expanded_conv_10/depthwise/depthwise+expanded_conv_10/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #59: AVG=0.3390 ms
SchedulerTimer/expanded_conv_10/expand/Conv2D+expanded_conv_10/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #58: AVG=0.9590 ms
SchedulerTimer/expanded_conv_10/expand/Conv2D+expanded_conv_10/expand/BatchNorm/CpuWeightsReshapeKernel #57: AVG=1.3400 ms
SchedulerTimer/expanded_conv_10/project/Conv2D+expanded_conv_10/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #61: AVG=1.3400 ms
SchedulerTimer/expanded_conv_10/project/Conv2D+expanded_conv_10/project/BatchNorm/CpuWeightsReshapeKernel #60: AVG=1.3500 ms
SchedulerTimer/expanded_conv_11/add/CpuAddKernel/neon_fp32_add #67: AVG=0.2580 ms
SchedulerTimer/expanded_conv_11/depthwise/depthwise+expanded_conv_11/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #64: AVG=0.6260 ms
SchedulerTimer/expanded_conv_11/expand/Conv2D+expanded_conv_11/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #63: AVG=2.0200 ms
SchedulerTimer/expanded_conv_11/expand/Conv2D+expanded_conv_11/expand/BatchNorm/CpuWeightsReshapeKernel #62: AVG=2.6450 ms
SchedulerTimer/expanded_conv_11/project/Conv2D+expanded_conv_11/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #66: AVG=1.8870 ms
SchedulerTimer/expanded_conv_11/project/Conv2D+expanded_conv_11/project/BatchNorm/CpuWeightsReshapeKernel #65: AVG=1.9040 ms
SchedulerTimer/expanded_conv_12/add/CpuAddKernel/neon_fp32_add #73: AVG=0.2820 ms
SchedulerTimer/expanded_conv_12/depthwise/depthwise+expanded_conv_12/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #70: AVG=0.5730 ms
SchedulerTimer/expanded_conv_12/expand/Conv2D+expanded_conv_12/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #69: AVG=2.0760 ms
SchedulerTimer/expanded_conv_12/expand/Conv2D+expanded_conv_12/expand/BatchNorm/CpuWeightsReshapeKernel #68: AVG=2.6250 ms
SchedulerTimer/expanded_conv_12/project/Conv2D+expanded_conv_12/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #72: AVG=1.8460 ms
SchedulerTimer/expanded_conv_12/project/Conv2D+expanded_conv_12/project/BatchNorm/CpuWeightsReshapeKernel #71: AVG=1.9330 ms
SchedulerTimer/expanded_conv_13/depthwise/depthwise+expanded_conv_13/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #76: AVG=0.2680 ms
SchedulerTimer/expanded_conv_13/expand/Conv2D+expanded_conv_13/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #75: AVG=2.0940 ms
SchedulerTimer/expanded_conv_13/expand/Conv2D+expanded_conv_13/expand/BatchNorm/CpuWeightsReshapeKernel #74: AVG=2.6290 ms
SchedulerTimer/expanded_conv_13/project/Conv2D+expanded_conv_13/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #78: AVG=0.8540 ms
SchedulerTimer/expanded_conv_13/project/Conv2D+expanded_conv_13/project/BatchNorm/CpuWeightsReshapeKernel #77: AVG=3.1880 ms
SchedulerTimer/expanded_conv_14/add/CpuAddKernel/neon_fp32_add #84: AVG=0.1270 ms
SchedulerTimer/expanded_conv_14/depthwise/depthwise+expanded_conv_14/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #81: AVG=0.2770 ms
SchedulerTimer/expanded_conv_14/expand/Conv2D+expanded_conv_14/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #80: AVG=1.4400 ms
SchedulerTimer/expanded_conv_14/expand/Conv2D+expanded_conv_14/expand/BatchNorm/CpuWeightsReshapeKernel #79: AVG=6.3610 ms
SchedulerTimer/expanded_conv_14/project/Conv2D+expanded_conv_14/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #83: AVG=1.4390 ms
SchedulerTimer/expanded_conv_14/project/Conv2D+expanded_conv_14/project/BatchNorm/CpuWeightsReshapeKernel #82: AVG=5.1470 ms
SchedulerTimer/expanded_conv_15/add/CpuAddKernel/neon_fp32_add #90: AVG=0.1260 ms
SchedulerTimer/expanded_conv_15/depthwise/depthwise+expanded_conv_15/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #87: AVG=0.2780 ms
SchedulerTimer/expanded_conv_15/expand/Conv2D+expanded_conv_15/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #86: AVG=1.4400 ms
SchedulerTimer/expanded_conv_15/expand/Conv2D+expanded_conv_15/expand/BatchNorm/CpuWeightsReshapeKernel #85: AVG=6.3720 ms
SchedulerTimer/expanded_conv_15/project/Conv2D+expanded_conv_15/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #89: AVG=1.4230 ms
SchedulerTimer/expanded_conv_15/project/Conv2D+expanded_conv_15/project/BatchNorm/CpuWeightsReshapeKernel #88: AVG=5.1430 ms
SchedulerTimer/expanded_conv_16/depthwise/depthwise+expanded_conv_16/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #93: AVG=0.2750 ms
SchedulerTimer/expanded_conv_16/expand/Conv2D+expanded_conv_16/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #92: AVG=1.4690 ms
SchedulerTimer/expanded_conv_16/expand/Conv2D+expanded_conv_16/expand/BatchNorm/CpuWeightsReshapeKernel #91: AVG=6.3670 ms
SchedulerTimer/expanded_conv_16/project/Conv2D+expanded_conv_16/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #95: AVG=2.7300 ms
SchedulerTimer/expanded_conv_16/project/Conv2D+expanded_conv_16/project/BatchNorm/CpuWeightsReshapeKernel #94: AVG=10.2600 ms
SchedulerTimer/expanded_conv_2/add/CpuAddKernel/neon_fp32_add #16: AVG=0.9310 ms
SchedulerTimer/expanded_conv_2/depthwise/depthwise+expanded_conv_2/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #13: AVG=2.4990 ms
SchedulerTimer/expanded_conv_2/expand/Conv2D+expanded_conv_2/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #12: AVG=2.4790 ms
SchedulerTimer/expanded_conv_2/expand/Conv2D+expanded_conv_2/expand/BatchNorm/CpuWeightsReshapeKernel #11: AVG=0.3390 ms
SchedulerTimer/expanded_conv_2/project/Conv2D+expanded_conv_2/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #15: AVG=2.1000 ms
SchedulerTimer/expanded_conv_2/project/Conv2D+expanded_conv_2/project/BatchNorm/CpuWeightsReshapeKernel #14: AVG=0.1660 ms
SchedulerTimer/expanded_conv_3/depthwise/depthwise+expanded_conv_3/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #19: AVG=1.0300 ms
SchedulerTimer/expanded_conv_3/expand/Conv2D+expanded_conv_3/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #18: AVG=2.4800 ms
SchedulerTimer/expanded_conv_3/expand/Conv2D+expanded_conv_3/expand/BatchNorm/CpuWeightsReshapeKernel #17: AVG=0.3360 ms
SchedulerTimer/expanded_conv_3/project/Conv2D+expanded_conv_3/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #21: AVG=0.6930 ms
SchedulerTimer/expanded_conv_3/project/Conv2D+expanded_conv_3/project/BatchNorm/CpuWeightsReshapeKernel #20: AVG=0.2140 ms
SchedulerTimer/expanded_conv_4/add/CpuAddKernel/neon_fp32_add #27: AVG=0.3260 ms
SchedulerTimer/expanded_conv_4/depthwise/depthwise+expanded_conv_4/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #24: AVG=0.7980 ms
SchedulerTimer/expanded_conv_4/expand/Conv2D+expanded_conv_4/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #23: AVG=1.0150 ms
SchedulerTimer/expanded_conv_4/expand/Conv2D+expanded_conv_4/expand/BatchNorm/CpuWeightsReshapeKernel #22: AVG=0.4820 ms
SchedulerTimer/expanded_conv_4/project/Conv2D+expanded_conv_4/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #26: AVG=0.8960 ms
SchedulerTimer/expanded_conv_4/project/Conv2D+expanded_conv_4/project/BatchNorm/CpuWeightsReshapeKernel #25: AVG=0.2610 ms
SchedulerTimer/expanded_conv_5/add/CpuAddKernel/neon_fp32_add #33: AVG=0.3260 ms
SchedulerTimer/expanded_conv_5/depthwise/depthwise+expanded_conv_5/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #30: AVG=0.7950 ms
SchedulerTimer/expanded_conv_5/expand/Conv2D+expanded_conv_5/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #29: AVG=1.0570 ms
SchedulerTimer/expanded_conv_5/expand/Conv2D+expanded_conv_5/expand/BatchNorm/CpuWeightsReshapeKernel #28: AVG=0.5120 ms
SchedulerTimer/expanded_conv_5/project/Conv2D+expanded_conv_5/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #32: AVG=0.9380 ms
SchedulerTimer/expanded_conv_5/project/Conv2D+expanded_conv_5/project/BatchNorm/CpuWeightsReshapeKernel #31: AVG=0.2580 ms
SchedulerTimer/expanded_conv_6/depthwise/depthwise+expanded_conv_6/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #36: AVG=0.2630 ms
SchedulerTimer/expanded_conv_6/expand/Conv2D+expanded_conv_6/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #35: AVG=1.0330 ms
SchedulerTimer/expanded_conv_6/expand/Conv2D+expanded_conv_6/expand/BatchNorm/CpuWeightsReshapeKernel #34: AVG=0.4840 ms
SchedulerTimer/expanded_conv_6/project/Conv2D+expanded_conv_6/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #38: AVG=0.4170 ms
SchedulerTimer/expanded_conv_6/project/Conv2D+expanded_conv_6/project/BatchNorm/CpuWeightsReshapeKernel #37: AVG=0.4970 ms
SchedulerTimer/expanded_conv_7/add/CpuAddKernel/neon_fp32_add #44: AVG=0.1770 ms
SchedulerTimer/expanded_conv_7/depthwise/depthwise+expanded_conv_7/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #41: AVG=0.3420 ms
SchedulerTimer/expanded_conv_7/expand/Conv2D+expanded_conv_7/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #40: AVG=0.9150 ms
SchedulerTimer/expanded_conv_7/expand/Conv2D+expanded_conv_7/expand/BatchNorm/CpuWeightsReshapeKernel #39: AVG=1.3650 ms
SchedulerTimer/expanded_conv_7/project/Conv2D+expanded_conv_7/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #43: AVG=0.9360 ms
SchedulerTimer/expanded_conv_7/project/Conv2D+expanded_conv_7/project/BatchNorm/CpuWeightsReshapeKernel #42: AVG=0.9110 ms
SchedulerTimer/expanded_conv_8/add/CpuAddKernel/neon_fp32_add #50: AVG=0.1770 ms
SchedulerTimer/expanded_conv_8/depthwise/depthwise+expanded_conv_8/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #47: AVG=0.3850 ms
SchedulerTimer/expanded_conv_8/expand/Conv2D+expanded_conv_8/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #46: AVG=0.9580 ms
SchedulerTimer/expanded_conv_8/expand/Conv2D+expanded_conv_8/expand/BatchNorm/CpuWeightsReshapeKernel #45: AVG=1.3400 ms
SchedulerTimer/expanded_conv_8/project/Conv2D+expanded_conv_8/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #49: AVG=0.9050 ms
SchedulerTimer/expanded_conv_8/project/Conv2D+expanded_conv_8/project/BatchNorm/CpuWeightsReshapeKernel #48: AVG=0.8890 ms
SchedulerTimer/expanded_conv_9/add/CpuAddKernel/neon_fp32_add #56: AVG=0.2040 ms
SchedulerTimer/expanded_conv_9/depthwise/depthwise+expanded_conv_9/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #53: AVG=0.3510 ms
SchedulerTimer/expanded_conv_9/expand/Conv2D+expanded_conv_9/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #52: AVG=0.9690 ms
SchedulerTimer/expanded_conv_9/expand/Conv2D+expanded_conv_9/expand/BatchNorm/CpuWeightsReshapeKernel #51: AVG=1.3690 ms
SchedulerTimer/expanded_conv_9/project/Conv2D+expanded_conv_9/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #55: AVG=0.9100 ms
SchedulerTimer/expanded_conv_9/project/Conv2D+expanded_conv_9/project/BatchNorm/CpuWeightsReshapeKernel #54: AVG=0.8850 ms
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 0 second(s)
Please tell me if there are any flags that I am missing out on. I want to be fair to this library.
You'll get the best performance out of ACL if you build with openmp=1 cppthreads=0. These two options will make ACL use multiple threads to run the kernels with openmp. If you build with openmp=0 cppthreads=0, then ACL will use a single-threaded scheduler.
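For example, the earlier scons invocation with only the scheduler options changed (an untested variant; build_dir renamed here just to keep the two builds apart):

CC=aarch64-linux-android26-clang CXX=aarch64-linux-android26-clang++ scons build_dir=build_neon_omp/ toolchain_prefix="" Werror=1 -j4 debug=0 asserts=0 neon=1 cppthreads=0 openmp=1 opencl=0 embed_kernels=1 os=android arch=arm64-v8a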
Is conv.run() doing things that do not relate to convolution? Ex. runtime/scheduler things
Yes, a lot is happening under the hood in NEConvolutionLayer. From the algorithm point of view, depending on the workload configuration (shapes, data types, layouts, etc.), various transformations are used to prepare the data in memory in an optimal way to achieve maximum performance in the computation. This is why the first iteration is costly and slower than the next ones. The options you use to build ACL will also affect performance: enabling one of the schedulers (openmp or cppthreads) will make ACL use multiple threads to run the kernels.
Are my tensor initializations plus their dimensions optimal for a 1D NHWC convolution?
ACL has been designed to run efficiently the most common workloads present in major models, like the ones you can see in our graph examples. Could you please let us know what the use case is for this specific shape and configuration you are running? Is this from a concrete ML model? I assume when you say optimal you mean from the performance perspective?
What is the actual device and version of Android you are using to run your test?
Hope this helps
Hi, @morgolock I am targeting Android API 26.
I found out how to call the direct convolution kernel directly, without the runtime:
#include "src/cpu/kernels/directconv2d/nhwc/neon/fp32.cpp"
...
Window win = calculate_max_window(*conv_output.info(), Steps());
arm_compute::cpu::kernels::neon_fp32_nhwc_directconv2d(win, &conv_input, &conv_weight, &conv_output, PadStrideInfo(1,1,0,0));
Thankfully, this works in the 1d case.
It's about 20 to 30x slower than my 1d convolution implementation. It's slow for openmp builds and for NEON-only builds alike.
I guess that's alright since 2d is in the name of the kernel. I'd be happy to just add the op to the library. How do I do that?
I think this library will have to start fragmenting its convolution implementations; there's just too much performance potential at stake, and it can be done without making things messy. NEConvolutionLayer already chooses an implementation for you, so why not explicitly implement popular convs like 3x3 stride 1, Winograd 3x3, and so on?
Hi @poltomo
Please see our contribution guide for more information on how to add a new operator.
NEConvolutionLayer already chooses an implementation for you, so why not explicitly implement popular convs like 3x3 stride 1, winograd 3x3 and so on?
ConvLayer has different convolution methods, and there is a heuristic in place which selects the best method based on the workload configuration (shapes, types, layout, etc.). You would need to add your 1d kernel and make the necessary changes so that NEConvolutionLayer selects the new kernel for the correct workloads.
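For reference, NEConvolutionLayer exposes a static helper that reports which method the heuristic picks for a given workload. A minimal sketch, reusing the tensor names from the benchmark above (the trailing WeightsInfo/dilation/activation arguments have defaults):

// Query the method the heuristic would choose for this configuration.
// Printing the enum as an int gives 0 for ConvolutionMethod::GEMM.
ConvolutionMethod method = NEConvolutionLayer::get_convolution_method(
    conv_input.info(), conv_weight.info(), conv_output.info(),
    PadStrideInfo(1, 1, 0, 0));
std::cout << static_cast<int>(method) << "\n";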
Hope this helps
Closing this due to the lack of activity. Please reopen if you still want to submit a patch contributing to ACL. We'd be happy to review the patch.
Benchmark details: 1D convolution of a 2^16-wide input signal with a length-3 kernel. Both input and output channels are 1. There is no bias term.
Here's my benchmark: benchmark_acl.cpp
Output:
The 0 means that the first enum element, GEMM, is being used. The convolution of 1, 2, 3, ... with 1, 2, 3 is 14, 20, 26, 32, 38, ..., so the correct answer is being computed.
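Checking the first few outputs by hand, correlating the signal (1, 2, 3, 4, 5, ...) with the kernel (1, 2, 3):

1*1 + 2*2 + 3*3 = 14
2*1 + 3*2 + 4*3 = 20
3*1 + 4*2 + 5*3 = 26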
Why is it so slow?
For reference I made my own 1D direct convolution implementation and achieved
time 0.00166391
This was without openmp multithreading, just a plain implementation with compiler optimizations. What could be the reason for this?
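For context, here is a minimal scalar reference for the operation being measured (this is not the author's NEON implementation, just the semantics: valid 1D correlation, single channel, stride 1, no padding, no bias):

#include <vector>
#include <cstddef>

// y[i] = x[i]*k[0] + x[i+1]*k[1] + x[i+2]*k[2], output length n - 3 + 1.
std::vector<float> conv1d_ref(const std::vector<float>& x, const float k[3])
{
    const std::size_t out_n = x.size() - 2;
    std::vector<float> y(out_n);
    for (std::size_t i = 0; i < out_n; ++i) {
        y[i] = x[i] * k[0] + x[i + 1] * k[1] + x[i + 2] * k[2];
    }
    return y;
}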
Also, here is ARM Compute Library's Direct Conv performance:
Output:
time 0.0609249
My device info:
I compiled against the latest Android CPU release shared lib.