ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

Why is 1D convolution on CPU via NEConvolutionLayer so slow? #1119

Open poltomo opened 2 months ago

poltomo commented 2 months ago

Benchmark details: 1D convolution of a 2^20-wide input signal with a length-3 kernel. Both input and output channels are 1. There is no bias term.

% strings arm_compute-v24.06-bin-android-arm64-v8a-neon/lib/arm64-v8a-neon-asserts/libarm_compute.so | grep arm_compute_version

arm_compute_version=v24.06 Build options: {'arch': 'arm64-v8a', 'neon': '1', 'opencl': '0', 'os': 'android', 'build_dir': 'arm64-v8a-neon-asserts', 'asserts': '1', 'Werror': '1', 'embed_kernels': '1'} Git hash=unknown

Here's my benchmark: benchmark_acl.cpp

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "utils/Utils.h"
#include "arm_compute/runtime/NEON/functions/NEDeconvolutionLayer.h"

#include <chrono>
#include <iostream>

using namespace std;
using namespace arm_compute;

struct Timer {
    std::chrono::time_point<std::chrono::high_resolution_clock> start;
    Timer() {
        start = std::chrono::high_resolution_clock::now();
    }
    ~Timer() {
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> duration = end - start;
        std::cout << "time "<< duration.count() << '\n';
    }
};

int main()
{
    Tensor conv_input;
    Tensor conv_weight;
    Tensor conv_bias;
    Tensor conv_output;

    const int N = 1;
    const int Hi = 1;
    const int Wi = 1<<20;
    const int Ci = 1;

    const int Hf = 1;
    const int Wf = 3;

    const int Ho = Hi - Hf + 1;
    const int Wo = Wi - Wf + 1;
    const int Co = 1;

    conv_input.allocator()->init(TensorInfo(TensorShape(Hi, Wi, Ci), 1, DataType::F32, DataLayout::NHWC));
    conv_weight.allocator()->init(TensorInfo(TensorShape(Hf, Wf, Ci, Co), 1, DataType::F32, DataLayout::NHWC));
    // conv_bias.allocator()->init(TensorInfo(TensorShape(Co), 1, DataType::F32));
    conv_output.allocator()->init(TensorInfo(TensorShape(Ho, Wo, Co), 1, DataType::F32, DataLayout::NHWC));

    conv_input.allocator()->allocate();
    conv_weight.allocator()->allocate();
    // conv_bias.allocator()->allocate();
    conv_output.allocator()->allocate();

    for (int i = 0;i < conv_input.info()->tensor_shape().total_size();++i) {
        ((float*)conv_input.buffer())[i] = i + 1;
    }
    for (int i = 0;i < conv_weight.info()->tensor_shape().total_size();++i) {
        ((float*)conv_weight.buffer())[i] = i + 1;
    }

    NEConvolutionLayer conv;

// enum class ConvolutionMethod
// {
//     GEMM,        /**< Convolution using GEMM */
//     GEMM_CONV2D, /**< Direct 2D GEMM convolution */
//     DIRECT,      /**< Direct convolution */
//     INDIRECT,    /**< Indirect convolution */
//     WINOGRAD,    /**< Convolution using Winograd */
//     FFT          /**< Convolution using FFT */
// };

// prints the number of the method: 0 for GEMM, 1 for GEMM_CONV2D, ...
    cout << (int)NEConvolutionLayer::get_convolution_method(conv_input.info(), conv_weight.info(),
                   conv_output.info(),
                   PadStrideInfo(1, 1, 0, 0)
                   ,WeightsInfo()
                   ,Size2D(1U, 1U)
                ,ActivationLayerInfo()
                ,true) << endl;

    conv.configure(&conv_input,
                   &conv_weight,
                nullptr,
                   &conv_output,
                   PadStrideInfo(1, 1, 0, 0)
                   ,WeightsInfo()
                   ,Size2D(1U, 1U)
                ,ActivationLayerInfo()
                ,true // fast math enabled
                   );

    {
        Timer timer;
        conv.run();
    }

// verify first 5 elements of output
    for (int i = 0;i < conv_output.info()->tensor_shape().total_size() && i < 5;++i) {
        cout << ((float*)conv_output.buffer())[i] << ' ';
    } cout << endl;

// compute sum of output to prevent compiler from removing the convolution calculation
    float sum = 0;
    for (int i = 0;i < conv_output.info()->tensor_shape().total_size();++i) {
        sum += ((float*)conv_output.buffer())[i];
    }
    cout << sum << endl;

    return 0;
}

output

0
time 0.0481771
14 20 26 32 38 
3.29949e+12

The 0 means that the first enum value, GEMM, is being used. The convolution of 1, 2, 3, ... with 1, 2, 3 is 14, 20, 26, 32, 38, ... (e.g. 1·1 + 2·2 + 3·3 = 14 and 2·1 + 3·2 + 4·3 = 20), so the correct answer is being computed.

Why is it so slow?

For reference, my own 1D direct convolution implementation achieved a time of 0.00166391 s. This was without OpenMP multithreading, just a plain implementation with compiler optimizations.
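For illustration, here is a sketch of the kind of plain implementation I mean (not my exact code): a length-3 kernel, stride 1, no padding, with vectorization left to the compiler.

// Sketch of a plain 1D direct convolution with a length-3 kernel (stride 1,
// no padding); the compiler is left to auto-vectorize the loop.
void conv1d_k3(const float* in, const float* w, float* out, int n_out)
{
    const float w0 = w[0], w1 = w[1], w2 = w[2];
    for (int i = 0; i < n_out; ++i)
    {
        out[i] = in[i] * w0 + in[i + 1] * w1 + in[i + 2] * w2;
    }
}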

What could be the reason for this?

Also, here is ARM Compute Library's Direct Conv performance:

arm_compute::NEDirectConvolutionLayer conv;
conv.configure(&conv_input,
                   &conv_weight,
                nullptr,
                   &conv_output,
                   PadStrideInfo(1, 1, 0, 0),
                ActivationLayerInfo()
                   );

    {
        Timer timer;
        conv.run();
    }

output

time 0.0609249

my device info

CPU Support ARM NEON: Yes
CPU Support ARM BF16: No
CPU Support ARM EDSP: No
CPU Support ARM VFPV4: Yes
CPU Support ARM ASIMDHP: Yes
CPU Support ARM CPUID: Yes
CPU Support ARM ASIMDDP: Yes
CPU Support ARM ASIMDFHM: No
CPU Support ARM I8MM: No
CPU Support ARM SVE: No
CPU Support ARM SVE2: No
CPU Support ARM SVEBF16: No
CPU Support ARM SVEI8MM: No
CPU Support ARM SVEF32MM: No
RISCV: No
RISCV ZFH: No
RISCV vector length in bytes: 0
CPU COUNT: 8
LITTLE CPU COUNT: 4
BIG CPU COUNT: 4
PHYSICAL CPU COUNT: 8
PHYSICAL LITTLE CPU COUNT: 4
PHYSICAL BIG CPU COUNT: 4
CPU LEVEL2 cache size: 256 KB
CPU LEVEL3 cache size: 0 KB

I compiled against the latest Android CPU release shared lib:

aarch64-linux-android26-clang++ -Ofast -ffast-math src/benchmark_acl.cpp arm_compute-v24.06-bin-android-arm64-v8a-neon/utils/Utils.cpp -Iarm_compute-v24.06-bin-android-arm64-v8a-neon -Iarm_compute-v24.06-bin-android-arm64-v8a-neon/include -std=c++14 -Larm_compute-v24.06-bin-android-arm64-v8a-neon -L arm_compute-v24.06-bin-android-arm64-v8a-neon/lib/arm64-v8a-neon/ -larm_compute-static -o bin/benchmark_acl -static-libstdc++ -pie
morgolock commented 2 months ago

Hi @poltomo

The first iteration is costly because ACL performs various transformations on the input and the weights so that the subsequent computation can be done faster.

I'd suggest you try the following:

  1. Do a warmup call to conv.run() and don't time it. Then do the actual computation by calling run() again and assess the performance; it will be faster (see the sketch after this list).
  2. Build ACL with cppthreads=0 openmp=1. This will make ACL use only the OpenMP scheduler, which scales better as the number of threads increases.
  3. You will get the best performance with NHWC. Avoid NCHW as it has been deprecated.
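
A minimal sketch of point 1, assuming the conv object and chrono headers from your benchmark above:

conv.run(); // warmup: the one-off input/weight transformations happen here, untimed

auto start = std::chrono::high_resolution_clock::now();
conv.run(); // steady-state run
auto end = std::chrono::high_resolution_clock::now();
std::cout << "steady-state time "
          << std::chrono::duration<double>(end - start).count() << '\n';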

Hope this helps

poltomo commented 2 months ago

Hi, @morgolock

1. I tried the warmup call, but it is still 10x slower than my implementation. I got 0.017 seconds.

2. I think I am using NHWC. Could you confirm that my tensor initializations are actually NHWC?

Tensor conv_input;
Tensor conv_weight;
Tensor conv_bias;
Tensor conv_output;

const int N = 1;
const int Hi = 1;
const int Wi = 1<<20;
const int Ci = 1;

const int Hf = 1;
const int Wf = 3;

const int Ho = Hi - Hf + 1;
const int Wo = Wi - Wf + 1;
const int Co = 1;

cout << "f_n = " << Wi << "\ng_n = " << Wf << "\nh_n = " << Wo << "\n";

conv_input.allocator()->init(TensorInfo(TensorShape(Ci, Wi, Hi), 1, DataType::F32, DataLayout::NHWC));
conv_weight.allocator()->init(TensorInfo(TensorShape(Hf, Wf, Ci), 1, DataType::F32, DataLayout::NHWC));
// conv_bias.allocator()->init(TensorInfo(TensorShape(Co), 1, DataType::F32));
conv_output.allocator()->init(TensorInfo(TensorShape(Co, Wo, Ho), 1, DataType::F32, DataLayout::NHWC));

Ci is the input channel count, Wi the input width, Wf the filter width, and so on.
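
As a sanity check (a sketch, relying only on ITensorInfo::dimension()): my understanding is that an ACL TensorShape is given innermost dimension first, so for NHWC dimension 0 should be the channel count. Printing the dimensions makes the intended mapping explicit:

// Expected for the NHWC input tensor above: dim0 = Ci, dim1 = Wi, dim2 = Hi.
std::cout << "dim0=" << conv_input.info()->dimension(0)
          << " dim1=" << conv_input.info()->dimension(1)
          << " dim2=" << conv_input.info()->dimension(2) << std::endl;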

poltomo commented 2 months ago

Hi, @morgolock

I have 7 questions (see the numbers below).

1. High Level Question: How do I get the exact inference time of ARM Compute Library's convolution implementations, minus any runtime/scheduler overhead?

I found the implementation I want to benchmark here: ComputeLibrary/src/cpu/kernels/directconv2d/nhwc/neon/impl.cpp. How do I benchmark this alone? What is the window reference argument?

I built ARM Compute Library with just NEON support. I fixed the build archiving issue with alias aarch64-linux-android26-ar=/root/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/llvm-ar

Here's how I built the library.

2. Please tell me if there are any flags that I am missing out on. I want to be fair to this library

CC=aarch64-linux-android26-clang CXX=aarch64-linux-android26-clang++ scons build_dir=build_neon/ toolchain_prefix="" Werror=1 -j4 debug=0 asserts=0 neon=1 cppthreads=0 openmp=0 opencl=0 embed_kernels=1 os=android arch=arm64-v8a

no openmp, opencl or cppthreads.

3. Please let me know if the build configuration is not being fair to ARM compute library.

$ strings build/libarm_compute.so | grep arm_compute_version
arm_compute_version=v24.06 Build options: {'build_dir': 'build_neon/', 'toolchain_prefix': '', 'Werror': '1', 'debug': '0', 'asserts': '0', 'neon': '1', 'cppthreads': '0', 'openmp': '0', 'opencl': '0', 'embed_kernels': '1', 'os': 'android', 'arch': 'arm64-v8a'} Git hash=b'93e6401a3bf2da5ed0b19b50625eb3f9edb2b50e'

Here's my benchmark for ARM Compute Library

Important Questions:

4. Is conv.run() doing things that do not relate to convolution, e.g. runtime/scheduler work?

5. How do I benchmark just the convolution implementation?

6. Are my tensor initializations plus their dimensions optimal for a 1D NHWC convolution?

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "utils/Utils.h"
#include "arm_compute/runtime/NEON/functions/NEDeconvolutionLayer.h"
#include <chrono>

#include <iostream>

using namespace std;
using namespace arm_compute;

struct Timer {
    std::chrono::time_point<std::chrono::high_resolution_clock> start;
    std::chrono::duration<double>* time;
    Timer(std::chrono::duration<double>* time) : start{std::chrono::high_resolution_clock::now()}, time{time} {}
    ~Timer() {
        auto end = std::chrono::high_resolution_clock::now();
        *time += (end - start);
    }
};

int main()
{
    Tensor conv_input;
    Tensor conv_weight;
    Tensor conv_bias;
    Tensor conv_output;

    const int N = 1;
    const int Hi = 1;
    const int Wi = 1<<20;
    const int Ci = 1;

    const int Hf = 1;
    const int Wf = 3;

    const int Ho = Hi - Hf + 1;
    const int Wo = Wi - Wf + 1;
    const int Co = 1;

    cout << "f_n = " << Wi << "\ng_n = " << Wf << "\nh_n = " << Wo << "\n";

    conv_input.allocator()->init(TensorInfo(TensorShape(Ci, Wi, Hi), 1, DataType::F32, DataLayout::NHWC));
    conv_weight.allocator()->init(TensorInfo(TensorShape(Hf, Wf, Ci), 1, DataType::F32, DataLayout::NHWC));
    // conv_bias.allocator()->init(TensorInfo(TensorShape(Co), 1, DataType::F32));
    conv_output.allocator()->init(TensorInfo(TensorShape(Co, Wo, Ho), 1, DataType::F32, DataLayout::NHWC));

    conv_input.allocator()->allocate();
    conv_weight.allocator()->allocate();
    // conv_bias.allocator()->allocate();
    conv_output.allocator()->allocate();

    arm_compute::NEDirectConvolutionLayer conv;

    conv.configure(&conv_input,
                   &conv_weight,
                nullptr, // no bias
                   &conv_output,
                   PadStrideInfo(1, 1, 0, 0)
                ,ActivationLayerInfo()
                   );

    conv.run();
    memset(conv_output.buffer(), 0, conv_output.info()->tensor_shape().total_size() * sizeof(float));

    double n = 100; // run 100 times
    std::chrono::duration<double> total_time(0);
    for (int i = 0;i < n;++i) {
        for (int i = 0;i < conv_input.info()->tensor_shape().total_size();++i) {
            ((float*)conv_input.buffer())[i] = (float)rand(); // fill input with random values
        }
        for (int i = 0;i < conv_weight.info()->tensor_shape().total_size();++i) {
            ((float*)conv_weight.buffer())[i] = (float)rand(); // fill weights with random values
        }
        memset(conv_output.buffer(), 0, conv_output.info()->tensor_shape().total_size() * sizeof(float));
        {
            Timer timer(&total_time); // when the timer goes out of scope, it adds the elapsed time to total_time
            conv.run();
        }
    }
    std::cout << (total_time.count() / n) << "\n";

    for (int i = 0;i < conv_output.info()->tensor_shape().total_size() && i < 5;++i) {
        cout << ((float*)conv_output.buffer())[i] << ' ';
    } cout << endl;

    float sum = 0;
    for (int i = 0;i < conv_output.info()->tensor_shape().total_size();++i) {
        sum += ((float*)conv_output.buffer())[i];
    }
    cout << sum << endl;

    return 0;
}

results

The average running time for ARM Compute Library's fp32 direct convolution across 100 conv.run() calls was ~0.04 s. My 1D direct convolution implementation (no OpenMP, no OpenCL, just NEON) achieved ~0.004 s. Why is it 10x faster than ARM Compute Library?

7. Why is this the case?

morgolock commented 1 month ago

Hi @poltomo

Thanks, we'll have a look at the performance for this specific configuration.

How do I get the exact inference time of ARM Compute Library's convolution implementations, minus any runtime/scheduler overhead?

There is no easy way to do this. I would suggest having a look at our benchmark graph examples. If you build the library with benchmark_examples=1 then you can use the instruments to look into the graph example performance and the time consumed by each individual kernel. Potentially you could modify one of these graph examples to use only the configuration you are interested in and then look at the performance using the instruments, as shown below.

main# LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args='--target=NEON,--fast-math'
Version = arm_compute_version=v0.0-unreleased Build options: {'standalone': '1', 'test_filter': 'ActivationLayer.cpp', 'opencl': '0', 'neon': '1', 'validation_tests': '1', 'examples': '0', 'debug': '1', 'arch': 'armv8a', 'benchmark_examples': '1'} Git hash=e112ef1cc70bcdc52ded44350e61eb16d74559b3
CommandLine = ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args=--target=NEON,--fast-math 
Iterations = 1
Running [0] 'Examples/benchmark_graph_mobilenet_v2'
Threads : 1
Target : Neon
Data type : F32
Data layout : NHWC
Tuner enabled? : false
Cache enabled? : false
Tuner mode : Normal
Tuner file : 
MLGO file : 
Fast math enabled? : true

  SchedulerTimer/Conv+Conv/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #2:    AVG=2.3420 ms
  SchedulerTimer/Conv+Conv/BatchNorm/CpuIm2ColKernel #1:    AVG=12.4220 ms
  SchedulerTimer/Conv+Conv/BatchNorm/CpuWeightsReshapeKernel #0:    AVG=0.1020 ms
  SchedulerTimer/Conv_1+Conv_1/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #97:    AVG=3.6140 ms
  SchedulerTimer/Conv_1+Conv_1/BatchNorm/CpuWeightsReshapeKernel #96:    AVG=15.0640 ms
  SchedulerTimer/Logits/AvgPool/CpuPool2dAssemblyWrapperKernel #98:    AVG=0.1920 ms
  SchedulerTimer/Logits/Conv2d_1c_1x1/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #99:    AVG=0.7020 ms
  SchedulerTimer/Predictions/Reshape/CpuReshapeKernel #100:    AVG=1.1760 ms
  SchedulerTimer/Predictions/Softmax/CpuLogits1DMaxKernel/neon_fp32_logits_1d_max #101:    AVG=0.0270 ms
  SchedulerTimer/Predictions/Softmax/CpuLogits1DSoftmaxKernel/neon_fp32_softmax_logits_1d #102:    AVG=0.1710 ms
  SchedulerTimer/expanded_conv/depthwise/depthwise+expanded_conv/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #3:    AVG=1.8800 ms
  SchedulerTimer/expanded_conv/project/Conv2D+expanded_conv/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #5:    AVG=1.1850 ms
  SchedulerTimer/expanded_conv/project/Conv2D+expanded_conv/project/BatchNorm/CpuWeightsReshapeKernel #4:    AVG=0.0640 ms
  SchedulerTimer/expanded_conv_1/depthwise/depthwise+expanded_conv_1/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #8:    AVG=2.4930 ms
  SchedulerTimer/expanded_conv_1/expand/Conv2D+expanded_conv_1/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_smallK_hybrid_fp32_mla_6x4 #7:    AVG=5.1230 ms
  SchedulerTimer/expanded_conv_1/expand/Conv2D+expanded_conv_1/expand/BatchNorm/CpuWeightsReshapeKernel #6:    AVG=0.2100 ms
  SchedulerTimer/expanded_conv_1/project/Conv2D+expanded_conv_1/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #10:    AVG=1.3540 ms
  SchedulerTimer/expanded_conv_1/project/Conv2D+expanded_conv_1/project/BatchNorm/CpuWeightsReshapeKernel #9:    AVG=0.1300 ms
  SchedulerTimer/expanded_conv_10/depthwise/depthwise+expanded_conv_10/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #59:    AVG=0.3390 ms
  SchedulerTimer/expanded_conv_10/expand/Conv2D+expanded_conv_10/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #58:    AVG=0.9590 ms
  SchedulerTimer/expanded_conv_10/expand/Conv2D+expanded_conv_10/expand/BatchNorm/CpuWeightsReshapeKernel #57:    AVG=1.3400 ms
  SchedulerTimer/expanded_conv_10/project/Conv2D+expanded_conv_10/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #61:    AVG=1.3400 ms
  SchedulerTimer/expanded_conv_10/project/Conv2D+expanded_conv_10/project/BatchNorm/CpuWeightsReshapeKernel #60:    AVG=1.3500 ms
  SchedulerTimer/expanded_conv_11/add/CpuAddKernel/neon_fp32_add #67:    AVG=0.2580 ms
  SchedulerTimer/expanded_conv_11/depthwise/depthwise+expanded_conv_11/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #64:    AVG=0.6260 ms
  SchedulerTimer/expanded_conv_11/expand/Conv2D+expanded_conv_11/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #63:    AVG=2.0200 ms
  SchedulerTimer/expanded_conv_11/expand/Conv2D+expanded_conv_11/expand/BatchNorm/CpuWeightsReshapeKernel #62:    AVG=2.6450 ms
  SchedulerTimer/expanded_conv_11/project/Conv2D+expanded_conv_11/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #66:    AVG=1.8870 ms
  SchedulerTimer/expanded_conv_11/project/Conv2D+expanded_conv_11/project/BatchNorm/CpuWeightsReshapeKernel #65:    AVG=1.9040 ms
  SchedulerTimer/expanded_conv_12/add/CpuAddKernel/neon_fp32_add #73:    AVG=0.2820 ms
  SchedulerTimer/expanded_conv_12/depthwise/depthwise+expanded_conv_12/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #70:    AVG=0.5730 ms
  SchedulerTimer/expanded_conv_12/expand/Conv2D+expanded_conv_12/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #69:    AVG=2.0760 ms
  SchedulerTimer/expanded_conv_12/expand/Conv2D+expanded_conv_12/expand/BatchNorm/CpuWeightsReshapeKernel #68:    AVG=2.6250 ms
  SchedulerTimer/expanded_conv_12/project/Conv2D+expanded_conv_12/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #72:    AVG=1.8460 ms
  SchedulerTimer/expanded_conv_12/project/Conv2D+expanded_conv_12/project/BatchNorm/CpuWeightsReshapeKernel #71:    AVG=1.9330 ms
  SchedulerTimer/expanded_conv_13/depthwise/depthwise+expanded_conv_13/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #76:    AVG=0.2680 ms
  SchedulerTimer/expanded_conv_13/expand/Conv2D+expanded_conv_13/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #75:    AVG=2.0940 ms
  SchedulerTimer/expanded_conv_13/expand/Conv2D+expanded_conv_13/expand/BatchNorm/CpuWeightsReshapeKernel #74:    AVG=2.6290 ms
  SchedulerTimer/expanded_conv_13/project/Conv2D+expanded_conv_13/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #78:    AVG=0.8540 ms
  SchedulerTimer/expanded_conv_13/project/Conv2D+expanded_conv_13/project/BatchNorm/CpuWeightsReshapeKernel #77:    AVG=3.1880 ms
  SchedulerTimer/expanded_conv_14/add/CpuAddKernel/neon_fp32_add #84:    AVG=0.1270 ms
  SchedulerTimer/expanded_conv_14/depthwise/depthwise+expanded_conv_14/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #81:    AVG=0.2770 ms
  SchedulerTimer/expanded_conv_14/expand/Conv2D+expanded_conv_14/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #80:    AVG=1.4400 ms
  SchedulerTimer/expanded_conv_14/expand/Conv2D+expanded_conv_14/expand/BatchNorm/CpuWeightsReshapeKernel #79:    AVG=6.3610 ms
  SchedulerTimer/expanded_conv_14/project/Conv2D+expanded_conv_14/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #83:    AVG=1.4390 ms
  SchedulerTimer/expanded_conv_14/project/Conv2D+expanded_conv_14/project/BatchNorm/CpuWeightsReshapeKernel #82:    AVG=5.1470 ms
  SchedulerTimer/expanded_conv_15/add/CpuAddKernel/neon_fp32_add #90:    AVG=0.1260 ms
  SchedulerTimer/expanded_conv_15/depthwise/depthwise+expanded_conv_15/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #87:    AVG=0.2780 ms
  SchedulerTimer/expanded_conv_15/expand/Conv2D+expanded_conv_15/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #86:    AVG=1.4400 ms
  SchedulerTimer/expanded_conv_15/expand/Conv2D+expanded_conv_15/expand/BatchNorm/CpuWeightsReshapeKernel #85:    AVG=6.3720 ms
  SchedulerTimer/expanded_conv_15/project/Conv2D+expanded_conv_15/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #89:    AVG=1.4230 ms
  SchedulerTimer/expanded_conv_15/project/Conv2D+expanded_conv_15/project/BatchNorm/CpuWeightsReshapeKernel #88:    AVG=5.1430 ms
  SchedulerTimer/expanded_conv_16/depthwise/depthwise+expanded_conv_16/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #93:    AVG=0.2750 ms
  SchedulerTimer/expanded_conv_16/expand/Conv2D+expanded_conv_16/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #92:    AVG=1.4690 ms
  SchedulerTimer/expanded_conv_16/expand/Conv2D+expanded_conv_16/expand/BatchNorm/CpuWeightsReshapeKernel #91:    AVG=6.3670 ms
  SchedulerTimer/expanded_conv_16/project/Conv2D+expanded_conv_16/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #95:    AVG=2.7300 ms
  SchedulerTimer/expanded_conv_16/project/Conv2D+expanded_conv_16/project/BatchNorm/CpuWeightsReshapeKernel #94:    AVG=10.2600 ms
  SchedulerTimer/expanded_conv_2/add/CpuAddKernel/neon_fp32_add #16:    AVG=0.9310 ms
  SchedulerTimer/expanded_conv_2/depthwise/depthwise+expanded_conv_2/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #13:    AVG=2.4990 ms
  SchedulerTimer/expanded_conv_2/expand/Conv2D+expanded_conv_2/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #12:    AVG=2.4790 ms
  SchedulerTimer/expanded_conv_2/expand/Conv2D+expanded_conv_2/expand/BatchNorm/CpuWeightsReshapeKernel #11:    AVG=0.3390 ms
  SchedulerTimer/expanded_conv_2/project/Conv2D+expanded_conv_2/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #15:    AVG=2.1000 ms
  SchedulerTimer/expanded_conv_2/project/Conv2D+expanded_conv_2/project/BatchNorm/CpuWeightsReshapeKernel #14:    AVG=0.1660 ms
  SchedulerTimer/expanded_conv_3/depthwise/depthwise+expanded_conv_3/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #19:    AVG=1.0300 ms
  SchedulerTimer/expanded_conv_3/expand/Conv2D+expanded_conv_3/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #18:    AVG=2.4800 ms
  SchedulerTimer/expanded_conv_3/expand/Conv2D+expanded_conv_3/expand/BatchNorm/CpuWeightsReshapeKernel #17:    AVG=0.3360 ms
  SchedulerTimer/expanded_conv_3/project/Conv2D+expanded_conv_3/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #21:    AVG=0.6930 ms
  SchedulerTimer/expanded_conv_3/project/Conv2D+expanded_conv_3/project/BatchNorm/CpuWeightsReshapeKernel #20:    AVG=0.2140 ms
  SchedulerTimer/expanded_conv_4/add/CpuAddKernel/neon_fp32_add #27:    AVG=0.3260 ms
  SchedulerTimer/expanded_conv_4/depthwise/depthwise+expanded_conv_4/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #24:    AVG=0.7980 ms
  SchedulerTimer/expanded_conv_4/expand/Conv2D+expanded_conv_4/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #23:    AVG=1.0150 ms
  SchedulerTimer/expanded_conv_4/expand/Conv2D+expanded_conv_4/expand/BatchNorm/CpuWeightsReshapeKernel #22:    AVG=0.4820 ms
  SchedulerTimer/expanded_conv_4/project/Conv2D+expanded_conv_4/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #26:    AVG=0.8960 ms
  SchedulerTimer/expanded_conv_4/project/Conv2D+expanded_conv_4/project/BatchNorm/CpuWeightsReshapeKernel #25:    AVG=0.2610 ms
  SchedulerTimer/expanded_conv_5/add/CpuAddKernel/neon_fp32_add #33:    AVG=0.3260 ms
  SchedulerTimer/expanded_conv_5/depthwise/depthwise+expanded_conv_5/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output4x4_mla_depthfirst #30:    AVG=0.7950 ms
  SchedulerTimer/expanded_conv_5/expand/Conv2D+expanded_conv_5/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #29:    AVG=1.0570 ms
  SchedulerTimer/expanded_conv_5/expand/Conv2D+expanded_conv_5/expand/BatchNorm/CpuWeightsReshapeKernel #28:    AVG=0.5120 ms
  SchedulerTimer/expanded_conv_5/project/Conv2D+expanded_conv_5/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #32:    AVG=0.9380 ms
  SchedulerTimer/expanded_conv_5/project/Conv2D+expanded_conv_5/project/BatchNorm/CpuWeightsReshapeKernel #31:    AVG=0.2580 ms
  SchedulerTimer/expanded_conv_6/depthwise/depthwise+expanded_conv_6/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s2_output2x2_mla_depthfirst #36:    AVG=0.2630 ms
  SchedulerTimer/expanded_conv_6/expand/Conv2D+expanded_conv_6/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #35:    AVG=1.0330 ms
  SchedulerTimer/expanded_conv_6/expand/Conv2D+expanded_conv_6/expand/BatchNorm/CpuWeightsReshapeKernel #34:    AVG=0.4840 ms
  SchedulerTimer/expanded_conv_6/project/Conv2D+expanded_conv_6/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #38:    AVG=0.4170 ms
  SchedulerTimer/expanded_conv_6/project/Conv2D+expanded_conv_6/project/BatchNorm/CpuWeightsReshapeKernel #37:    AVG=0.4970 ms
  SchedulerTimer/expanded_conv_7/add/CpuAddKernel/neon_fp32_add #44:    AVG=0.1770 ms
  SchedulerTimer/expanded_conv_7/depthwise/depthwise+expanded_conv_7/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #41:    AVG=0.3420 ms
  SchedulerTimer/expanded_conv_7/expand/Conv2D+expanded_conv_7/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #40:    AVG=0.9150 ms
  SchedulerTimer/expanded_conv_7/expand/Conv2D+expanded_conv_7/expand/BatchNorm/CpuWeightsReshapeKernel #39:    AVG=1.3650 ms
  SchedulerTimer/expanded_conv_7/project/Conv2D+expanded_conv_7/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #43:    AVG=0.9360 ms
  SchedulerTimer/expanded_conv_7/project/Conv2D+expanded_conv_7/project/BatchNorm/CpuWeightsReshapeKernel #42:    AVG=0.9110 ms
  SchedulerTimer/expanded_conv_8/add/CpuAddKernel/neon_fp32_add #50:    AVG=0.1770 ms
  SchedulerTimer/expanded_conv_8/depthwise/depthwise+expanded_conv_8/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #47:    AVG=0.3850 ms
  SchedulerTimer/expanded_conv_8/expand/Conv2D+expanded_conv_8/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #46:    AVG=0.9580 ms
  SchedulerTimer/expanded_conv_8/expand/Conv2D+expanded_conv_8/expand/BatchNorm/CpuWeightsReshapeKernel #45:    AVG=1.3400 ms
  SchedulerTimer/expanded_conv_8/project/Conv2D+expanded_conv_8/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #49:    AVG=0.9050 ms
  SchedulerTimer/expanded_conv_8/project/Conv2D+expanded_conv_8/project/BatchNorm/CpuWeightsReshapeKernel #48:    AVG=0.8890 ms
  SchedulerTimer/expanded_conv_9/add/CpuAddKernel/neon_fp32_add #56:    AVG=0.2040 ms
  SchedulerTimer/expanded_conv_9/depthwise/depthwise+expanded_conv_9/depthwise/BatchNorm/CpuDepthwiseConv2dAssemblyWrapperKernel/a64_fp32_nhwc_3x3_s1_output2x2_mla_depthfirst #53:    AVG=0.3510 ms
  SchedulerTimer/expanded_conv_9/expand/Conv2D+expanded_conv_9/expand/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_4x24 #52:    AVG=0.9690 ms
  SchedulerTimer/expanded_conv_9/expand/Conv2D+expanded_conv_9/expand/BatchNorm/CpuWeightsReshapeKernel #51:    AVG=1.3690 ms
  SchedulerTimer/expanded_conv_9/project/Conv2D+expanded_conv_9/project/BatchNorm/CpuGemmAssemblyWrapperKernel/a64_hybrid_fp32_mla_6x16 #55:    AVG=0.9100 ms
  SchedulerTimer/expanded_conv_9/project/Conv2D+expanded_conv_9/project/BatchNorm/CpuWeightsReshapeKernel #54:    AVG=0.8850 ms
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 0 second(s)

Please tell me if there are any flags that I am missing out on. I want to be fair to this library

You'll get the best performance out of ACL if you build with openmp=1 cppthreads=0. These two options will make ACL use multiple threads to run the kernels with OpenMP. If you build with openmp=0 cppthreads=0 then ACL will have a single-threaded scheduler.
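
For example, taking the scons invocation from your earlier message and changing only the scheduler option (a sketch, untested on my side):

CC=aarch64-linux-android26-clang CXX=aarch64-linux-android26-clang++ scons build_dir=build_neon/ toolchain_prefix="" Werror=1 -j4 debug=0 asserts=0 neon=1 cppthreads=0 openmp=1 opencl=0 embed_kernels=1 os=android arch=arm64-v8a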

Is conv.run() doing things that do not relate to convolution? Ex. runtime/scheduler things

Yes, a lot is happening under the hood in NEConvolutionLayer. From the algorithm point of view, depending on the workload configuration (shapes, data types, layouts, etc.), various transformations are used to prepare the data in memory in an optimal way to achieve maximum performance in the computation. This is the reason why the first iteration is costly and slower than the subsequent ones. The options you use to build ACL will also affect the performance: enabling one of the schedulers (openmp or cppthreads) will make ACL use multiple threads to run the kernels.
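
If you want to keep that preparation cost out of the timed region, one option (a sketch; it assumes the conv object from your benchmark and relies on the function's prepare() method performing the one-off transformations) is:

conv.prepare(); // one-off weight/input transformations, triggered up front
auto t0 = std::chrono::high_resolution_clock::now();
conv.run();     // only the steady-state execution is timed
auto t1 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration<double>(t1 - t0).count() << " s\n";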

Are my tensor initializations plus their dimensions optimal for a 1D NHWC convolution?

ACL has been designed to run the most common workloads efficiently, like the ones you can see in our graph examples. Could you please let us know what the use case is for this specific shape and configuration you are running? Is this from a concrete ML model? I assume that when you say optimal you mean from the performance perspective?

What is the actual device and version of Android you are using to run your test?

Hope this helps

poltomo commented 1 month ago

Hi @morgolock, I am targeting Android API 26.

I found out how to call the direct convolution kernel directly, without the runtime:

#include "src/cpu/kernels/directconv2d/nhwc/neon/fp32.cpp"
...
Window win = calculate_max_window(*conv_output.info(), Steps());
arm_compute::cpu::kernels::neon_fp32_nhwc_directconv2d(win, &conv_input, &conv_weight, &conv_output, PadStrideInfo(1,1,0,0));

Thankfully, this works in the 1d case.

It's about 20 to 30x slower than my 1D convolution implementation. It's slow for OpenMP builds and for NEON-only builds alike.

I guess it's alright since 2d is in the name of the kernel. I'd be happy to just add the op to the library. How do I do that?

I think this library will have to start fragmenting convolution implementations. There's just too much performance potential at stake, and it can be done without making things messy. NEConvolutionLayer already chooses an implementation for you, so why not explicitly implement popular convs like 3x3 stride 1, Winograd 3x3, and so on?

morgolock commented 1 month ago

Hi @poltomo

Please see our contribution guide for more information on how to add a new operator.

NEConvolutionLayer already chooses an implementation for you, so why not explicitly implement popular convs like 3x3 stride 1, winograd 3x3 and so on?

ConvLayer has different convolution methods and there is a heuristic in place which selects the best method based on the workload configuration (shapes, types, layout, etc.). You would need to add your 1D kernel and make the necessary changes so that NEConvolutionLayer selects the new kernel for the correct workloads.

Hope this helps