ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

NEConvolutionLayer is slower than the implementation on TensorflowLite. #1040

Closed · GGGGxxxxxxxxr closed this issue 1 year ago

GGGGxxxxxxxxr commented 1 year ago

Hi,

I have just configured a 540p WDSR model (one block) with the Neon-based Arm Compute Library, using the NHWC layout and the F32 data type.

I also benchmarked the same model structure with TensorFlow Lite on the same Android device. The total execution time (both single-threaded) is 150 ms with Arm Compute Library versus 91 ms with TensorFlow Lite.

I have also benchmarked the individual operators in detail.

For example, the Convolution operation is (3, 960, 540) x (3,5,5,12) -> (12, 960, 540).

The average time for this operator is 71.45 ms with ACL compared to 33.86 ms with TensorFlow Lite.

Here is my benchmark graph with --instruments=SCHEDULER_TIMER_MS

[screenshot: SCHEDULER_TIMER_MS benchmark output for ACL]

Here is the TensorFlow Lite benchmark for the exact same operator:

[screenshot: TensorFlow Lite benchmark output]

Here is the benchmark model code for Arm Compute Library, for your reference:

//Arm Compute Library Source
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/NEON/NEScheduler.h"
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/Allocator.h"
#include "arm_compute/runtime/BlobLifetimeManager.h"
#include "arm_compute/runtime/MemoryManagerOnDemand.h"
#include "arm_compute/runtime/PoolManager.h"
#include "utils/Utils.h"

#include "support/ToolchainSupport.h"
#include "src/core/NEON/NEMath.h"
#include "src/core/NEON/wrapper/intrinsics/intrinsics.h"
//Common Source
#include <arm_neon.h>
#include <cstdlib>
#include <sstream>
#include <time.h>
#include <fstream>
#include <chrono>

using namespace arm_compute;
using namespace utils;

class NEON_WDSR_FP32_Example : public Example
{
public:
    bool do_setup(int argc, char **argv) override
    {
        NEScheduler::get().set_num_threads(1);

        //Create Memory manager components
        auto lifetime_mgr0 = std::make_shared<BlobLifetimeManager>();
        auto lifetime_mgr1 = std::make_shared<BlobLifetimeManager>();
        auto pool_mgr0     = std::make_shared<PoolManager>();
        auto pool_mgr1     = std::make_shared<PoolManager>();
        auto mm_layers      = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr0, pool_mgr0); 
        auto mm_transitions = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr1, pool_mgr1); 

        //set memory manager where allowed to manage internal memory requirements
        conv1_main_branch = std::make_unique<NEConvolutionLayer>(mm_layers);
        conv2_main_branch = std::make_unique<NEConvolutionLayer>(mm_layers);
        conv3_main_branch = std::make_unique<NEConvolutionLayer>(mm_layers);
        conv4_main_branch = std::make_unique<NEConvolutionLayer>(mm_layers);
        conv5_main_branch = std::make_unique<NEConvolutionLayer>(mm_layers);
        sk2_conv          = std::make_unique<NEConvolutionLayer>(mm_layers);

        //[initialize tensors]
        // initialize src tensor
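        // Note: an ACL TensorShape lists dimensions innermost-first, so with the
        // NHWC layout (3, 960, 540) corresponds to C=3, W=960, H=540.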
        TensorShape src_shape(3,960,540);
        src.allocator()->init(TensorInfo(src_shape, 1, DataType::F32, DataLayout::NHWC));
        // initialize conv1_main_branch tensor
        TensorShape conv1_weight_shape(3,3,3,4);
        TensorShape conv1_output_shape(4,960,540);
        conv1_weight.allocator()->init(TensorInfo(conv1_weight_shape, 1, DataType::F32, DataLayout::NHWC));
        conv1_bias.allocator()->init(TensorInfo(TensorShape(4), 1, DataType::F32, DataLayout::NHWC));
        conv1_out.allocator()->init(TensorInfo(conv1_output_shape, 1, DataType::F32, DataLayout::NHWC));

        // initialize conv2_main_branch tensor
        const TensorShape conv2_weight_shape(4,1,1,24);
        const TensorShape conv2_bias_shape(24);
        const TensorShape conv2_output_shape(24, 960, 540);
        conv2_weight.allocator()->init(TensorInfo(conv2_weight_shape, 1, DataType::F32, DataLayout::NHWC));
        conv2_bias.allocator()->init(TensorInfo(conv2_bias_shape, 1, DataType::F32, DataLayout::NHWC));
        conv2_out.allocator()->init(TensorInfo(conv2_output_shape, 1, DataType::F32, DataLayout::NHWC));

        // initialize conv3_main_branch tensor
        const TensorShape conv3_weight_shape(24,1,1,3);
        const TensorShape conv3_bias_shape(3);
        const TensorShape conv3_output_shape(3, 960, 540);
        conv3_weight.allocator()->init(TensorInfo(conv3_weight_shape, 1, DataType::F32, DataLayout::NHWC));
        conv3_bias.allocator()->init(TensorInfo(conv3_bias_shape, 1, DataType::F32, DataLayout::NHWC));
        conv3_out.allocator()->init(TensorInfo(conv3_output_shape, 1, DataType::F32, DataLayout::NHWC));

        // initialize conv4_main_branch tensor
        const TensorShape conv4_weight_shape(3,3,3,4);
        const TensorShape conv4_bias_shape(4);
        const TensorShape conv4_output_shape(4, 960, 540);
        conv4_weight.allocator()->init(TensorInfo(conv4_weight_shape, 1, DataType::F32, DataLayout::NHWC));
        conv4_bias.allocator()->init(TensorInfo(conv4_bias_shape, 1, DataType::F32, DataLayout::NHWC));
        conv4_out.allocator()->init(TensorInfo(conv4_output_shape, 1, DataType::F32, DataLayout::NHWC));

        // initialize conv5_main_branch tensor
        const TensorShape conv5_weight_shape(4,3,3,12);
        const TensorShape conv5_bias_shape(12);
        const TensorShape conv5_output_shape(12, 960, 540);
        conv5_weight.allocator()->init(TensorInfo(conv5_weight_shape, 1, DataType::F32, DataLayout::NHWC));
        conv5_bias.allocator()->init(TensorInfo(conv5_bias_shape, 1, DataType::F32, DataLayout::NHWC));
        conv5_out.allocator()->init(TensorInfo(conv5_output_shape, 1, DataType::F32, DataLayout::NHWC));

        // initialize sk2_conv tensor
        const TensorShape sk2_weight_shape(3,5,5,12);
        const TensorShape sk2_bias_shape(12);
        const TensorShape sk2_output_shape(12, 960, 540);
        sk2_conv_weight.allocator()->init(TensorInfo(sk2_weight_shape, 1, DataType::F32, DataLayout::NHWC));
        sk2_conv_bias.allocator()->init(TensorInfo(sk2_bias_shape, 1, DataType::F32, DataLayout::NHWC));
        sk2_conv_out.allocator()->init(TensorInfo(sk2_output_shape, 1, DataType::F32, DataLayout::NHWC));

        /* -----------end of tensor initialization------------------*/

        // Configure Layers
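        // PadStrideInfo arguments are (stride_x, stride_y, pad_x, pad_y):
        // pad 1 keeps the 960x540 spatial size for the 3x3 kernels, pad 2 for the 5x5 kernel.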
        conv1_main_branch->configure(&src, &conv1_weight, &conv1_bias, &conv1_out, PadStrideInfo(1,1,1,1));
        conv2_main_branch->configure(&conv1_out, &conv2_weight, &conv2_bias, &conv2_out, PadStrideInfo(1,1,0,0));
        conv3_main_branch->configure(&conv2_out, &conv3_weight, &conv3_bias, &conv3_out, PadStrideInfo(1,1,0,0));
        conv4_main_branch->configure(&conv3_out, &conv4_weight, &conv4_bias, &conv4_out, PadStrideInfo(1,1,1,1));
        sk1_add.configure(&conv1_out, &conv4_out, &conv4_out, ConvertPolicy::SATURATE);
        conv5_main_branch->configure(&conv4_out, &conv5_weight, &conv5_bias, &conv5_out, PadStrideInfo(1,1,1,1));
        sk2_conv->configure(&src, &sk2_conv_weight, &sk2_conv_bias, &sk2_conv_out, PadStrideInfo(1,1,2,2));
        sk2_add.configure(&sk2_conv_out, &conv5_out, &conv5_out, ConvertPolicy::SATURATE);

        /* -----------end of layer configuration------------------*/

        // Add Tensor to memory manager
        // 2 memory groups for input and output management
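        // Tensors managed by a memory group only hold backing memory between the
        // group's acquire() and release() calls in do_run(); the tensors allocated
        // directly further below keep their memory for the whole run.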
        memory_group0 = std::make_unique<MemoryGroup>(mm_transitions);
        memory_group1 = std::make_unique<MemoryGroup>(mm_transitions);

        //memory_group0->manage(&conv1_out);
        //conv1_out.allocator()->allocate();
        memory_group1->manage(&conv2_out);
        conv2_out.allocator()->allocate();
        memory_group0->manage(&conv3_out);
        conv3_out.allocator()->allocate();
        //memory_group1->manage(&conv4_out);
        //conv4_out.allocator()->allocate();
        //memory_group0->manage(&conv5_out);
        //conv5_out.allocator()->allocate();
        memory_group1->manage(&sk2_conv_out);
        sk2_conv_out.allocator()->allocate();

        conv1_out.allocator()->allocate();
        conv4_out.allocator()->allocate();
        conv5_out.allocator()->allocate();

        src.allocator()->allocate();
        conv1_weight.allocator()->allocate();
        conv1_bias.allocator()->allocate();

        conv2_weight.allocator()->allocate();
        conv2_bias.allocator()->allocate();

        conv3_weight.allocator()->allocate();
        conv3_bias.allocator()->allocate();

        conv4_weight.allocator()->allocate();
        conv4_bias.allocator()->allocate();

        conv5_weight.allocator()->allocate();
        conv5_bias.allocator()->allocate();

        sk2_conv_weight.allocator()->allocate();
        sk2_conv_bias.allocator()->allocate();

        //populate layer manager
        mm_layers->populate(allocator, 1);
        //populate transition manager
        mm_transitions->populate(allocator, 2);
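        // populate(allocator, num_pools) creates the backing memory pools: one pool
        // for the layers' internal workspaces, two for the transition tensors so that
        // both memory groups above can be acquired at the same time.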

        return true;
    }

    void do_run() override
    {
        //Acquire memory for the memory groups
        memory_group0->acquire();
        memory_group1->acquire();

        sk2_conv->run();

        //test only the skiplink convolution
        //conv1_main_branch->run();
        //conv2_main_branch->run();
        //conv3_main_branch->run();
        //conv4_main_branch->run();
        //sk1_add.run();
        //conv5_main_branch->run();
        //sk2_add.run();

        memory_group0->release();
        memory_group1->release();
    }

private:
    // The src tensor should contain the input image
    Tensor src{};
    // Intermediate tensors used
    Tensor conv1_weight{};
    Tensor conv1_bias{};
    Tensor conv1_out{};

    Tensor conv2_weight{};
    Tensor conv2_bias{};
    Tensor conv2_out{};

    Tensor conv3_weight{};
    Tensor conv3_bias{};
    Tensor conv3_out{};

    Tensor conv4_weight{};
    Tensor conv4_bias{};
    Tensor conv4_out{};

    Tensor conv5_weight{};
    Tensor conv5_bias{};
    Tensor conv5_out{};

    Tensor sk2_conv_weight{};
    Tensor sk2_conv_bias{};
    Tensor sk2_conv_out{};

    //Allocator
    Allocator allocator{};

    //Memory groups
    std::unique_ptr<MemoryGroup> memory_group0{};
    std::unique_ptr<MemoryGroup> memory_group1{};

    //Layers
    std::unique_ptr<NEConvolutionLayer> conv1_main_branch{};
    std::unique_ptr<NEConvolutionLayer> conv2_main_branch{};
    std::unique_ptr<NEConvolutionLayer> conv3_main_branch{};
    std::unique_ptr<NEConvolutionLayer> conv4_main_branch{};
    std::unique_ptr<NEConvolutionLayer> conv5_main_branch{};
    std::unique_ptr<NEConvolutionLayer> sk2_conv{};
    NEArithmeticAddition                sk1_add{};
    NEArithmeticAddition                sk2_add{};
};

int main(int argc, char **argv)
{   
    std::cout << "start testing...\n";

    return utils::run_example<NEON_WDSR_FP32_Example>(argc, argv);
}

The wdsr_540p_fp32.tflite file is also attached in case you want to benchmark the performance with TFLite:

https://drive.google.com/file/d/1aL9sq8oDUKPu-lZqq5we7s7Vnf0enPnt/view?usp=sharing

Thanks for your patience!

GGGGxxxxxxxxr commented 1 year ago

Hi @morgolock, could you please check this issue? I just want to make sure there is nothing wrong with my settings or the benchmarking I have conducted.

morgolock commented 1 year ago

Hi @GGGGxxxxxxxxr

I'd suggest you try ArmNN's ExecuteNetwork to assess the performance of this model.

See below for the inference time with ACL:

root@acl_hikey_9:~/tmp/user/armnn/2302# LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./ExecuteNetwork -c CpuAcc -m ../../tflite_models/wdsr_960.tflite -N --iterations=5 |grep Inference
Info: Inference time: 400.57 ms
Info: Inference time: 315.57 ms
Info: Inference time: 318.07 ms
Info: Inference time: 318.24 ms
Info: Inference time: 319.84 ms

Versus the inference time using the tflite runtime:

root@acl_hikey_9:~/tmp/user/armnn/2302# LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./ExecuteNetwork -c CpuAcc -m ../../tflite_models/wdsr_960.tflite -N --iterations=5 -T tflite |grep Inference
Info: Inference time: 525.38 ms
Info: Inference time: 516.27 ms
Info: Inference time: 526.44 ms
Info: Inference time: 525.00 ms
Info: Inference time: 552.10 ms

In general, I would recommend using ExecuteNetwork when you have a .tflite model rather than implementing the model by hand using ACL operators.
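For a like-for-like comparison with your single-threaded numbers, you can also pass --number-of-threads=1, roughly like this (paths are placeholders):

LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./ExecuteNetwork -c CpuAcc -m ../../tflite_models/wdsr_960.tflite -N --iterations=5 --number-of-threads=1 |grep Inference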

Hope this helps.

GGGGxxxxxxxxr commented 1 year ago

Hi @morgolock,

Thanks for your advice!

I have tried the pre-built ArmNN binaries on my Android device. Here is the command I used for the benchmark:

./ExecuteNetwork -c CpuAcc -m /data/local/tmp/wdsr_960.tflite -N --iterations=10 --number-of-threads=1

This is my result:

[screenshot: ArmNN ExecuteNetwork benchmark output]

With one thread, it takes about 240 ms to execute this wdsr_960.tflite model.

I have also tried the command with "-T" (which enables the TFLite runtime); the result is worse than ArmNN. Here is the benchmark with "-T":

[screenshot: ExecuteNetwork benchmark output with the TFLite runtime]

However, that TFLite inference speed seems quite different from the result produced by the official TFLite benchmark tool.

Here is the link to the official TFLite benchmark tool: https://www.tensorflow.org/lite/performance/measurement

Thanks again!

morgolock commented 1 year ago

Hi @GGGGxxxxxxxxr

However, that TFLite inference speed seems quite different from the result produced by the official TFLite benchmark tool.

The difference you see may be because we are using different tflite versions. I used tflite v2.10; which version of tflite did you use? How big is the difference?

GGGGxxxxxxxxr commented 1 year ago

Hi @morgolock

I have found out the cause of this issue.

In TensorFlow Lite, if XNNPACK is enabled, CPU performance increases dramatically. If I disable XNNPACK in TensorFlow Lite, the performance is worse than with ArmNN.

It seems that XNNPACK provides a highly optimised backend implementation for the operators.

It gives a speedup of more than 50%: from 220 ms to 90 ms with one thread on the same model.

You could try the TensorFlow Lite benchmark tool with the TFLite model, setting --use_xnnpack=true / false; the difference is easy to see.
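For example, roughly like this with the official benchmark_model tool (binary and model paths are placeholders):

# XNNPACK enabled
./benchmark_model --graph=/data/local/tmp/wdsr_540p_fp32.tflite --num_threads=1 --use_xnnpack=true

# XNNPACK disabled, falling back to the default TFLite CPU kernels
./benchmark_model --graph=/data/local/tmp/wdsr_540p_fp32.tflite --num_threads=1 --use_xnnpack=false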

Thanks!

morgolock commented 1 year ago

Hi @GGGGxxxxxxxxr

I tried the tflite benchmark tool and compared the XNNPACK and ArmNN delegates.

XNNPACK avg=67570.9 us, ArmNN avg=33329.7 us

The ArmNN delegate is two times faster than XNNPACK for this specific use case.

Please see below

$ LD_LIBRARY_PATH=./armnn/main/:$LD_LIBRARY_PATH ./linux_aarch64_benchmark_model --graph=./wdsr_960.tflite --num_threads=4 --num_runs=120  --warmup_runs=1 
STARTING!
Log parameter values verbosely: [0]
Min num runs: [120]
Num threads: [4]
Min warmup runs: [1]
Graph: [./wdsr_960.tflite]
#threads used for CPU inference: [4]
Loaded model ./wdsr_960.tflite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
The input model file size (MB): 0.011828
Initialized session in 7.739ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=13 first=89785 curr=35260 min=33983 max=89785 avg=39316.2 std=14580

Running benchmark for at least 120 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=120 first=35154 curr=94375 min=34827 max=137872 avg=67570.9 std=28514

Inference timings in us: Init: 7739, First inference: 89785, Warmup (avg): 39316.2, Inference (avg): 67570.9
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=3.125 overall=387.75

And below ArmNN

$ LD_LIBRARY_PATH=./armnn/main/:$LD_LIBRARY_PATH ./linux_aarch64_benchmark_model --graph=./wdsr_960.tflite --num_threads=4 --num_runs=120 --warmup_runs=1 --external_delegate_path="armnn/main/libarmnnDelegate.so" --external_delegate_options="backends:CpuAcc"
STARTING!
Log parameter values verbosely: [0]
Min num runs: [120]
Num threads: [4]
Min warmup runs: [1]
Graph: [./wdsr_960.tflite]
#threads used for CPU inference: [4]
External delegate path: [armnn/main/libarmnnDelegate.so]
External delegate options: [backends:CpuAcc]
Loaded model ./wdsr_960.tflite
Can't load libOpenCL.so: libOpenCL.so: cannot open shared object file: No such file or directory
Can't load libGLES_mali.so: libGLES_mali.so: cannot open shared object file: No such file or directory
Can't load libmali.so: libmali.so: cannot open shared object file: No such file or directory
Couldn't find any OpenCL library.
INFO: TfLiteArmnnDelegate: Created TfLite ArmNN delegate.
EXTERNAL delegate created.
Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 0.011828
Initialized session in 31.93ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=23 first=40037 curr=26812 min=18118 max=40037 avg=22013.8 std=4449

Running benchmark for at least 120 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=120 first=25867 curr=37586 min=24979 max=46407 avg=33329.7 std=2946

Inference timings in us: Init: 31930, First inference: 40037, Warmup (avg): 22013.8, Inference (avg): 33329.7
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=74.5625 overall=404.5

Hope this helps