ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

ARM Convolution example, neon_cnn.cpp, taking too much time. #440

Closed: samashu007 closed this issue 5 years ago

samashu007 commented 6 years ago

Output of 'strings libarm_compute.so | grep arm_compute_version': arm_compute_version=v18.03 Build options: {'arch': 'armv7a', 'opencl': '0', 'neon': '1', 'examples': '1', 'asserts': '0', 'debug': '0', 'os': 'linux', 'Werror': '1'} Git hash=02c62c8030e7aca592b294396556a93c6bfb9f7a

Platform: Raspberry Pi. Operating System: Raspbian.

Problem description: I am testing the performance of Arm NEON for a 2x2x3x3 convolution kernel on a 5x5x3 image. I modified the neon_cnn.cpp file, setting the parameters as desired, and deleted the functions for conv1, act1, fc0 and softmax, as I was only interested in a simple 2x2 convolution on a 5x5x3 image. I am profiling the modified function for time measurement. Specifically, I timed the conv0->run() call, as it performs only the required convolution (the additions and multiplications). The conv0->run() call executes in a loop (number of iterations = 100) for testing purposes. The measured time is 23.237 ms. Isn't that too much? The same 2x2 convolution on a 5x5 image runs much faster without the NEON functions, i.e. as plain standalone C code. What could be the reasons for this discrepancy?

Here is the code (modified neon_cnn.cpp) which implements the convolution on ARM NEON:

#include "arm_compute/runtime/NEON/NEFunctions.h"

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/Allocator.h"
#include "arm_compute/runtime/BlobLifetimeManager.h"
#include "arm_compute/runtime/MemoryManagerOnDemand.h"
#include "arm_compute/runtime/PoolManager.h"
#include "utils/Utils.h"
#include <iostream>
#include <ctime>
#include <sys/time.h>

using namespace std;
using namespace arm_compute;
using namespace utils;

class NEONCNNExample : public Example
{
public:
    void do_setup(int argc, char **argv) override
    {
        ARM_COMPUTE_UNUSED(argc);
        ARM_COMPUTE_UNUSED(argv);

        // Create memory managers: one for the layers' internal tensors, one for the transition tensors
        auto lifetime_mgr0  = std::make_shared<BlobLifetimeManager>();
        auto lifetime_mgr1  = std::make_shared<BlobLifetimeManager>();
        auto pool_mgr0      = std::make_shared<PoolManager>();
        auto pool_mgr1      = std::make_shared<PoolManager>();
        auto mm_layers      = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr0, pool_mgr0);
        auto mm_transitions = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr1, pool_mgr1);

        conv0 = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);

        constexpr unsigned int width_src_image  = 5;
        constexpr unsigned int height_src_image = 5;
        constexpr unsigned int ifm_src_img      = 3;

        const TensorShape src_shape(width_src_image, height_src_image, ifm_src_img);
        src.allocator()->init(TensorInfo(src_shape, 1, DataType::F32));

        constexpr unsigned int kernel_x_conv0 = 2;
        constexpr unsigned int kernel_y_conv0 = 2;
        constexpr unsigned int ofm_conv0      = 3;

        const TensorShape weights_shape_conv0(kernel_x_conv0, kernel_y_conv0, src_shape.z(), ofm_conv0);
        const TensorShape biases_shape_conv0(weights_shape_conv0[3]);
        const TensorShape out_shape_conv0(src_shape.x(), src_shape.y(), weights_shape_conv0[3]);

        weights0.allocator()->init(TensorInfo(weights_shape_conv0, 1, DataType::F32));
        biases0.allocator()->init(TensorInfo(biases_shape_conv0, 1, DataType::F32));
        out_conv0.allocator()->init(TensorInfo(out_shape_conv0, 1, DataType::F32));

        conv0->configure(&src, &weights0, &biases0, &out_conv0, PadStrideInfo(1 /* stride_x */, 1 /* stride_y */, 1 /* pad_x */, 1 /* pad_y */));

        memory_group0 = arm_compute::support::cpp14::make_unique<MemoryGroup>(mm_transitions);
        memory_group1 = arm_compute::support::cpp14::make_unique<MemoryGroup>(mm_transitions);

        memory_group0->manage(&out_conv0);
        out_conv0.allocator()->allocate();

        src.allocator()->allocate();
        weights0.allocator()->allocate();
        biases0.allocator()->allocate();

        // Set allocators and finalise the memory managers
        mm_layers->set_allocator(&allocator);
        mm_layers->set_num_pools(1);
        mm_layers->finalize();

        mm_transitions->set_allocator(&allocator);
        mm_transitions->set_num_pools(2);
        mm_transitions->finalize();
    }
    void do_run() override
    {

        memory_group0->acquire();
        memory_group1->acquire();

        // Time 100 back-to-back runs of the convolution; the result is printed in microseconds
        struct timeval t_strt, t_end;
        gettimeofday(&t_strt, NULL);

        for(int i = 0; i < 100; i++)
        {
            conv0->run();
        }

        gettimeofday(&t_end, NULL);
        cout << ((t_end.tv_sec * 1000000 + t_end.tv_usec) - (t_strt.tv_sec * 1000000 + t_strt.tv_usec)) << endl;

        memory_group0->release();
        memory_group1->release();
    }

private:
    Tensor src{};
    Tensor weights0{};
    Tensor biases0{};
    Tensor out_conv0{};

    Allocator allocator{};

    std::unique_ptr<MemoryGroup> memory_group0{};
    std::unique_ptr<MemoryGroup> memory_group1{};

    std::unique_ptr<NEConvolutionLayer> conv0{};

};

int main(int argc, char **argv)
{
    return utils::run_example<NEONCNNExample>(argc, argv);
}

AnthonyBarbier commented 6 years ago

Hard to tell; how is your platform configured? (Which Raspberry Pi is it? Which CPUs are enabled? At what frequency?) Also, Raspbian might be using soft-float instead of hard-float, as it is compiled for armv6 rather than armv7. Finally, you might want to call conv0->run() once before you start timing: some weights reshaping happens during the first run(), which makes it a lot slower than the subsequent calls.
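
For illustration, a minimal sketch of such a warm-up in the do_run() method of the modified example above; the initial conv0->run() call is the only addition and is excluded from the measured interval:

        // Warm-up: the first run() performs the one-off weights reshaping,
        // so it is executed once before the timed loop starts.
        conv0->run();

        struct timeval t_strt, t_end;
        gettimeofday(&t_strt, NULL);
        for(int i = 0; i < 100; i++)
        {
            conv0->run(); // steady-state iterations only
        }
        gettimeofday(&t_end, NULL);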

samashu007 commented 6 years ago

Attaching the outputs of the following commands:

  1. cat /etc/debian_version
  2. cat /etc/os-release
  3. cat /proc/cpuinfo (outputs attached as cpu_info and os_info)

Also, the minimum and maximum clock frequencies are 600 MHz and 1200 MHz, respectively.

The arm_compute library was built with arch=armv7a.

Also, after executing conv0->run() once before timing, the results have improved a little. The original implementation (plain C) takes 11.2 ms for 100 iterations, while the Arm NEON implementation takes 30.58 ms for 100 iterations. Although the results have improved, the NEON version should not be slower than plain C.

samashu007 commented 6 years ago

For a 3x3x3x3 convolution on a 50x50x3 image, I am getting a reduction in time by a factor of 2: original implementation 1.001 s (100 iterations), NEON implementation 0.510 s (100 iterations). However, I expected the time to drop by a factor of 6 or 7 with the NEON implementation. Kindly suggest further steps.

samashu007 commented 6 years ago

Any suggestions? Please?

samashu007 commented 6 years ago

I modified the configure() function and forced it to choose the direct convolution method instead of GEMM. The results have improved drastically. For a 3x3x3x3 convolution on a 50x50x3 image: 0.043 seconds (NEON implementation) vs. 1.083 seconds (original C testbench code).

Is such a drastic improvement of 25x correct?
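
For reference, one way to request the direct kernel without patching configure() is to instantiate the direct-convolution function explicitly. A minimal sketch, assuming the same tensors and pad/stride as in the modified example above and that NEDirectConvolutionLayer supports this configuration in this release:

#include "arm_compute/runtime/NEON/functions/NEDirectConvolutionLayer.h"

// Direct (non-GEMM) convolution on the same tensors as the example above.
NEDirectConvolutionLayer direct_conv0{};
direct_conv0.configure(&src, &weights0, &biases0, &out_conv0,
                       PadStrideInfo(1 /* stride_x */, 1 /* stride_y */, 1 /* pad_x */, 1 /* pad_y */));

// ... allocate the tensors as before, then:
direct_conv0.run();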

morgolock commented 6 years ago

Hi @samashu007

I believe your input is too small to see the drastic improvement in performance you expect.

I think you'll see bigger improvements if you increase the number of channels to 256 for example and set the convolution method to GEMM.

Hope this helps.
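
As an illustration of that suggestion, the corresponding constants in the modified neon_cnn.cpp example above would become something like the following (values chosen purely for illustration):

// Hypothetical larger test case: 50x50 image, 256 input channels,
// 3x3 kernel and 256 output feature maps.
constexpr unsigned int width_src_image  = 50;
constexpr unsigned int height_src_image = 50;
constexpr unsigned int ifm_src_img      = 256;

constexpr unsigned int kernel_x_conv0 = 3;
constexpr unsigned int kernel_y_conv0 = 3;
constexpr unsigned int ofm_conv0      = 256;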

samashu007 commented 6 years ago

That worked, thanks. :) There is another issue: when I run the same conv0->run() benchmark, say, some 10 times, it gives different timings. In some runs the difference is not that significant, maybe a few microseconds, but in others the time changes by a factor of 2. Is this a bug? What could be the possible reasons?

morgolock commented 6 years ago

Hi @samashu007

  1. A 'power saving' mode might be enabled.
  2. System load (are there other processes running at the same time?)
  3. Thread scheduling issues (can you reproduce the problem when running the benchmark with a single thread?)

Hope this helps.

samashu007 commented 6 years ago

On the remote machine through which I am cross-compiling there are other processes running, but not on the Raspberry Pi itself. Also, I built ComputeLibrary using 'scons Werror=1 debug=0 asserts=0 neon=1 opencl=0 examples=1 build=native -j2'. Should I remove the '-j2' part to build it with a single thread?

samashu007 commented 6 years ago

Please reply?

morgolock commented 6 years ago

@samashu007

Building with -j2 is fine; I was referring to the number of threads the function uses at run-time when you execute the example. You can try experimenting with different numbers of threads when running the example and see what effect this has.

Also consider that the first run will be slower than the following ones, as there is some extra work reshaping the matrices.

Hope this helps.
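
For illustration, a minimal sketch of forcing the NEON run-time to a single thread before configuring and running the function; this assumes the Scheduler/IScheduler interface available in this release:

#include "arm_compute/runtime/Scheduler.h"

// Run all NEON functions single-threaded so that thread scheduling
// cannot influence the measured timings.
arm_compute::Scheduler::get().set_num_threads(1);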

GeorgeARM commented 6 years ago

@samashu007, as @morgolock suggested, check that your CPUs run at a fixed frequency and that your governor is set to, for example, performance. Otherwise your frequency might scale for a variety of reasons. You can find ways to do this online.

samashu007 commented 6 years ago

I tried that. The CPU runs at the maximum frequency of 1.2 GHz with GOVERNOR="performance". There is no improvement; the results still vary on every execution.

gmiodice commented 6 years ago

Hi @samashu007,

Have you tried measuring the execution time of the graph examples? For instance, what performance do you get running squeezenet or alexnet?

Thanks

morgolock commented 5 years ago

Hi @samashu007

Could you please try measuring the time taken by each individual call to conv0->run() (see the sketch below)? If there are big differences between calls, it is likely due to one of the reasons mentioned above: system load, power saving, or the thread scheduling policy.

The shapes in your test are too small to show big performance gains in ACL's NEON code; I'd suggest increasing the number of channels of the input tensor.
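
A minimal per-call timing sketch, based on the loop in do_run() above but using std::chrono so that each iteration is reported separately:

#include <chrono>
#include <iostream>

// Inside do_run(): time every call to conv0->run() individually.
for(int i = 0; i < 100; i++)
{
    const auto start = std::chrono::steady_clock::now();
    conv0->run();
    const auto stop  = std::chrono::steady_clock::now();
    std::cout << "run " << i << ": "
              << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
              << " us" << std::endl;
}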

I'll close this issue. Please create a new one if you still have performance questions.