ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

Performance difference between graph example and non-graph example #353

Closed: iam10010 closed this issue 6 years ago

iam10010 commented 6 years ago

I implemented a simple network in two versions, following the graph example and the neon_cnn example. Checking the computation times, I found that the version based on the neon_cnn example was about two times slower than the graph API version. I guess threading could be the reason, but I am not sure. What am I missing?

Many thanks.

GeorgeARM commented 6 years ago

Hello @ymbaek, I am not quite sure why this happens or how you measure the execution time in each case. If you provide the code of both examples, we can have a look.

iam10010 commented 6 years ago

Hello @GeorgeARM, thank you for helping. Below is my graph example code. I used ACL v18.01.

...

graph << target_hint
           << convolution_hint
           << Tensor(TensorInfo(TensorShape(288U, 288U, 3U, 1U), 1, DataType::F32), DummyAccessor())
           << ConvolutionMethodHint::DIRECT
           // Layer 1
           << ConvolutionLayer(
                  3U, 3U, 8U,
                  get_weights_accessor(data_path, "whatever.npy"),
                  get_weights_accessor(data_path, "whatever.npy"),
                  PadStrideInfo(1, 1, 1, 1))
              << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))
              << PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2, 2, 0, 0)))
              // Layer 2
              << ConvolutionLayer(
                  3U, 3U, 16U,
                  get_weights_accessor(data_path, "whatever.npy"),
                  get_weights_accessor(data_path, "whatever.npy"),
                  PadStrideInfo(1, 1, 1, 1))
              << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))
              << PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2, 2, 0, 0)))
               // Layer 3
              << ConvolutionLayer(
                  3U, 3U, 32U,
                  get_weights_accessor(data_path, "whatever.npy"),
                  get_weights_accessor(data_path, "whatever.npy"),
                  PadStrideInfo(1, 1, 1, 1))
              << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))
              << PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2, 2, 0, 0)))
              // Layer 4
              << ConvolutionLayer(
                  3U, 3U, 64U,
                  get_weights_accessor(data_path, "whatever.npy"),
                  get_weights_accessor(data_path, "whatever.npy"),
                  PadStrideInfo(1, 1, 1, 1))
              << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))
              << PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2, 2, 0, 0)))
              // Layer 5
              << ConvolutionLayer(
                  3U, 3U, 128U,
                  get_weights_accessor(data_path, "whatever.npy"),
                  get_weights_accessor(data_path, "whatever.npy"),
                  PadStrideInfo(1, 1, 1, 1))
              << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))
              << PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2, 2, 0, 0)))
              // Layer 6
              << ConvolutionLayer(
                  3U, 3U, 256U,
                  get_weights_accessor(data_path, "whatever.npy"),
                  get_weights_accessor(data_path, "whatever.npy"),
                  PadStrideInfo(1, 1, 1, 1))
              << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))
             // << PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(1, 1, 0, 0)))
              // Layer 7
              << ConvolutionLayer(
                  3U, 3U, 512U,
                  get_weights_accessor(data_path, "whatever.npy"),
                  get_weights_accessor(data_path, "whatever.npy"),
                  PadStrideInfo(1, 1, 1, 1))
              << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))              
              // Layer 8
              << ConvolutionLayer(
                  3U, 3U, 256U,
                  get_weights_accessor(data_path, "whatever.npy"),
                  get_weights_accessor(data_path, "whatever.npy"),
                  PadStrideInfo(1, 1, 1, 1))
              << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))
              // Layer 9
              << ConvolutionLayer(
                  1U, 1U, 30U,
                  get_weights_accessor(data_path, "whatever.npy"),
                  get_weights_accessor(data_path, "whatever.npy"),
                  PadStrideInfo(1, 1, 0, 0))
              << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))              
              << Tensor(DummyAccessor());
...
void do_run() override
{
        // Run graph
        double s, d;
        for(int i = 0; i < 10; i++){
            s = now_ms();
            graph.run();
            d = now_ms() - s;
            std::cout << d << "ms\n";
        }
    }

    static double now_ms(void){
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec*1000. + tv.tv_usec/1000.;
    } 

Here is my neon example code.

void do_setup(int argc, char **argv) override
    {
        ARM_COMPUTE_UNUSED(argc);
        ARM_COMPUTE_UNUSED(argv);

        // Create memory manager components
        // We need 2 memory managers: one for handling the tensors within the functions (mm_layers) and one for handling the input and output tensors of the functions (mm_transitions)
        auto lifetime_mgr0  = std::make_shared<BlobLifetimeManager>();                           // Create lifetime manager
        auto lifetime_mgr1  = std::make_shared<BlobLifetimeManager>();                           // Create lifetime manager
        auto pool_mgr0      = std::make_shared<PoolManager>();                                   // Create pool manager
        auto pool_mgr1      = std::make_shared<PoolManager>();                                   // Create pool manager
        auto mm_layers      = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr0, pool_mgr0); // Create the memory manager
        auto mm_transitions = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr1, pool_mgr1); // Create the memory manager

        // The weights and biases tensors should be initialized with the values inferred with the training

        // Set memory manager where allowed to manage internal memory requirements
        conv0   = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
        conv1   = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
        conv2   = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
        conv3   = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
        conv4   = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
        conv5   = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
        conv6   = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
        conv7   = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
        conv8   = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);

        /* [Initialize tensors] */

        // Initialize src tensor
        constexpr unsigned int width_src_image  = 288;
        constexpr unsigned int height_src_image = 288;
        constexpr unsigned int ifm_src_img      = 3;

        const TensorShape src_shape(width_src_image, height_src_image, ifm_src_img);
        src.allocator()->init(TensorInfo(src_shape, 1, DataType::F32));

        // Initialize tensors of conv0
        constexpr unsigned int kernel_x_conv0 = 3;
        constexpr unsigned int kernel_y_conv0 = 3;
        constexpr unsigned int ofm_conv0      = 8;

        const TensorShape weights_shape_conv0(kernel_x_conv0, kernel_y_conv0, src_shape.z(), ofm_conv0);
        const TensorShape biases_shape_conv0(weights_shape_conv0[3]);
        const TensorShape out_shape_conv0(src_shape.x(), src_shape.y(), weights_shape_conv0[3]);

        weights0.allocator()->init(TensorInfo(weights_shape_conv0, 1, DataType::F32));
        biases0.allocator()->init(TensorInfo(biases_shape_conv0, 1, DataType::F32));
        out_conv0.allocator()->init(TensorInfo(out_shape_conv0, 1, DataType::F32));

        // Initialize tensors of batch0
        const TensorShape fm_shape_batch0(out_shape_conv0.z());

        mean0.allocator()->init(TensorInfo(fm_shape_batch0, 1, DataType::F32));
        var0.allocator()->init(TensorInfo(fm_shape_batch0, 1, DataType::F32));
        gamma0.allocator()->init(TensorInfo(fm_shape_batch0, 1, DataType::F32));
        beta0.allocator()->init(TensorInfo(fm_shape_batch0, 1, DataType::F32));
        out_batch0.allocator()->init(TensorInfo(out_shape_conv0, 1, DataType::F32));        

        // Initialize tensor of act0
        out_act0.allocator()->init(TensorInfo(out_shape_conv0, 1, DataType::F32));

        // Initialize tensor of pool0
        TensorShape out_shape_pool0 = out_shape_conv0;
        out_shape_pool0.set(0, out_shape_pool0.x() / 2);
        out_shape_pool0.set(1, out_shape_pool0.y() / 2);
        out_pool0.allocator()->init(TensorInfo(out_shape_pool0, 1, DataType::F32));

        // Initialize tensors of conv1
        constexpr unsigned int kernel_x_conv1 = 3;
        constexpr unsigned int kernel_y_conv1 = 3;
        constexpr unsigned int ofm_conv1      = 16;

        const TensorShape weights_shape_conv1(kernel_x_conv1, kernel_y_conv1, out_shape_pool0.z(), ofm_conv1);

        const TensorShape biases_shape_conv1(weights_shape_conv1[3]);
        const TensorShape out_shape_conv1(out_shape_pool0.x(), out_shape_pool0.y(), weights_shape_conv1[3]);

        weights1.allocator()->init(TensorInfo(weights_shape_conv1, 1, DataType::F32));
        biases1.allocator()->init(TensorInfo(biases_shape_conv1, 1, DataType::F32));
        out_conv1.allocator()->init(TensorInfo(out_shape_conv1, 1, DataType::F32));

        // Initialize tensors of batch1
        const TensorShape fm_shape_batch1(out_shape_conv1.z());

        mean1.allocator()->init(TensorInfo(fm_shape_batch1, 1, DataType::F32));
        var1.allocator()->init(TensorInfo(fm_shape_batch1, 1, DataType::F32));
        gamma1.allocator()->init(TensorInfo(fm_shape_batch1, 1, DataType::F32));
        beta1.allocator()->init(TensorInfo(fm_shape_batch1, 1, DataType::F32));
        out_batch1.allocator()->init(TensorInfo(out_shape_conv1, 1, DataType::F32));        

        // Initialize tensor of act1
        out_act1.allocator()->init(TensorInfo(out_shape_conv1, 1, DataType::F32));

        // Initialize tensor of pool1
        TensorShape out_shape_pool1 = out_shape_conv1;
        out_shape_pool1.set(0, out_shape_pool1.x() / 2);
        out_shape_pool1.set(1, out_shape_pool1.y() / 2);
        out_pool1.allocator()->init(TensorInfo(out_shape_pool1, 1, DataType::F32));
...

        // Initialize tensors of conv8
        constexpr unsigned int kernel_x_conv8 = 1;
        constexpr unsigned int kernel_y_conv8 = 1;
        constexpr unsigned int ofm_conv8      = 30;

        const TensorShape weights_shape_conv8(kernel_x_conv8, kernel_y_conv8, out_shape_conv7.z(), ofm_conv8);
        const TensorShape biases_shape_conv8(weights_shape_conv8[3]);
        const TensorShape out_shape_conv8(out_shape_conv7.x(), out_shape_conv7.y(), weights_shape_conv8[3]);

        weights8.allocator()->init(TensorInfo(weights_shape_conv8, 1, DataType::F32));
        biases8.allocator()->init(TensorInfo(biases_shape_conv8, 1, DataType::F32));
        out_conv8.allocator()->init(TensorInfo(out_shape_conv8, 1, DataType::F32));

        // Initialize tensor of act8
        out_act8.allocator()->init(TensorInfo(out_shape_conv8, 1, DataType::F32));

        /* -----------------------End: [Initialize tensors] */

        /* [Configure functions] */

        // in:288x288x3: 3x3 convolution, 8 output features maps (OFM)
        conv0->configure(&src, &weights0, &biases0, &out_conv0, PadStrideInfo(1 /* stride_x */, 1 /* stride_y */, 1 /* pad_x */, 1 /* pad_y */));
        // in:288x288x8, out:288x288x8, Batch Normalization
        batch0.configure(&out_conv0, &out_batch0, &mean0, &var0, &beta0, &gamma0, 0.0001f);
        // in:288x288x8, out:288x288x8, Activation function: leaky relu
        act0.configure(&out_batch0, &out_act0, ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU)); //need to check Leaky relu speed
        // in:288x288x8, out:144x144x8 (2x2 pooling), Pool type function: Max
        pool0.configure(&out_act0, &out_pool0, PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2 /* stride_x */, 2 /* stride_y */)));

...
        // in:9x9x256: 1x1 convolution, 30 output features maps (OFM)
        conv8->configure(&out_act7, &weights8, &biases8, &out_conv8, PadStrideInfo(1, 1, 0, 0));        
        // in:9x9x30, out:9x9x30, Activation function: leaky relu
        act8.configure(&out_conv8, &out_act8, ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU));

        /* -----------------------End: [Configure functions] */

        /*[ Add tensors to memory manager ]*/

        // We need 2 memory groups for handling the input and output
        // We explicitly call allocate() after manage() in order to avoid overlapping lifetimes
        memory_group0 = arm_compute::support::cpp14::make_unique<MemoryGroup>(mm_transitions);
        memory_group1 = arm_compute::support::cpp14::make_unique<MemoryGroup>(mm_transitions);

        memory_group0->manage(&out_conv0);
        out_conv0.allocator()->allocate();
        memory_group1->manage(&out_batch0);
        out_batch0.allocator()->allocate();
        memory_group0->manage(&out_act0);
        out_act0.allocator()->allocate();
        memory_group1->manage(&out_pool0);
        out_pool0.allocator()->allocate();
...
        memory_group1->manage(&out_conv8);
        out_conv8.allocator()->allocate();
        memory_group0->manage(&out_act8);
        out_act8.allocator()->allocate();

        /* -----------------------End: [ Add tensors to memory manager ] */

        /* [Allocate tensors] */

        // Now that the padding requirements are known we can allocate all tensors
        src.allocator()->allocate();
        weights0.allocator()->allocate(); biases0.allocator()->allocate();
...

        weights8.allocator()->allocate(); biases8.allocator()->allocate();

        mean0.allocator()->allocate(); var0.allocator()->allocate(); beta0.allocator()->allocate(); gamma0.allocator()->allocate();
...

        mean7.allocator()->allocate(); var7.allocator()->allocate(); beta7.allocator()->allocate(); gamma7.allocator()->allocate();

        /* -----------------------End: [Allocate tensors] */        

        // Finalize layers memory manager

        // Set allocator that the memory manager will use
        mm_layers->set_allocator(&allocator);

        // Number of pools that the manager will create. This specifies how many layers you want to run in parallel
        mm_layers->set_num_pools(1);

        // Finalize the manager. (Validity checks, memory allocations etc)
        mm_layers->finalize();

        // Finalize transitions memory manager

        // Set allocator that the memory manager will use
        mm_transitions->set_allocator(&allocator);

        // Number of pools that the manager will create. This specifies how many models we can run in parallel.
        // Setting to 2 as we need one for the input and one for the output at any given time
        mm_transitions->set_num_pools(2);

        // Finalize the manager. (Validity checks, memory allocations etc)
        mm_transitions->finalize();

    }
    void do_run() override
    {
        // Acquire memory for the memory groups
        memory_group0->acquire();
        memory_group1->acquire();

        for(int i = 0; i < 10; i++){
            double start = now_ms();

            conv0->run();
            batch0.run();        
            act0.run();
            pool0.run();

            conv1->run();
            batch1.run();        
            act1.run();
            pool1.run();

            conv2->run();
            batch2.run();        
            act2.run();
            pool2.run();

            conv3->run();
            batch3.run();        
            act3.run();
            pool3.run();

            conv4->run();
            batch4.run();        
            act4.run();
            pool4.run();

            conv5->run();
            batch5.run();
            act5.run();

            conv6->run();
            batch6.run();
            act6.run();

            conv7->run();
            batch7.run();
            act7.run();

            conv8->run();        
            act8.run();

            double duration = now_ms() - start;

            std::cout << duration << "ms\n";

        }

        // Release memory
        memory_group0->release();
        memory_group1->release();
    }

The outputs on my device are below. Graph example output:

186.173ms
166.912ms
0.00317383ms
170.273ms
166.14ms
0.00292969ms
169.4ms
167.487ms
0.00219727ms
172.424ms

About the very small computation times, I read #311. In v17.12 I modified the code following the suggestion there, but now I am using v18.01 and haven't modified anything.

Neon example output:

312.414ms
203.059ms
201.16ms
200.998ms
231.294ms
215.6ms
207.837ms
208.793ms
210.99ms
208.257ms

Thanks.

GeorgeARM commented 6 years ago

Hello @ymbaek,

In the graph API you specify << ConvolutionMethodHint::DIRECT; this will essentially use the DirectConvolution function where possible (it supports 1x1, 3x3 and 5x5 convolutions) to execute the convolution instead of using the GEMM approach. On the other hand, in your second example you explicitly use NEConvolutionLayer, which is GEMM-based. Can you align the two programs to use the same convolution functions?
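As a rough sketch (untested; it reuses the member and tensor names from your snippet, and assumes the direct convolution supports your 3x3, stride 1, pad 1 configuration), the first convolution in the non-graph version could be switched to the direct variant like this:

// Rough sketch: use NEDirectConvolutionLayer instead of the GEMM-based
// NEConvolutionLayer so both programs run the same implementation.
// src, weights0, biases0, out_conv0 and mm_layers come from the snippet above.
#include "arm_compute/runtime/NEON/functions/NEDirectConvolutionLayer.h"

// Member declaration, next to the other layer functions:
std::unique_ptr<NEDirectConvolutionLayer> conv0;

// In do_setup(), construct and configure it like the GEMM version:
conv0 = arm_compute::support::cpp14::make_unique<NEDirectConvolutionLayer>(mm_layers);
conv0->configure(&src, &weights0, &biases0, &out_conv0,
                 PadStrideInfo(1 /* stride_x */, 1 /* stride_y */, 1 /* pad_x */, 1 /* pad_y */));

// do_run() is unchanged: conv0->run();

Alternatively, dropping the << ConvolutionMethodHint::DIRECT line from the graph version would align both programs on the GEMM path instead.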

iam10010 commented 6 years ago

Hello @GeorgeARM,

You're right. Thank you very much for your help. I've aligned my programs to use the same convolution functions as you advised, and now I get similar performance.

Using the DirectConvolution function, I got about 180 ms average elapsed time (10 iterations) for both of my examples, GraphEx and Non-GraphEx. Using the GEMM approach, I got about 205 ms for both examples.

Many thanks.