ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.

How to use different thread settings in different operators? #1038

Closed GGGGxxxxxxxxr closed 1 year ago

GGGGxxxxxxxxr commented 1 year ago

I have used the ArmComputeLibrary on Android CPU with NEON.

I have implemented a custom Gemm_int32_to_int8 operator. The model as a whole performs best with 4 threads, but this operator has very little workload, so running it with multiple threads actually decreases its performance.

Is it possible to set a different number of threads for different operators?

Currently I use NEScheduler::get().set_num_threads() to set the threading for the whole model.

morgolock commented 1 year ago

Hi @GGGGxxxxxxxxr

I think the best option is to override the kernel's virtual bool is_parallelisable() const method to return false. This will make the scheduler execute the kernel using just one thread.

https://github.com/ARM-software/ComputeLibrary/blob/main/arm_compute/core/IKernel.h#L49
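
A minimal sketch of the override (the class name and run() body here are hypothetical; a real kernel implements the full kernel interface):

#include "arm_compute/core/CPP/ICPPKernel.h"
#include "arm_compute/core/Error.h"

// Hypothetical kernel that opts out of multi-threaded scheduling.
class MySingleThreadKernel : public arm_compute::ICPPKernel
{
public:
    const char *name() const override
    {
        return "MySingleThreadKernel";
    }

    void run(const arm_compute::Window &window, const arm_compute::ThreadInfo &info) override
    {
        ARM_COMPUTE_UNUSED(window, info);
        // ... kernel body operating on the full execution window ...
    }

    // Returning false tells the scheduler to run this kernel on a single
    // thread instead of splitting its window across the thread pool.
    bool is_parallelisable() const override
    {
        return false;
    }
};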

Calling NEScheduler::get().set_num_threads() after the operators have been configured may cause problems at runtime.

Hope this helps.

GGGGxxxxxxxxr commented 1 year ago

Hi @morgolock ,

Thanks for your reply!

I have already tried this in my code. When I mark the specific kernel as non-parallelisable, that kernel does execute on a single thread, but the other operators, which should run on 4 threads, show very unstable performance compared with running the whole model on 4 threads for every operator.

It seems that switching between single-threaded and multi-threaded execution incurs some unexpected overhead.

Could you please look into this issue?

GGGGxxxxxxxxr commented 1 year ago

My testing was conducted on the Android platform with a Samsung S10.

GGGGxxxxxxxxr commented 1 year ago

The problem can easily be reproduced if the CpuAdd kernel backing NEArithmeticAddition is set as non-parallelisable. Create a simple model with 3 NEGEMM layers and one NEArithmeticAddition layer, and set NEScheduler::get().set_num_threads(4).

The execution time is much longer than for the same model with the default (parallelisable) CpuAddKernel setting on 4 threads.

GGGGxxxxxxxxr commented 1 year ago

I have attached a very simple test that reproduces this kind of unstable performance when switching between thread settings.

#include "arm_compute/runtime/NEON/NEScheduler.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/Allocator.h"
#include "arm_compute/runtime/BlobLifetimeManager.h"
#include "arm_compute/runtime/MemoryManagerOnDemand.h"
#include "arm_compute/runtime/PoolManager.h"
#include "utils/Utils.h"
#include "support/ToolchainSupport.h"
#include "src/core/NEON/NEMath.h"
#include "src/core/NEON/wrapper/intrinsics/intrinsics.h"
#include "utils/ImageLoader.h"
#include <arm_neon.h>

#include <chrono>   // std::chrono timing used below
#include <cstdlib>
#include <iostream> // std::cout
#include <sstream>
#include <time.h>

#include <unordered_map>
#include <utility>

using namespace arm_compute;
int main()
{   
    std::cout<<"\n\nBaselineTest for Gemm Test...\n";

    NEScheduler::get().set_num_threads(4);

    TensorShape matrixA_shape(27, 518400);
    TensorShape matrixB_shape(4, 27);
    TensorShape matrixDST_shape(4, 518400);

    Tensor fp_src_nhwc;
    Tensor fp_weight_nhwc;
    Tensor fp_out_nhwc;
    fp_src_nhwc.allocator()->init(TensorInfo(matrixA_shape, 1, DataType::F32, DataLayout::NHWC));
    fp_weight_nhwc.allocator()->init(TensorInfo(matrixB_shape, 1, DataType::F32, DataLayout::NHWC));
    fp_out_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));

    Tensor fp2_src_nhwc;
    Tensor fp2_weight_nhwc;
    Tensor fp2_out_nhwc;
    fp2_src_nhwc.allocator()->init(TensorInfo(matrixA_shape, 1, DataType::F32, DataLayout::NHWC));
    fp2_weight_nhwc.allocator()->init(TensorInfo(matrixB_shape, 1, DataType::F32, DataLayout::NHWC));
    fp2_out_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));

    Tensor fp3_src_nhwc;
    Tensor fp3_weight_nhwc;
    Tensor fp3_out_nhwc;
    fp3_src_nhwc.allocator()->init(TensorInfo(matrixA_shape, 1, DataType::F32, DataLayout::NHWC));
    fp3_weight_nhwc.allocator()->init(TensorInfo(matrixB_shape, 1, DataType::F32, DataLayout::NHWC));
    fp3_out_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));

    Tensor fp4_src_nhwc;
    Tensor fp4_weight_nhwc;
    Tensor fp4_out_nhwc;
    fp4_src_nhwc.allocator()->init(TensorInfo(matrixA_shape, 1, DataType::F32, DataLayout::NHWC));
    fp4_weight_nhwc.allocator()->init(TensorInfo(matrixB_shape, 1, DataType::F32, DataLayout::NHWC));
    fp4_out_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));

    Tensor fp_bias_nhwc, fp_add_out_nhwc;
    fp_bias_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));
    fp_add_out_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));

    NEGEMM fp_nhwc;
    fp_nhwc.configure(&fp_src_nhwc, &fp_weight_nhwc, nullptr, &fp_out_nhwc, 1, 0);
    NEGEMM fp_nhwc_2;
    fp_nhwc_2.configure(&fp2_src_nhwc, &fp2_weight_nhwc, nullptr, &fp2_out_nhwc, 1, 0);
    NEGEMM fp_nhwc_3;
    fp_nhwc_3.configure(&fp3_src_nhwc, &fp3_weight_nhwc, nullptr, &fp3_out_nhwc, 1, 0);
    NEGEMM fp_nhwc_4;
    fp_nhwc_4.configure(&fp4_src_nhwc, &fp4_weight_nhwc, nullptr, &fp4_out_nhwc, 1, 0);

    NEArithmeticAddition fp_add;
    fp_add.configure(&fp_out_nhwc, &fp_bias_nhwc, &fp_add_out_nhwc, ConvertPolicy::SATURATE);

    //tensor allocation
    fp_src_nhwc.allocator()->allocate();
    fp_weight_nhwc.allocator()->allocate();
    fp_out_nhwc.allocator()->allocate();
    fp_bias_nhwc.allocator()->allocate();
    fp_add_out_nhwc.allocator()->allocate();
    fp2_src_nhwc.allocator()->allocate();
    fp2_weight_nhwc.allocator()->allocate();
    fp2_out_nhwc.allocator()->allocate();
    fp3_src_nhwc.allocator()->allocate();
    fp3_weight_nhwc.allocator()->allocate();
    fp3_out_nhwc.allocator()->allocate();
    fp4_src_nhwc.allocator()->allocate();
    fp4_weight_nhwc.allocator()->allocate();
    fp4_out_nhwc.allocator()->allocate();

    // Run the simple model for test_loop iterations and time it.

    int test_loop = 50;
    auto start = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < test_loop; i++)
    {
        fp_nhwc.run();

        //fp_add.run();

        // Time fp_nhwc_2 on its own; its latency becomes unstable once
        // CpuAddKernel is forced to run single-threaded.
        auto gemm2_start = std::chrono::high_resolution_clock::now();
        fp_nhwc_2.run();
        auto gemm2_end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> gemm2_diff = gemm2_end - gemm2_start;
        std::cout << "time cost is: " << gemm2_diff.count() * 1000 << "ms\n";
        //fp_add.run();

        fp_nhwc_3.run();

        //fp_add.run();

        fp_nhwc_4.run();

        fp_add.run();
        fp_add.run();
        fp_add.run();
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> diff = end - start;
    std::cout<<"total time cost for the simple model is: "<<diff.count() * 1000 / test_loop <<"ms\n";

    return 0;

}

In CpuAddKernel.cpp, I adjusted bool is_parallelisable() to return false.
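
Concretely, the change amounts to something like this (a sketch; the exact class layout depends on the library version, and the override may also need to be declared in CpuAddKernel.h):

// In CpuAddKernel.cpp (sketch): opt this kernel out of multi-threading.
bool CpuAddKernel::is_parallelisable() const
{
    return false; // the IKernel default is true
}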

I used the following library build command:

scons Werror=0 -j8 debug=0 neon=1 opencl=0 benchmark_examples=0 os=android arch=arm64-v8a

If you run this code, you can see from the execution time of fp_nhwc_2.run() that the performance is quite unstable compared with the default setting, where CpuAddKernel is parallelisable.

morgolock commented 1 year ago

Hi @GGGGxxxxxxxxr

You could try building the library with cppthreads=0 openmp=1 to enable the OpenMP scheduler.
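
Adapting the build command you posted above, that would look something like this (untested sketch):

scons Werror=0 -j8 debug=0 neon=1 opencl=0 benchmark_examples=0 os=android arch=arm64-v8a cppthreads=0 openmp=1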

Hope this helps.