ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.

How to use different thread settings in different operators? #1038

Closed GGGGxxxxxxxxr closed 1 year ago

GGGGxxxxxxxxr commented 1 year ago

I am using the Arm Compute Library on an Android CPU with NEON.

I have implemented a custom Gemm_int32_to_int8 operator. The model as a whole performs best with 4 threads, but this operator has a small workload, so running it multi-threaded actually hurts its performance.

Is there a way to set a different number of threads for different operators?

Currently I use NEScheduler::get().set_num_threads() to set the thread count for the whole model.
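
For reference, here is how I currently set it globally (a one-line sketch, matching the test code further down):

NEScheduler::get().set_num_threads(4); // global: applies to every operator in the model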

morgolock commented 1 year ago

Hi @GGGGxxxxxxxxr

I think the best option is to override the kernel's virtual bool is_parallelisable() const method to return false. This will make the scheduler execute the kernel using just one thread.

https://github.com/ARM-software/ComputeLibrary/blob/main/arm_compute/core/IKernel.h#L49
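
For illustration, a minimal sketch of such an override for a custom kernel (MyGemmInt8Kernel and its empty run body are placeholders, not library code):

#include "arm_compute/core/CPP/ICPPKernel.h"

using namespace arm_compute;

// Hypothetical custom kernel: returning false from is_parallelisable()
// makes the scheduler execute this kernel on a single thread, regardless
// of the global NEScheduler thread count.
class MyGemmInt8Kernel : public ICPPKernel
{
public:
    const char *name() const override
    {
        return "MyGemmInt8Kernel";
    }
    bool is_parallelisable() const override
    {
        return false;
    }
    void run(const Window &window, const ThreadInfo &info) override
    {
        (void)window;
        (void)info;
        // ... single-threaded kernel body goes here ...
    }
};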

Calling NEScheduler::get().set_num_threads() after the operators have been configured may cause problems at runtime.

Hope this helps.

GGGGxxxxxxxxr commented 1 year ago

Hi @morgolock ,

Thanks for your reply!

I have already tried this in my code. When I mark the specific kernel as non-parallelisable, that kernel does run on a single thread, but the other operators, which should still run on 4 threads, show very unstable performance compared with running the whole model on 4 threads for every operator.

It seems that switching between single-threaded and multi-threaded execution incurs some unexpected overhead.

Would you please check this issue?

GGGGxxxxxxxxr commented 1 year ago

My testing was conducted on the Android platform with a Samsung S10.

GGGGxxxxxxxxr commented 1 year ago

The problem can be easily reproduced by making CpuAddKernel (the backend kernel of NEArithmeticAddition) non-parallelisable: create a simple model with 3 NEGEMM layers and one NEArithmeticAddition layer, and set NEScheduler::get().set_num_threads(4).

The execution time is much longer than for the same model with the default (parallelisable) CpuAddKernel setting on 4 threads.

GGGGxxxxxxxxr commented 1 year ago

I have attached a simple test program that shows this unstable performance when switching between thread counts.

#include "arm_compute/runtime/NEON/NEScheduler.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/Allocator.h"
#include "arm_compute/runtime/BlobLifetimeManager.h"
#include "arm_compute/runtime/MemoryManagerOnDemand.h"
#include "arm_compute/runtime/PoolManager.h"
#include "utils/Utils.h"
#include "support/ToolchainSupport.h"
#include "src/core/NEON/NEMath.h"
#include "src/core/NEON/wrapper/intrinsics/intrinsics.h"
#include "utils/ImageLoader.h"
#include <arm_neon.h>

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <sstream>

#include <unordered_map>
#include <utility>

using namespace arm_compute;
int main()
{   
    std::cout << "\n\nBaseline test for GEMM...\n";

    NEScheduler::get().set_num_threads(4);

    TensorShape matrixA_shape(27, 518400);
    TensorShape matrixB_shape(4, 27);
    TensorShape matrixDST_shape(4, 518400);

    Tensor fp_src_nhwc;
    Tensor fp_weight_nhwc;
    Tensor fp_out_nhwc;
    fp_src_nhwc.allocator()->init(TensorInfo(matrixA_shape, 1, DataType::F32, DataLayout::NHWC));
    fp_weight_nhwc.allocator()->init(TensorInfo(matrixB_shape, 1, DataType::F32, DataLayout::NHWC));
    fp_out_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));

    Tensor fp2_src_nhwc;
    Tensor fp2_weight_nhwc;
    Tensor fp2_out_nhwc;
    fp2_src_nhwc.allocator()->init(TensorInfo(matrixA_shape, 1, DataType::F32, DataLayout::NHWC));
    fp2_weight_nhwc.allocator()->init(TensorInfo(matrixB_shape, 1, DataType::F32, DataLayout::NHWC));
    fp2_out_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));

    Tensor fp3_src_nhwc;
    Tensor fp3_weight_nhwc;
    Tensor fp3_out_nhwc;
    fp3_src_nhwc.allocator()->init(TensorInfo(matrixA_shape, 1, DataType::F32, DataLayout::NHWC));
    fp3_weight_nhwc.allocator()->init(TensorInfo(matrixB_shape, 1, DataType::F32, DataLayout::NHWC));
    fp3_out_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));

    Tensor fp4_src_nhwc;
    Tensor fp4_weight_nhwc;
    Tensor fp4_out_nhwc;
    fp4_src_nhwc.allocator()->init(TensorInfo(matrixA_shape, 1, DataType::F32, DataLayout::NHWC));
    fp4_weight_nhwc.allocator()->init(TensorInfo(matrixB_shape, 1, DataType::F32, DataLayout::NHWC));
    fp4_out_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));

    Tensor fp_bias_nhwc, fp_add_out_nhwc;
    fp_bias_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));
    fp_add_out_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));

    NEGEMM fp_nhwc;
    fp_nhwc.configure(&fp_src_nhwc, &fp_weight_nhwc, nullptr, &fp_out_nhwc, 1, 0);
    NEGEMM fp_nhwc_2;
    fp_nhwc_2.configure(&fp2_src_nhwc, &fp2_weight_nhwc, nullptr, &fp2_out_nhwc, 1, 0);
    NEGEMM fp_nhwc_3;
    fp_nhwc_3.configure(&fp3_src_nhwc, &fp3_weight_nhwc, nullptr, &fp3_out_nhwc, 1, 0);
    NEGEMM fp_nhwc_4;
    fp_nhwc_4.configure(&fp4_src_nhwc, &fp4_weight_nhwc, nullptr, &fp4_out_nhwc, 1, 0);

    NEArithmeticAddition fp_add;
    fp_add.configure(&fp_out_nhwc, &fp_bias_nhwc, &fp_add_out_nhwc, ConvertPolicy::SATURATE);

    //tensor allocation
    fp_src_nhwc.allocator()->allocate();
    fp_weight_nhwc.allocator()->allocate();
    fp_out_nhwc.allocator()->allocate();
    fp_bias_nhwc.allocator()->allocate();
    fp_add_out_nhwc.allocator()->allocate();
    fp2_src_nhwc.allocator()->allocate();
    fp2_weight_nhwc.allocator()->allocate();
    fp2_out_nhwc.allocator()->allocate();
    fp3_src_nhwc.allocator()->allocate();
    fp3_weight_nhwc.allocator()->allocate();
    fp3_out_nhwc.allocator()->allocate();
    fp4_src_nhwc.allocator()->allocate();
    fp4_weight_nhwc.allocator()->allocate();
    fp4_out_nhwc.allocator()->allocate();

    // Timed loop: measures the whole mini-model per iteration, with an
    // extra inner timer around fp_nhwc_2 below.

    int test_loop = 50;
    auto start = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < test_loop; i++)
    {
        fp_nhwc.run();

        //fp_add.run();
        // Inner timer around fp_nhwc_2 only; this is where the unstable
        // timings show up when CpuAddKernel is made non-parallelisable.
        auto start2 = std::chrono::high_resolution_clock::now();
        fp_nhwc_2.run();
        auto end2 = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end2 - start2;
        std::cout << "time cost is: " << diff.count() * 1000 << "ms\n";
        //fp_add.run();

        fp_nhwc_3.run();

        //fp_add.run();

        fp_nhwc_4.run();

        fp_add.run();
        fp_add.run();
        fp_add.run();

    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> diff = end - start;
    std::cout << "average time per iteration for the simple model is: " << diff.count() * 1000 / test_loop << "ms\n";

    return 0;

}

In CpuAddKernel.cpp, I have changed bool is_parallelisable() to return false.

I have used the following library build command for this:

scons Werror=0 -j8 debug=0 neon=1 opencl=0 benchmark_examples=0 os=android arch=arm64-v8a

If you run this code, you can see from the per-iteration execution time of fp_nhwc_2.run() that the performance is quite unstable compared with the default setting where CpuAddKernel is parallelisable.

morgolock commented 1 year ago

Hi @GGGGxxxxxxxxr

You could try building the library with cppthreads=0 openmp=1 to enable the OpenMP scheduler.
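
For example, based on the build command you posted above:

scons Werror=0 -j8 debug=0 neon=1 opencl=0 benchmark_examples=0 os=android arch=arm64-v8a cppthreads=0 openmp=1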

Hope this helps.