Hi @GGGGxxxxxxxxr
I think the best option is to override the kernel's virtual bool is_parallelisable() const method to return false. This will make the scheduler execute the kernel using just one thread.
https://github.com/ARM-software/ComputeLibrary/blob/main/arm_compute/core/IKernel.h#L49
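For illustration, the override could look something like this (a minimal sketch; the kernel class name is hypothetical and the rest of the kernel interface is elided):

// Hypothetical kernel forced to run single-threaded.
// is_parallelisable() is declared virtual in IKernel and returns true by default.
class MyNonParallelKernel : public ICPPKernel
{
public:
    bool is_parallelisable() const override
    {
        return false; // the scheduler will run this kernel on one thread
    }
    // configure(), run_op(), name(), ... as usual
};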
Calling NEScheduler::get().set_num_threads() after the operators have been configured may cause problems at runtime.
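In other words, fix the thread count before configuring anything, for example (a minimal sketch; the tensors are assumed to be initialised elsewhere):

NEScheduler::get().set_num_threads(4); // set once, before any configure()
NEGEMM gemm;
gemm.configure(&src, &weights, nullptr, &dst, 1.f, 0.f);
gemm.run(); // executes with the thread count chosen above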
Hope this helps.
Hi @morgolock,
Thanks for your reply!
I have already tried this in my code. When I set the specific kernel to be non-parallelisable, that kernel does execute on a single thread, but the other operators, which should run on 4 threads, show very unstable performance compared with running every operator in the model on 4 threads.
It seems that switching between single-threaded and multi-threaded execution introduces some unexpected overhead.
Would you please check this issue?
My testing was conducted on the Android platform with a Samsung S10.
The problem can easily be reproduced if the CpuAdd kernel backing NEArithmeticAddition is set to be non-parallelisable. Create a simple model with four NEGEMM layers and one NEArithmeticAddition layer, and set NEScheduler::get().set_num_threads(4).
The execution time is much longer than that of the same model with the default (parallelisable) CpuAddKernel on 4 threads.
I have attached a very simple test program that demonstrates this unstable performance when switching thread counts.
#include "arm_compute/runtime/NEON/NEScheduler.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/Allocator.h"
#include "arm_compute/runtime/BlobLifetimeManager.h"
#include "arm_compute/runtime/MemoryManagerOnDemand.h"
#include "arm_compute/runtime/PoolManager.h"
#include "utils/Utils.h"
#include "support/ToolchainSupport.h"
#include "src/core/NEON/NEMath.h"
#include "src/core/NEON/wrapper/intrinsics/intrinsics.h"
#include "utils/ImageLoader.h"
#include <arm_neon.h>
#include <cstdlib>
#include <sstream>
#include <time.h>
#include <unordered_map>
#include <utility>
using namespace arm_compute;
int main()
{
    std::cout << "\n\nBaselineTest for Gemm Test...\n";
    NEScheduler::get().set_num_threads(4);

    // TensorShape is (width, height): A is 27x518400, B is 4x27, DST is 4x518400.
    TensorShape matrixA_shape(27, 518400);
    TensorShape matrixB_shape(4, 27);
    TensorShape matrixDST_shape(4, 518400);

    // Four identical sets of GEMM src/weight/dst tensors.
    Tensor fp_src_nhwc, fp_weight_nhwc, fp_out_nhwc;
    fp_src_nhwc.allocator()->init(TensorInfo(matrixA_shape, 1, DataType::F32, DataLayout::NHWC));
    fp_weight_nhwc.allocator()->init(TensorInfo(matrixB_shape, 1, DataType::F32, DataLayout::NHWC));
    fp_out_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));

    Tensor fp2_src_nhwc, fp2_weight_nhwc, fp2_out_nhwc;
    fp2_src_nhwc.allocator()->init(TensorInfo(matrixA_shape, 1, DataType::F32, DataLayout::NHWC));
    fp2_weight_nhwc.allocator()->init(TensorInfo(matrixB_shape, 1, DataType::F32, DataLayout::NHWC));
    fp2_out_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));

    Tensor fp3_src_nhwc, fp3_weight_nhwc, fp3_out_nhwc;
    fp3_src_nhwc.allocator()->init(TensorInfo(matrixA_shape, 1, DataType::F32, DataLayout::NHWC));
    fp3_weight_nhwc.allocator()->init(TensorInfo(matrixB_shape, 1, DataType::F32, DataLayout::NHWC));
    fp3_out_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));

    Tensor fp4_src_nhwc, fp4_weight_nhwc, fp4_out_nhwc;
    fp4_src_nhwc.allocator()->init(TensorInfo(matrixA_shape, 1, DataType::F32, DataLayout::NHWC));
    fp4_weight_nhwc.allocator()->init(TensorInfo(matrixB_shape, 1, DataType::F32, DataLayout::NHWC));
    fp4_out_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));

    // Element-wise addition tensors (the layer whose kernel is forced single-threaded).
    Tensor fp_bias_nhwc, fp_add_out_nhwc;
    fp_bias_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));
    fp_add_out_nhwc.allocator()->init(TensorInfo(matrixDST_shape, 1, DataType::F32, DataLayout::NHWC));

    // Configure four GEMMs and one addition.
    NEGEMM fp_nhwc;
    fp_nhwc.configure(&fp_src_nhwc, &fp_weight_nhwc, nullptr, &fp_out_nhwc, 1, 0);
    NEGEMM fp_nhwc_2;
    fp_nhwc_2.configure(&fp2_src_nhwc, &fp2_weight_nhwc, nullptr, &fp2_out_nhwc, 1, 0);
    NEGEMM fp_nhwc_3;
    fp_nhwc_3.configure(&fp3_src_nhwc, &fp3_weight_nhwc, nullptr, &fp3_out_nhwc, 1, 0);
    NEGEMM fp_nhwc_4;
    fp_nhwc_4.configure(&fp4_src_nhwc, &fp4_weight_nhwc, nullptr, &fp4_out_nhwc, 1, 0);
    NEArithmeticAddition fp_add;
    fp_add.configure(&fp_out_nhwc, &fp_bias_nhwc, &fp_add_out_nhwc, ConvertPolicy::SATURATE);

    // Tensor allocation.
    fp_src_nhwc.allocator()->allocate();
    fp_weight_nhwc.allocator()->allocate();
    fp_out_nhwc.allocator()->allocate();
    fp_bias_nhwc.allocator()->allocate();
    fp_add_out_nhwc.allocator()->allocate();
    fp2_src_nhwc.allocator()->allocate();
    fp2_weight_nhwc.allocator()->allocate();
    fp2_out_nhwc.allocator()->allocate();
    fp3_src_nhwc.allocator()->allocate();
    fp3_weight_nhwc.allocator()->allocate();
    fp3_out_nhwc.allocator()->allocate();
    fp4_src_nhwc.allocator()->allocate();
    fp4_weight_nhwc.allocator()->allocate();
    fp4_out_nhwc.allocator()->allocate();

    int test_loop = 50;
    auto start = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < test_loop; i++)
    {
        fp_nhwc.run();
        //fp_add.run();
        // Time the second GEMM on its own; per the report above, its latency
        // becomes unstable once CpuAddKernel is forced single-threaded.
        auto iter_start = std::chrono::high_resolution_clock::now();
        fp_nhwc_2.run();
        auto iter_end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> iter_diff = iter_end - iter_start;
        std::cout << "time cost is: " << iter_diff.count() * 1000 << "ms\n";
        //fp_add.run();
        fp_nhwc_3.run();
        //fp_add.run();
        fp_nhwc_4.run();
        fp_add.run();
        fp_add.run();
        fp_add.run();
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> diff = end - start;
    std::cout << "total time cost for the simple model is: " << diff.count() * 1000 / test_loop << "ms\n";
    return 0;
}
In CpuAddKernel.cpp, I adjusted bool is_parallelisable() to return false.
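The change amounts to something like the following in the kernel class (a sketch of my modification, not the stock code):

// Added override in CpuAddKernel: force single-threaded execution.
bool is_parallelisable() const override
{
    return false;
}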
I used the following command to build the library:
scons Werror=0 -j8 debug=0 neon=1 opencl=0 benchmark_examples=0 os=android arch=arm64-v8a
If you run this code, you can tell from the per-iteration execution time of fp_nhwc_2.run() that performance is quite unstable compared with the default setting, where CpuAddKernel's is_parallelisable() returns true.
Hi @GGGGxxxxxxxxr
You could try building the library with cppthreads=0 openmp=1 to enable the OpenMP scheduler.
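For example, combined with the build command above:

scons Werror=0 -j8 debug=0 neon=1 opencl=0 benchmark_examples=0 os=android arch=arm64-v8a cppthreads=0 openmp=1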
Hope this helps.
I am using the Arm Compute Library on an Android CPU with NEON.
I have implemented my own Gemm_int32_to_int8 operator. The whole model I have built performs better with 4 threads, but this operator has relatively little workload, so running it with multiple threads decreases its performance.
I wonder whether I could set a different number of threads for different operators?
At the moment I use NEScheduler::get().set_num_threads() to set the thread count for the whole model.
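Conceptually, the per-operator control I am after would look something like this (a sketch with hypothetical operator names; switching the thread count between runs like this is what seems to cause trouble):

NEScheduler::get().set_num_threads(4);
heavy_gemm.run();                      // large workload, benefits from 4 threads
NEScheduler::get().set_num_threads(1);
gemm_int32_to_int8.run();              // small workload, faster on one thread
NEScheduler::get().set_num_threads(4); // restore for the rest of the model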