ARM-software / armnn

Arm NN ML Software. The code here is a read-only mirror of https://review.mlplatform.org/admin/repos/ml/armnn
https://developer.arm.com/products/processors/machine-learning/arm-nn
MIT License

Arm NN inserts Reshape node before FullyConnected #749

Closed · zxros10 closed this issue 6 months ago

zxros10 commented 6 months ago

In my ONNX model there are several MatMul nodes. When the model is converted to TFLite, the MatMul nodes become FullyConnected. When I then execute it with Arm NN and look at the profiling data, I find that a Reshape node is inserted before every FullyConnected and an ExpandDims node after it, and these inserted nodes consume a lot of time. Why does Arm NN need to insert these nodes? How can I improve the performance? Would it help if I stopped the MatMul nodes from being converted to FullyConnected?

(two screenshots attached)

MikeJKelly commented 6 months ago

Hi @zxros10

The FullyConnected implementation only supports 2D inputs, so the Reshape is added to flatten the 3D input to a 2D one (1x800x256 to 800x256). In the same way, the ExpandDims is added to change the output from 800x256 back to 1x800x256. I am surprised that these layers take a long time to run; can you post your profiling data for these layers?

Best regards, Mike
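
As a rough illustration of what the inserted pair computes (plain numpy, not Arm NN code; the 128-unit output size is hypothetical, the other shapes come from this thread):

```python
import numpy as np

x = np.random.rand(1, 800, 256).astype(np.float32)  # 3D input to the MatMul
w = np.random.rand(256, 128).astype(np.float32)     # hypothetical FC weights

flat = x.reshape(800, 256)        # what the inserted Reshape does
y2d = flat @ w                    # the 2D FullyConnected itself
y = np.expand_dims(y2d, axis=0)   # what the inserted ExpandDims does

assert y.shape == (1, 800, 128)   # restored to the model's 3D layout
```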

zxros10 commented 6 months ago

I used https://github.com/morgolock/vison to analyze my profiling data:

| Total time per kernel | Percentage of total time | Kernel name |
| --- | --- | --- |
| 24.5090 us | 0.0011 | activation_layer_quant_f32 |
| 43.7800 us | 0.0020 | activation_layer |
| 45.5170 us | 0.0021 | concatenate_width_x4 |
| 46.9790 us | 0.0021 | elementwise_operation_DIV |
| 70.8790 us | 0.0032 | concatenate_width |
| 185.5250 us | 0.0085 | gemm_reshape_rhs_matrix_t |
| 205.6450 us | 0.0094 | transpose |
| 212.1030 us | 0.0097 | strided_slice |
| 285.2300 us | 0.0130 | elementwise_operation_SUB_quantized |
| 306.3610 us | 0.0140 | quantization_layer |
| 310.0540 us | 0.0141 | activation_layer_quant |
| 550.5810 us | 0.0251 | permute |
| 697.3280 us | 0.0318 | elementwise_operation_ADD_quantized |
| 1030.5740 us | 0.0469 | gemmlowp_matrix_b_reduction |
| 1085.6590 us | 0.0495 | dequantization_layer |
| 1099.9560 us | 0.0501 | tile |
| 1787.1760 us | 0.0814 | reduction_operation_x |
| 3003.1760 us | 0.1368 | pixelwise_mul_quantized |
| 5445.1560 us | 0.2480 | reshape_layer |
| 5518.2940 us | 0.2514 | gemmlowp_mm_reshaped_only_rhs_t_fused_output_stage_fixedpoint |

The reshape_layer kernels take up a lot of time, and these Reshape layers are not only for FullyConnected but also for Add, Mul, and so on. Here is one of the FullyConnected nodes:

| Workload | Time | Kernel |
| --- | --- | --- |
| Reshape_for:FullyConnected:0:2_ClReshapeWorkloadExecute#286 | 406.52 us | reshape_layer |
| FullyConnected:0:2_ClFullyConnectedWorkloadExecute#287 | 44.064 us | transpose |
| | 38.519 us | gemm_reshape_rhs_matrix_t |
| | 173.63 us | gemmlowp_matrix_b_reduction |
| | 2194.185 us | gemmlowp_mm_reshaped_only_rhs_t_fused_output_stage_fixedpoint |
| ExpandDims:0:2_ClReshapeWorkloadExecute#288 | 1204.807 us | |

The time spent in the Reshape_for and ExpandDims workloads is close to the compute time of the FullyConnected kernels themselves.

MikeJKelly commented 6 months ago

How many iterations did you run before collecting these execution times?

The first time you run a network it will be a lot slower as the GpuBackend compiles kernels for each of the workloads during the first inference. The second and subsequent runs will be faster.
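
For what it's worth, a minimal sketch of excluding that first compile-heavy inference from a measurement, using the (since deprecated) PyArmNN bindings; the model path, backend choice and iteration count are placeholders, not details from this issue:

```python
import time
import numpy as np
import pyarmnn as ann

# Parse the TFLite model and optimize it for the GPU backend.
parser = ann.ITfLiteParser()
network = parser.CreateNetworkFromBinaryFile("model.tflite")  # placeholder path
runtime = ann.IRuntime(ann.CreationOptions())
opt_network, _ = ann.Optimize(network, [ann.BackendId("GpuAcc")],
                              runtime.GetDeviceSpec(), ann.OptimizerOptions())
net_id, _ = runtime.LoadNetwork(opt_network)

# Bind tensors; a single-input, single-output model is assumed.
graph_id = 0
in_info = parser.GetNetworkInputBindingInfo(
    graph_id, parser.GetSubgraphInputTensorNames(graph_id)[0])
out_info = parser.GetNetworkOutputBindingInfo(
    graph_id, parser.GetSubgraphOutputTensorNames(graph_id)[0])

data = np.random.rand(1, 800, 256).astype(np.float32)  # shape from this thread
input_tensors = ann.make_input_tensors([in_info], [data])
output_tensors = ann.make_output_tensors([out_info])

# The first inference compiles the OpenCL kernels, so keep it out of the timing.
runtime.EnqueueWorkload(net_id, input_tensors, output_tensors)

iterations = 100
start = time.perf_counter()
for _ in range(iterations):
    runtime.EnqueueWorkload(net_id, input_tensors, output_tensors)
print("mean latency: %.3f ms" % ((time.perf_counter() - start) / iterations * 1e3))
```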

zxros10 commented 6 months ago

I ran 100 iterations. In the end I reduced the feature map dimensions in the model, which got rid of the Reshape around the FullyConnected. Thanks.
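
A short numpy sketch of why that fix works (weight shape again hypothetical): once the activations are kept rank-2, the FullyConnected input already matches what the backend supports, so the bridging Reshape/ExpandDims pair is unnecessary and the result is unchanged:

```python
import numpy as np

w = np.random.rand(256, 128).astype(np.float32)       # hypothetical FC weights
x3d = np.random.rand(1, 800, 256).astype(np.float32)  # original 3D activations

# Original graph: Reshape -> FullyConnected -> ExpandDims around each node.
with_bridging = np.expand_dims(x3d.reshape(800, 256) @ w, axis=0)

# Reduced-dimension model: activations already 2D, a plain FullyConnected.
direct = x3d.squeeze(axis=0) @ w

assert np.allclose(with_bridging.squeeze(axis=0), direct)
```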