ARM-software / armnn

Arm NN ML Software. The code here is a read-only mirror of https://review.mlplatform.org/admin/repos/ml/armnn
https://developer.arm.com/products/processors/machine-learning/arm-nn
MIT License

Arm NN inserts Reshape node before FullyConnected #749

Closed · zxros10 closed this issue 6 months ago

zxros10 commented 6 months ago

In my ONNX model there are several MatMul nodes. When the model is converted to TFLite, the MatMul nodes become FullyConnected. When I then execute it with Arm NN and look at the profiling data, I find that a Reshape node is inserted before every FullyConnected and an ExpandDims node after it, and these inserted nodes consume a lot of time. Why does Arm NN need to insert these nodes? How can I improve the performance? Would it help if I stopped the MatMul nodes from being converted to FullyConnected?

(two screenshots attached)

MikeJKelly commented 6 months ago

Hi @zxros10

The FullyConnected implementation only supports 2D inputs, so the Reshape is added to flatten the 3D input to a 2D one (1x800x256 to 800x256). In the same way, the ExpandDims is added to change the output from 800x256 back to 1x800x256. I am surprised that these layers take a long time to run; can you post your profiling data for these layers?

Best regards, Mike
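
As a rough illustration of what the inserted pair computes (plain numpy, not Arm NN code; the 128-unit output size is hypothetical, the other shapes come from this thread):

```python
import numpy as np

x = np.random.rand(1, 800, 256).astype(np.float32)  # 3D input to the MatMul
w = np.random.rand(256, 128).astype(np.float32)     # hypothetical FC weights

flat = x.reshape(800, 256)        # what the inserted Reshape does
y2d = flat @ w                    # the 2D FullyConnected itself
y = np.expand_dims(y2d, axis=0)   # what the inserted ExpandDims does

assert y.shape == (1, 800, 128)   # restored to the model's 3D layout
```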

zxros10 commented 6 months ago

I used https://github.com/morgolock/vison to analyze my profiling data:

| Total time per kernel | Percentage of total time | Kernel name |
| --- | --- | --- |
| 24.5090 us | 0.0011 | activation_layer_quant_f32 |
| 43.7800 us | 0.0020 | activation_layer |
| 45.5170 us | 0.0021 | concatenate_width_x4 |
| 46.9790 us | 0.0021 | elementwise_operation_DIV |
| 70.8790 us | 0.0032 | concatenate_width |
| 185.5250 us | 0.0085 | gemm_reshape_rhs_matrix_t |
| 205.6450 us | 0.0094 | transpose |
| 212.1030 us | 0.0097 | strided_slice |
| 285.2300 us | 0.0130 | elementwise_operation_SUB_quantized |
| 306.3610 us | 0.0140 | quantization_layer |
| 310.0540 us | 0.0141 | activation_layer_quant |
| 550.5810 us | 0.0251 | permute |
| 697.3280 us | 0.0318 | elementwise_operation_ADD_quantized |
| 1030.5740 us | 0.0469 | gemmlowp_matrix_b_reduction |
| 1085.6590 us | 0.0495 | dequantization_layer |
| 1099.9560 us | 0.0501 | tile |
| 1787.1760 us | 0.0814 | reduction_operation_x |
| 3003.1760 us | 0.1368 | pixelwise_mul_quantized |
| 5445.1560 us | 0.2480 | reshape_layer |
| 5518.2940 us | 0.2514 | gemmlowp_mm_reshaped_only_rhs_t_fused_output_stage_fixedpoint |

The reshape_layer kernels take up a lot of time, and these Reshape layers are not only for FullyConnected but also for Add, Mul, and so on. Here is one of the FullyConnected nodes:

| Workload | Time | Kernel |
| --- | --- | --- |
| Reshape_for:FullyConnected:0:2_ClReshapeWorkloadExecute#286 | 406.52 us | reshape_layer |
| FullyConnected:0:2_ClFullyConnectedWorkloadExecute#287 | 44.064 us | transpose |
| | 38.519 us | gemm_reshape_rhs_matrix_t |
| | 173.63 us | gemmlowp_matrix_b_reduction |
| | 2194.185 us | gemmlowp_mm_reshaped_only_rhs_t_fused_output_stage_fixedpoint |
| ExpandDims:0:2_ClReshapeWorkloadExecute#288 | 1204.807 us | |

The time spent in the Reshape_for and ExpandDims workloads is close to the compute time of the FullyConnected kernels themselves.

MikeJKelly commented 6 months ago

How many iterations did you run before collecting these execution times?

The first time you run a network it will be a lot slower as the GpuBackend compiles kernels for each of the workloads during the first inference. The second and subsequent runs will be faster.
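
For what it's worth, a minimal sketch of excluding that first compile-heavy inference from a measurement, using the (since deprecated) PyArmNN bindings; the model path, backend choice and iteration count are placeholders, not details from this issue:

```python
import time
import numpy as np
import pyarmnn as ann

# Parse the TFLite model and optimize it for the GPU backend.
parser = ann.ITfLiteParser()
network = parser.CreateNetworkFromBinaryFile("model.tflite")  # placeholder path
runtime = ann.IRuntime(ann.CreationOptions())
opt_network, _ = ann.Optimize(network, [ann.BackendId("GpuAcc")],
                              runtime.GetDeviceSpec(), ann.OptimizerOptions())
net_id, _ = runtime.LoadNetwork(opt_network)

# Bind tensors; a single-input, single-output model is assumed.
graph_id = 0
in_info = parser.GetNetworkInputBindingInfo(
    graph_id, parser.GetSubgraphInputTensorNames(graph_id)[0])
out_info = parser.GetNetworkOutputBindingInfo(
    graph_id, parser.GetSubgraphOutputTensorNames(graph_id)[0])

data = np.random.rand(1, 800, 256).astype(np.float32)  # shape from this thread
input_tensors = ann.make_input_tensors([in_info], [data])
output_tensors = ann.make_output_tensors([out_info])

# The first inference compiles the OpenCL kernels, so keep it out of the timing.
runtime.EnqueueWorkload(net_id, input_tensors, output_tensors)

iterations = 100
start = time.perf_counter()
for _ in range(iterations):
    runtime.EnqueueWorkload(net_id, input_tensors, output_tensors)
print("mean latency: %.3f ms" % ((time.perf_counter() - start) / iterations * 1e3))
```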

zxros10 commented 6 months ago

I ran 100 iterations. In the end I reduced the feature map dimensions in the model, which got rid of the Reshape around the FullyConnected. Thanks.
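
A short numpy sketch of why that fix works (weight shape again hypothetical): once the activations are kept rank-2, the FullyConnected input already matches what the backend supports, so the bridging Reshape/ExpandDims pair is unnecessary and the result is unchanged:

```python
import numpy as np

w = np.random.rand(256, 128).astype(np.float32)       # hypothetical FC weights
x3d = np.random.rand(1, 800, 256).astype(np.float32)  # original 3D activations

# Original graph: Reshape -> FullyConnected -> ExpandDims around each node.
with_bridging = np.expand_dims(x3d.reshape(800, 256) @ w, axis=0)

# Reduced-dimension model: activations already 2D, a plain FullyConnected.
direct = x3d.squeeze(axis=0) @ w

assert np.allclose(with_bridging.squeeze(axis=0), direct)
```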