ARM-software / CMSIS_5

CMSIS Version 5 Development Repository
http://arm-software.github.io/CMSIS_5/index.html
Apache License 2.0
1.31k stars 1.08k forks

CMSIS-NN arm_convolution_s8 #1446

Open jiaodawuyanzu opened 2 years ago

jiaodawuyanzu commented 2 years ago

In the arm_convolve_s8 function, why are two different functions, arm_nn_mat_mul_core_4x_s8 and arm_nn_mat_mult_s8, used depending on whether padded = 0 or padded = 1?

felix-johnny commented 2 years ago

@jiaodawuyanzu Thanks for the question. The difference comes from the way the input offset has to be handled in the padded versus non-padded case. In the non-padded case we apply an optimization that makes the core loop more efficient: an acc += ker[i] * ip[i] MAC together with a ker_sum += ker[i] operation, rather than acc += ker[i] * (ip[i] + input_offset). This is what is done in arm_nn_mat_mul_core_4x_s8, and it is an 8x8 MAC operation. This optimization however can't be applied when there is padding, and there we compute acc += ker[i] * (ip[i] + input_offset) instead, which is a 16x16 MAC.

jiaodawuyanzu commented 2 years ago

Why can't arm_nn_mat_mul_core_4x_s8 be applied when there is padding?

felix-johnny commented 2 years ago

It would be great if we could, from a performance perspective, but when there is padding the ker_sum += ker[i] operation must not include the kernel positions that do not overlap with the input tensor. It is the mathematics of the optimization that dictates this.

jiaodawuyanzu commented 2 years ago

Thank you for your answer, but I still don't understand why the ker_sum += ker[i] operation cannot be used when there is padding. I hope you can give an example.

felix-johnny commented 2 years ago

@jiaodawuyanzu Maybe it is easier to see the code in pure C. This is the reference implementation from TFLite Micro: https://github.com/tensorflow/tflite-micro/blob/7f018add81d76f1049a5acc746f4ccd7d758b106/tensorflow/lite/kernels/internal/reference/integer_ops/conv.h#L86

What that shows is that the core loop, i.e. the MAC operation, is performed only over the portions where the kernel overlaps the input tensor. For our formulation of acc += ker[i] * ip[i] and ker_sum += ker[i], the acc += ker[i] * ip[i] part is not an issue: the input is zero padded, so the multiply result is zero there and we are good. But the kernel-sum contribution from the padded positions is non-zero, which is incorrect. Beyond this, I would suggest taking an example of a one-channel tensor (you could use the unit tests to create such a case: https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN/Tests/UnitTest) and trying out how it makes a difference. Hope it helps.

jiaodawuyanzu commented 2 years ago

Thank you for your answer. However, I did an experiment: I commented out the code path used with padding (arm_m55_nn_mat_mult_s8) and used the arm_m55_nn_mat_mul_core_4x_s8 function instead, and the result is correct. The acc += ker[i] * ip[i] part is correct because the padded positions are assigned the value -input_offset during im2col initialization, so my understanding is that arm_m55_nn_mat_mul_core_4x_s8 can be used on its own.