ARM-software / CMSIS_5

CMSIS Version 5 Development Repository
http://arm-software.github.io/CMSIS_5/index.html
Apache License 2.0

CMSIS-NN: convolve function - hyper parameters optimization #1107

Closed. lheim closed this issue 3 years ago.

lheim commented 3 years ago

Hey all,

I've got a clarifying question regarding the optimization of the convolution function in cmsis-nn. For the standard convolution, our experimental results show improved performance when the number of filters (and therefore the number of output channels of a convolutional layer) is divisible by 4.

We therefore investigated the implementation for typical kernel sizes (3x3, 5x5, and 7x7). Unfortunately, in arm_nn_mat_mult_kernel_s8_s16.c I cannot find code that would explain the performance behavior we observe with the ARM_MATH_DSP implementation (we're using a Cortex-M4F).

When investigating the implementation (which gets called for two columns of im2col, therefore calculating two pixels of the output), I see that each computation processes two rows of A at a time. Each inner loop therefore calculates two output channels (i.e., two filters of the convolution layer). As seen later, the accumulation over the vectors yields the partial results ch_0_out_0, ch_0_out_1, ch_1_out_0, and ch_1_out_1. So one always computes the partial accumulation for two pixels (out) over two channels (ch) - right?
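For reference, here is a minimal plain-C sketch of the 2x2 blocking as I understand it. The accumulator names mirror the ones in the source; everything else (layouts, loop structure) is my simplification, and the real DSP kernel reads 4 bytes at a time with SIMD MAC instructions and also applies bias and requantization, which I omit here:

```c
#include <stdint.h>

/* Simplified sketch, NOT the actual CMSIS-NN kernel: each outer iteration
 * handles two filter rows (two output channels) against two im2col columns
 * (two output pixels), accumulating a 2x2 block of partial sums. */
void mat_mult_2x2_sketch(const int8_t *filters,  /* [out_ch][col_len]      */
                         const int16_t *im2col,  /* [2][col_len], 2 pixels */
                         int32_t col_len,
                         int32_t out_ch,
                         int32_t *out)           /* [2][out_ch], NHWC-like */
{
    for (int32_t ch = 0; ch + 2 <= out_ch; ch += 2) {
        const int8_t *row_0 = &filters[ch * col_len];
        const int8_t *row_1 = row_0 + col_len;

        int32_t ch_0_out_0 = 0, ch_0_out_1 = 0; /* channel ch,   pixels 0/1 */
        int32_t ch_1_out_0 = 0, ch_1_out_1 = 0; /* channel ch+1, pixels 0/1 */

        for (int32_t i = 0; i < col_len; i++) {
            const int16_t a0 = row_0[i];
            const int16_t a1 = row_1[i];
            const int16_t b0 = im2col[i];           /* pixel 0 */
            const int16_t b1 = im2col[col_len + i]; /* pixel 1 */

            ch_0_out_0 += a0 * b0;
            ch_0_out_1 += a0 * b1;
            ch_1_out_0 += a1 * b0;
            ch_1_out_1 += a1 * b1;
        }

        out[ch]              = ch_0_out_0;
        out[ch + 1]          = ch_1_out_0;
        out[out_ch + ch]     = ch_0_out_1;
        out[out_ch + ch + 1] = ch_1_out_1;
    }
    /* The real kernel handles a leftover odd channel separately. */
}
```

Written this way, the blocking alone would only suggest a divisible-by-2 effect, which is exactly why the factor-of-4 behavior puzzles us.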

If that's the case, I'm wondering why we observe a strong performance increase when the number of filters is divisible by 4 but not when it's only divisible by 2.

Do you have a quick hint as to where our understanding goes wrong? I'd highly appreciate any input.

felix-johnny commented 3 years ago

@lheim Was the divisible by 4 or 2 experiment just on the number of output channels or was it on the input channels as well?

lheim commented 3 years ago

> @lheim Was the divisible by 4 or 2 experiment just on the number of output channels or was it on the input channels as well?

We were using multiple convolutional layers in sequence with the same hyperparameters, so the number of output channels equaled the number of input channels.

felix-johnny commented 3 years ago

In that case, unaligned access would be the cause of the difference in performance.

It isn't really about the number of output channels, but about the number of input channels. For simplicity, let's consider a 1x1 kernel and an input channel count that is only a multiple of 2. The start of the int8 kernels is at least 4-byte aligned coming from TFLM, and that 4-byte alignment matters because we read 4 bytes at a time for optimal memory access. But once the first channel_input bytes have been traversed, the 4-byte alignment is lost, and the next channel_input bytes result in unaligned accesses; only every alternate run of channel_input bytes is aligned (see the sketch below). I would therefore recommend using an input channel count that is a multiple of 4. Similarly, for fully connected layers, k (the input vector length) is recommended to be a multiple of 4.
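As a toy illustration of the address arithmetic (hypothetical base address, not code from the library):

```c
/* Hypothetical illustration: if an int8 weight buffer starts 4-byte
 * aligned, the start of each filter row stays aligned only when the row
 * length (channel_input for a 1x1 kernel) is a multiple of 4. */
#include <stdio.h>

static void show_alignment(int channel_input)
{
    const unsigned base = 0x20000000u; /* assume a 4-byte-aligned start */

    for (int row = 0; row < 4; row++) {
        const unsigned addr = base + (unsigned)(row * channel_input);
        printf("channel_input=%d: row %d starts at 0x%08X (addr %% 4 = %u)\n",
               channel_input, row, addr, addr % 4u);
    }
}

int main(void)
{
    show_alignment(6); /* multiple of 2: rows alternate aligned/unaligned */
    show_alignment(8); /* multiple of 4: every row stays 4-byte aligned   */
    return 0;
}
```

With channel_input = 6, the row start addresses alternate between addr % 4 == 0 and addr % 4 == 2, so every other row is read unaligned; with channel_input = 8, every row stays aligned.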

Hope that helps!

lheim commented 3 years ago

Thanks for the clarification! This makes sense and matches our experimental results.