@lheim Was the divisible by 4 or 2 experiment just on the number of output channels, or was it on the input channels as well?
We were using multiple convolution layers next to one another with the same hyperparameters, so the number of output and input channels was equal.
In that case, unaligned access would be the cause of the performance difference.
It isn't really about the number of output channels, but about the number of input channels. For simplicity, let's use a 1x1 kernel and an input channel count that is just a multiple of 2. The start of the int8 kernels would be at least 4-byte aligned from TFLM, and that 4-byte alignment is important because we read 4 bytes at a time for optimal memory access. But after the first `channel_input` bytes are traversed, the 4-byte alignment is lost, and the subsequent `channel_input` bytes will result in unaligned accesses. In this case, only every other block of `channel_input` bytes would start aligned. I would recommend using an input channel count that is a multiple of 4 here. Similarly, for fully connected layers, `k` (the input vector length) is recommended to be a multiple of 4.
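For intuition, here is a minimal standalone sketch (my own illustration, not CMSIS-NN code) of that argument: with a 1x1 kernel, filter row `f` starts at offset `f * channel_input` from the weight base, so a 4-byte-aligned base keeps every filter row aligned only when `channel_input` is a multiple of 4. The candidate channel counts below are arbitrary examples.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical input-channel counts to check. */
    const int channel_inputs[] = {2, 3, 4, 6, 8};
    const int n = sizeof(channel_inputs) / sizeof(channel_inputs[0]);

    for (int i = 0; i < n; i++) {
        int ch = channel_inputs[i];
        int all_aligned = 1;
        /* Filter row f starts at offset f * ch from a 4-byte-aligned base;
         * it is itself 4-byte aligned only if that offset is a multiple
         * of 4 for every f. */
        for (int f = 0; f < 8; f++) {
            if ((f * ch) % 4 != 0) {
                all_aligned = 0;
                break;
            }
        }
        printf("channel_input = %d -> all filter rows 4-byte aligned: %s\n",
               ch, all_aligned ? "yes" : "no");
    }
    return 0;
}
```

With `channel_input` equal to 2 or 6, every other filter row starts at an odd multiple of 2, so half of the 4-byte reads end up unaligned.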
Hope that helps!
Thanks for the clarification! This makes sense and matches our experimental results.
Hey all,
I've got a clarifying question regarding the optimization of the convolution function in cmsis-nn. For the normal convolution, our experimental results show improved performance when the number of filters (and therefore the number of output channels of a convolutional layer) is divisible by 4. We therefore investigated the implementation for typical kernel sizes (3x3, 5x5, and 7x7). Unfortunately, I cannot find the code corresponding to the performance behavior we observe in `arm_nn_mat_mult_kernel_s8_s16.cc` for the `ARM_MATH_DSP` implementation (we're using a Cortex-M4F).

When investigating the implementation (which gets called for 2 columns of `im2col`, therefore calculating two pixels of the output), I see that each computation gets called for two rows of `A`. Therefore, in each inner loop we calculate 2 output channels (also referred to as the number of filters of the convolution layer). As seen later, the accumulation over the vector yields the partial results `ch_0_out_0`, `ch_0_out_1`, `ch_1_out_0`, `ch_1_out_1`. So the kernel always computes the partial accumulation for two pixels (`out`) across two channels (`ch`), right? If that's the case, I'm wondering why we observe a strong performance increase when the number of filters is divisible by 4 rather than just by 2.
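For reference, here is a simplified scalar sketch of the accumulation pattern as we understand it (our own illustration, not the actual CMSIS-NN source, which uses DSP intrinsics and reads 4 bytes at a time). The function name and layout assumptions are hypothetical.

```c
#include <stdint.h>

/* One pass of the 2x2 micro-kernel as we read it: two output channels
 * (rows of A) accumulated for two output pixels (columns of im2col). */
void mat_mult_2x2_sketch(const int8_t *A,   /* weights: rows of length col_len  */
                         const int8_t *B,   /* im2col: two contiguous columns   */
                         int32_t *out,      /* 4 results: 2 channels x 2 pixels */
                         int col_len)
{
    int32_t ch_0_out_0 = 0, ch_0_out_1 = 0;
    int32_t ch_1_out_0 = 0, ch_1_out_1 = 0;

    const int8_t *a0 = A;            /* filter row 0 */
    const int8_t *a1 = A + col_len;  /* filter row 1 */

    for (int i = 0; i < col_len; i++) {
        ch_0_out_0 += a0[i] * B[i];
        ch_0_out_1 += a0[i] * B[col_len + i];
        ch_1_out_0 += a1[i] * B[i];
        ch_1_out_1 += a1[i] * B[col_len + i];
    }

    out[0] = ch_0_out_0; out[1] = ch_0_out_1;
    out[2] = ch_1_out_0; out[3] = ch_1_out_1;
}
```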
Do you have a quick hint as to where our understanding goes wrong? I'd highly appreciate any input.