Closed caseymillsdavis closed 4 years ago
I tested arm_scale_f32 on random data of different length, for comparison. The last column is milliseconds.
v5.4.0
arm_scale_f32 N= 2048 63
arm_scale_f32 N= 4096 126
arm_scale_f32 N= 8192 238
arm_scale_f32 N= 16384 528
arm_scale_f32 N= 32768 1115
v5.6.0
arm_scale_f32 N= 2048 73
arm_scale_f32 N= 4096 135
arm_scale_f32 N= 8192 300
arm_scale_f32 N= 16384 608
arm_scale_f32 N= 32768 1289
Also attaching the disassembly of the two variants. arm_scale_f32_v560.txt arm_scale_f32_v540.txt
I can get identical code-generation as V540 merely by using temporary variable and grouping the stores at the end of the loop (the hand-interleaving in the original code doesn't get me anything extra).
while (blkCnt > 0U)
{
float32_t in1, in2, in3, in4;
in1 = *pSrc++ * scale;
in2 = *pSrc++ * scale;
in3 = *pSrc++ * scale;
in4 = *pSrc++ * scale;
*pDst++ = in1;
*pDst++ = in2;
*pDst++ = in3;
*pDst++ = in4;
blkCnt--;
}
Thanks for the analysis.
CMSIS-DSP is a tradeoff between optimization, portability (written C and with minimum number of different versions) and readability.
As consequence, we cannot avoid some variability of the performance for different compilers and different versions of the source code.
It is not possible to satisfy everybody. CMSIS-DSP should be seen as a good enough start and for a specific environment it is often possible to do better.
That being said, I have committed your change in commit https://github.com/ARM-software/CMSIS_5/commit/021148901181a09801bc29aa9365e910c179edc6
I have not seen any difference when building with a different compiler but since the change is minor I don't see any problem at committing it if it can help.
Appreciated.
I recently upgraded a project from 5.4.0 to 5.6.0 and noticed that a performance-critical algorithm got about 17% slower. I traced it down to arm_scale_f32. The difference seems to be in a change to how the loop is being unrolled.
In 5.6.0 it is the very straightforward:
And in 5.4.0:
So in the older version the code is structured to maybe help the compiler along and this is a pattern that recurs in other DSP functions. Why are we not retaining it here? I'm using this library because it has these sorts of hacks - I need the performance!
Note that I'm targetting a CM7 with arm-gcc-7.3.1 and, in general, optimizing for size. (Attributing this function with -O3 had no affect).