arm_scale_f32 slow down

caseymillsdavis commented 4 years ago

I recently upgraded a project from 5.4.0 to 5.6.0 and noticed that a performance-critical algorithm got about 17% slower. I traced it down to arm_scale_f32. The difference seems to be in a change to how the loop is being unrolled.

In 5.6.0 it is the very straightforward:

  while (blkCnt > 0U)
  {
    /* C = A * scale */

    /* Scale input and store result in destination buffer. */
    *pDst++ = (*pSrc++) * scale;

    *pDst++ = (*pSrc++) * scale;

    *pDst++ = (*pSrc++) * scale;

    *pDst++ = (*pSrc++) * scale;

    /* Decrement loop counter */
    blkCnt--;
  }

And in 5.4.0:

  while (blkCnt > 0U)
  {
    /* C = A * scale */
    /* Scale the input and then store the results in the destination buffer. */
    /* read input samples from source */
    in1 = *pSrc;
    in2 = *(pSrc + 1);

    /* multiply with scaling factor */
    in1 = in1 * scale;

    /* read input sample from source */
    in3 = *(pSrc + 2);

    /* multiply with scaling factor */
    in2 = in2 * scale;

    /* read input sample from source */
    in4 = *(pSrc + 3);

    /* multiply with scaling factor */
    in3 = in3 * scale;
    in4 = in4 * scale;
    /* store the result to destination */
    *pDst = in1;
    *(pDst + 1) = in2;
    *(pDst + 2) = in3;
    *(pDst + 3) = in4;

    /* update pointers to process next samples */
    pSrc += 4U;
    pDst += 4U;

    /* Decrement the loop counter */
    blkCnt--;
  }

So in the older version the code is structured to maybe help the compiler along and this is a pattern that recurs in other DSP functions. Why are we not retaining it here? I'm using this library because it has these sorts of hacks - I need the performance!

Note that I'm targetting a CM7 with arm-gcc-7.3.1 and, in general, optimizing for size. (Attributing this function with -O3 had no affect).

caseymillsdavis commented 4 years ago

I tested arm_scale_f32 on random data of different length, for comparison. The last column is milliseconds.

v5.4.0
arm_scale_f32 N=  2048     63
arm_scale_f32 N=  4096    126
arm_scale_f32 N=  8192    238
arm_scale_f32 N= 16384    528
arm_scale_f32 N= 32768   1115

v5.6.0
arm_scale_f32 N=  2048     73
arm_scale_f32 N=  4096    135
arm_scale_f32 N=  8192    300
arm_scale_f32 N= 16384    608
arm_scale_f32 N= 32768   1289

Also attaching the disassembly of the two variants. arm_scale_f32_v560.txt arm_scale_f32_v540.txt

I can get identical code-generation as V540 merely by using temporary variable and grouping the stores at the end of the loop (the hand-interleaving in the original code doesn't get me anything extra).

  while (blkCnt > 0U)
  {
    float32_t in1, in2, in3, in4;

    in1 = *pSrc++ * scale;
    in2 = *pSrc++ * scale;
    in3 = *pSrc++ * scale;
    in4 = *pSrc++ * scale;

    *pDst++ = in1;
    *pDst++ = in2;
    *pDst++ = in3;
    *pDst++ = in4;

    blkCnt--;
  }

christophe0606 commented 4 years ago

Thanks for the analysis.

CMSIS-DSP is a tradeoff between optimization, portability (written C and with minimum number of different versions) and readability.

As consequence, we cannot avoid some variability of the performance for different compilers and different versions of the source code.

It is not possible to satisfy everybody. CMSIS-DSP should be seen as a good enough start and for a specific environment it is often possible to do better.

That being said, I have committed your change in commit https://github.com/ARM-software/CMSIS_5/commit/021148901181a09801bc29aa9365e910c179edc6

I have not seen any difference when building with a different compiler but since the change is minor I don't see any problem at committing it if it can help.

caseymillsdavis commented 4 years ago

Appreciated.

ARM-software / CMSIS_5

arm_scale_f32 slow down #874