Performance of single precision floating point multiplication using CMSIS-DSP

latreppuw commented 3 years ago

Using an ARM Cortex-A7, I'm optimizing an element-wise multiplication of two one-dimensional arrays using arm_mult_f32.

compiler: gcc 9.3.0
flags: -marm -mfpu=neon-vfpv4 -mfloat-abi=hard -mcpu=cortex-a7 -fstack-protector-strong -funsafe-math-optimizations -Wformat -Wformat-security -Werror=format-security -O3
definitions: ARM_MATH_NEON ARM_MATH_LOOPUNROLL

A benchmark of this simple function:


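#include "arm_math.h"

int main(void) {

    unsigned int N = 100000000;      /* number of calls to arm_mult_f32 */
    unsigned int blockSize = 32;     /* samples processed per call */

    /* zero-initialized so the benchmark never reads indeterminate values */
    float32_t vinp1[32] __attribute__((aligned (128))) = {0};
    float32_t vinp2[32] __attribute__((aligned (128))) = {0};
    float32_t voutp[32] __attribute__((aligned (128))) = {0};

    while (N--)
    {
        arm_mult_f32(&vinp1[0], &vinp2[0], &voutp[0], blockSize);
    }

    return 0;
}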

reveals a runtime of ~13.39 s, i.e. ~4.18 ns per multiplication, or roughly 4 cycles for one floating point multiplication (assuming a ~1 GHz CPU clock rate). The numbers do not change significantly if the blockSize is increased.
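For reference, the per-multiplication figure follows directly from the runtime, the number of calls and the block size. The sketch below only reproduces that arithmetic; the helper names are made up for illustration, and the 1 GHz clock is the same assumption as in the text, not a measured value.

```c
#include <stdint.h>

/* Hypothetical helper: nanoseconds per single float multiplication,
   given the total runtime, the number of calls and the block size. */
static double ns_per_mult(double runtime_s, uint64_t calls, uint32_t block_size)
{
    return runtime_s * 1e9 / ((double)calls * (double)block_size);
}

/* Hypothetical helper: cycles per multiplication for an ASSUMED core
   clock in GHz (1 ns at 1 GHz is one cycle). */
static double cycles_per_mult(double ns, double clock_ghz)
{
    return ns * clock_ghz;
}
```

With the numbers from the benchmark, ns_per_mult(13.39, 100000000, 32) gives about 4.18, and cycles_per_mult of that value at 1.0 GHz gives about 4.2 cycles per multiplication, i.e. roughly 16 to 17 cycles per 4-wide loop iteration.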

What is the expectation here? I had rather expected something in the order of 1 cycle/multiplication with NEON optimizations.

christophe0606 commented 3 years ago

@latreppuw I am not too familiar with the Cortex-A7, but from the spec I can see that Load / Store is 64 bits wide.

So that is already at least 6 cycles for memory accesses (2 loads and 1 store of 128 bits each, at 2 cycles per access) and 1 for the mul.

It is still far from the 16 cycles you have for the loop (4 cycles per sample).

So I don't know where the remaining 9 cycles are coming from.
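The estimate above can be written down as a small calculation. This is only a sketch of that back-of-the-envelope budget: the function names are hypothetical, the 64-bit path width comes from the reading of the spec above, and the 1-cycle vmul and the absence of any overlap between instructions are assumptions.

```c
/* Cycles needed to move `access_bits` over a `path_bits`-wide
   load/store path (rounded up to whole cycles). */
static unsigned access_cycles(unsigned access_bits, unsigned path_bits)
{
    return (access_bits + path_bits - 1u) / path_bits;
}

/* Naive lower bound for one loop iteration: two loads, one store and
   one multiply, with nothing overlapping (an assumption). */
static unsigned loop_lower_bound(unsigned vector_bits, unsigned path_bits,
                                 unsigned mul_cycles)
{
    return 3u * access_cycles(vector_bits, path_bits) + mul_cycles;
}
```

For 128-bit vectors on a 64-bit path with an assumed 1-cycle multiply, loop_lower_bound(128, 64, 1) is 7, which leaves 9 of the measured 16 cycles per iteration unexplained.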

gcc should unroll the loop a little, so the loop management overhead should be amortized and not have a big impact. But this needs to be checked in the generated asm code.

There is a branch predictor, so this part of the loop should not be a big problem either.

latreppuw commented 3 years ago

The relevant ASM code seems to be straightforward: two loads, the multiplication, and the store instruction, plus loop management:

.L3:
    vld1.32 {d18-d19}, [lr]!    @ _35, MEM[(const float[4] *)pSrcA_46]
    vld1.32 {d16-d17}, [r4]!    @ _34, MEM[(const float[4] *)pSrcB_49]
    subs    ip, ip, #1  @ blkCnt, blkCnt,
    vmul.f32    q8, q8, q9  @ tmp134, _34, _35
    vst1.32 {d16-d17}, [r5]!    @ tmp134, MEM[(float[4] *)pDst_52]
    bne .L3     @,

I see little scope left for improvements. Does anybody have an idea what I could check next?

christophe0606 commented 3 years ago

@latreppuw I think the Neon unit on the Cortex-A7 may not have 4 multipliers, so the vmul will take more than 1 cycle.

There will probably be a stall between the 2 loads.
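One experiment that could help narrow this down is a variant that handles two independent 128-bit vectors per iteration, so that the loads, multiplies and stores of one vector do not depend on the other. The sketch below is not part of CMSIS-DSP: mult_f32_x8 is a hypothetical name, and whether it is actually faster on a Cortex-A7 is untested here. It uses the standard NEON intrinsics when available and falls back to plain scalar code otherwise.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper, NOT a CMSIS-DSP function: element-wise multiply
   of a and b into dst, 8 floats per iteration on NEON targets. */
static void mult_f32_x8(const float *a, const float *b, float *dst, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 8 <= n; i += 8) {
        /* Issue all four loads first; the two vector pairs are independent. */
        float32x4_t a0 = vld1q_f32(a + i);
        float32x4_t a1 = vld1q_f32(a + i + 4);
        float32x4_t b0 = vld1q_f32(b + i);
        float32x4_t b1 = vld1q_f32(b + i + 4);
        vst1q_f32(dst + i,     vmulq_f32(a0, b0));
        vst1q_f32(dst + i + 4, vmulq_f32(a1, b1));
    }
#endif
    /* Scalar tail; also the complete fallback on non-NEON builds. */
    for (; i < n; ++i) {
        dst[i] = a[i] * b[i];
    }
}
```

Timing this against arm_mult_f32 with the same blockSize, and comparing the generated asm, would show whether the cost per sample is dominated by the load/store path or by the multiply itself.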

ARM-software / CMSIS_5

Performance of single precision floating point multiplication using CMSIS-DSP #1300