Closed latreppuw closed 3 years ago
@latreppuw I am not too familiar with the Cortex-A7, but from the spec I can see that the load/store path is 64 bits wide.
So each 128-bit access needs at least 2 cycles: that is already at least 6 cycles for memory accesses (2 loads and 1 store), plus 1 for the mul.
That is still far from the 16 cycles you measure for the loop (4 cycles per sample).
So I don't know where the remaining 9 cycles are coming from.
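The estimate above can be written out as a small calculation. This is only a sketch of that back-of-envelope model, not a Cortex-A7 timing model: the function name `est_cycles_per_iter` and its parameters are hypothetical, and the 64-bit load/store width and the 1-cycle multiply are the assumptions made in the comment.

```c
#include <assert.h>

/* Rough lower bound on cycles per loop iteration.
 * Assumption (from the comment, not verified against the core's
 * documentation): a vector access of vec_bits is split into
 * vec_bits / lsu_bits transfers, each costing one cycle. */
static unsigned est_cycles_per_iter(unsigned vec_bits, unsigned lsu_bits,
                                    unsigned n_loads, unsigned n_stores,
                                    unsigned mul_cycles)
{
    assert(lsu_bits > 0 && vec_bits % lsu_bits == 0);
    unsigned cycles_per_access = vec_bits / lsu_bits;
    return (n_loads + n_stores) * cycles_per_access + mul_cycles;
}
```

With the values from the comment, `est_cycles_per_iter(128, 64, 2, 1, 1)` gives 7, which leaves 16 - 7 = 9 cycles per iteration unexplained.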
gcc should unroll the loop a little, so the loop management overhead should be amortized and not have a big impact. But this needs to be checked in the generated asm code.
There is a branch predictor, so this part of the loop should not be a big problem either.
The relevant ASM code seems to be straightforward: two loads, the multiplication, and the store instruction, plus loop management:
.L3:
vld1.32 {d18-d19}, [lr]! @ _35, MEM[(const float[4] *)pSrcA_46]
vld1.32 {d16-d17}, [r4]! @ _34, MEM[(const float[4] *)pSrcB_49]
subs ip, ip, #1 @ blkCnt, blkCnt,
vmul.f32 q8, q8, q9 @ tmp134, _34, _35
vst1.32 {d16-d17}, [r5]! @ tmp134, MEM[(float[4] *)pDst_52]
bne .L3 @,
I see little scope left for improvement. Does anybody have an idea what I could check next?
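For readers who prefer C to assembly, the loop above corresponds roughly to the following intrinsics code. This is a sketch and not the CMSIS-DSP source: the function name `mult_f32_sketch` is hypothetical, and a scalar fallback is included (an addition for portability) so that it also builds on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Element-wise multiply dst[i] = a[i] * b[i] for n floats. */
static void mult_f32_sketch(const float *a, const float *b,
                            float *dst, size_t n)
{
    assert(a != NULL && b != NULL && dst != NULL);
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Four floats per iteration: two 128-bit loads, one vector
     * multiply, one 128-bit store, as in the .L3 loop. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vmulq_f32(va, vb));
    }
#endif
    /* Scalar tail, or the whole array when NEON is unavailable. */
    for (; i < n; i++) {
        dst[i] = a[i] * b[i];
    }
}
```

Comparing the compiler output for a loop like this with the listing above is one way to confirm that no extra instructions are hiding in the hot path.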
@latreppuw I think the NEON unit on the Cortex-A7 may not have four multipliers, so the vmul will take more than 1 cycle.
There will probably also be a stall between the two loads.
Using an ARM Cortex-A7, I'm optimizing an element-wise multiplication of two one-dimensional arrays using arm_mult_f32.
compiler: gcc 9.3.0
flags: -marm -mfpu=neon-vfpv4 -mfloat-abi=hard -mcpu=cortex-a7 -fstack-protector-strong -funsafe-math-optimizations -Wformat -Wformat-security -Werror=format-security -O3
definitions: ARM_MATH_NEON ARM_MATH_LOOPUNROLL
Considering a benchmark of this simple function:
reveals a runtime of ~13.39 s, i.e. ~4.18 ns per multiplication, or roughly 4 cycles per floating-point multiplication (assuming a ~1 GHz CPU clock rate). The numbers do not change significantly if the blockSize is increased.
What is the expectation here? I had expected something on the order of 1 cycle per multiplication with NEON optimizations.
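A measurement of this kind can be sketched as follows. This is not the benchmark used above: the names `mult_ref`, `ns_to_cycles` and `bench_ns_per_mult` are hypothetical, `mult_ref` is a plain C stand-in for arm_mult_f32 so that the sketch builds without CMSIS-DSP, and `clock()` measures CPU time at coarse resolution, so short runs will be noisy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Stand-in for arm_mult_f32 (assumption: same element-wise semantics). */
static void mult_ref(const float *a, const float *b, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] * b[i];
    }
}

/* Convert a per-multiplication time in ns into cycles at cpu_hz. */
static double ns_to_cycles(double ns, double cpu_hz)
{
    assert(cpu_hz > 0.0);
    return ns * 1e-9 * cpu_hz;
}

/* Time `reps` calls on blocks of `n` floats; returns ns per multiplication. */
static double bench_ns_per_mult(size_t n, size_t reps)
{
    assert(n > 0 && reps > 0);
    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    float *d = malloc(n * sizeof *d);
    assert(a != NULL && b != NULL && d != NULL);
    for (size_t i = 0; i < n; i++) {
        a[i] = (float)i;
        b[i] = 0.5f;
    }

    volatile float sink = 0.0f;   /* keeps the results live */
    clock_t t0 = clock();
    for (size_t r = 0; r < reps; r++) {
        mult_ref(a, b, d, n);
        sink += d[r % n];
    }
    clock_t t1 = clock();
    (void)sink;

    free(a);
    free(b);
    free(d);
    double seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return seconds * 1e9 / ((double)n * (double)reps);
}
```

Under the ~1 GHz assumption from the post, `ns_to_cycles(4.18, 1e9)` is about 4.18 cycles per multiplication, which matches the "roughly 4 cycles" figure.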