Cortex-M4 loop alignment

rsaxvc commented 1 year ago

Feature request

Is your feature request related to a problem? Please describe. I'd like to suggest a potential small optimization for Cortex-M3 and M4, possibly M7.

Describe the solution you'd like This ticket tracks the testing and tuning of different loop alignment options for ArduPilot and CMSIS. I'd like feedback on proposed loop tuning.

Describe alternatives you've considered An alternative would be to keep it as-is, which results in no additional loop padding(wasted space).

Platform [ ] All [ ] AntennaTracker [X] Copter [ ] Plane [ ] Rover [ ] Submarine

Additional context On several M4 devices, I've run into a per-iteration penalty cycle for loops that start with a 32-bit instruction that is only 16-bit aligned. I've noticed a few places this occurs in CMSIS in ArduPilot, and wanted to suggest considering tuning the -falign-loops=4 parameter, which would result in adding a narrow NOP(1 loop entry overhead) just before each unaligned loop(even those starting with a narrow instruction). Here's an example from CMSIS:

08138138 <arm_mult_f32>:
 8138138:       ea5f 0c93       movs.w  ip, r3, lsr #2
 813813c:       b4f0            push    {r4, r5, r6, r7}
 813813e:       d033            beq.n   81381a8 <arm_mult_f32+0x70>
 8138140:       f100 0610       add.w   r6, r0, #16
 8138144:       f101 0510       add.w   r5, r1, #16
 8138148:       f102 0410       add.w   r4, r2, #16
 813814c:       4667            mov     r7, ip
//loop to process 4 samples per iteration
 813814e:       ed15 7a04       vldr    s14, [r5, #-16] //This loop starts with a 16-bit aligned 32-bit wide instruction, which requires two 32-bit instruction fetches
 8138152:       3f01            subs    r7, #1
 8138154:       ed56 7a04       vldr    s15, [r6, #-16]
 8138158:       f105 0510       add.w   r5, r5, #16
 813815c:       f106 0610       add.w   r6, r6, #16
 8138160:       f104 0410       add.w   r4, r4, #16
 8138164:       ee67 7a87       vmul.f32        s15, s15, s14
 8138168:       ed44 7a08       vstr    s15, [r4, #-32] ; 0xffffffe0
 813816c:       ed15 7a07       vldr    s14, [r5, #-28] ; 0xffffffe4
 8138170:       ed56 7a07       vldr    s15, [r6, #-28] ; 0xffffffe4
 8138174:       ee67 7a87       vmul.f32        s15, s15, s14
 8138178:       ed44 7a07       vstr    s15, [r4, #-28] ; 0xffffffe4
 813817c:       ed15 7a06       vldr    s14, [r5, #-24] ; 0xffffffe8
 8138180:       ed56 7a06       vldr    s15, [r6, #-24] ; 0xffffffe8
 8138184:       ee67 7a87       vmul.f32        s15, s15, s14
 8138188:       ed44 7a06       vstr    s15, [r4, #-24] ; 0xffffffe8
 813818c:       ed56 7a05       vldr    s15, [r6, #-20] ; 0xffffffec
 8138190:       ed15 7a05       vldr    s14, [r5, #-20] ; 0xffffffec
 8138194:       ee67 7a87       vmul.f32        s15, s15, s14
 8138198:       ed44 7a05       vstr    s15, [r4, #-20] ; 0xffffffec
 813819c:       d1d7            bne.n   813814e <arm_mult_f32+0x16> //Branch target is aligned on 2
 813819e:       ea4f 140c       mov.w   r4, ip, lsl #4
 ...rest of function handles remaining 0-3 samples, and has been omitted...

I suspect M7 has a 64-bit or wider fetch interface, so this would may impact it when a 32-bit instruction stradles two instruction fetches. And Cortex-M7 has a branch predictor, so may not suffer from this as often as the M4.

magicrub commented 1 year ago

Have you tried compiling with that option? How does it effect code-size?

tridge commented 1 year ago

@rsaxvc definately worth trying, do you need help benchmarking this?

rsaxvc commented 1 year ago

There's two parts here, switching CMSIS and switching ArduPilot.

@magicrub , adding -falign-loops=4 to just the ArduPilot files seems to add only +20B on the KakuteH7 copter 4.4 build. I wonder if GCC is tending towards aligning most loops already on M7. Haven't rebuilt an M4 with CMSIS yet.

@tridge, help benchmarking would be great.

CMSIS might also benefit from -falign-functions like the rest of the AP codebase, which is currently aligning functions at 16B.

rsaxvc commented 1 year ago

I built a few boards to look at the differences in codespace vs instructions. In short, I don't think this is worth looking into for the main ArduPilot codebase. Each -falign=4 build added:

KakuteH7: +16-24 bytes.
Pixhawk1: +8-16 bytes.
speedybeef4v3: 0-8bytes.

Usually this impacts the following functions(which vary per build):

NavEKF2::UpdateFilter() -moves epilogue nop.n from end of function to just before loop-start to align loop. +0B
NavEKF2::requestYawReset() -adds nop.n to loop-start, pads epilogue with nop.n, +4B.
ChibiOS::RCOutput::_set_neopixel_rgb_data(...) - same as requestYawReset()
ChibiOS::RCOutput::_set_profiled_rgb_data(...) - same as requestYawReset()
ChibiOS::RCOutput::fill_DMA_buffer_serial_led(...) - same as requestYawReset()

I don't think this may not be worth the effort of profiling for the bulk of ArduPilot, because UpdateFilter() seems like the only thing possibly useful for tuning, but that loop is large enough that the penalty cycle would always be a small part.

I'll probably pick this up for the CMSIS stuff when I get back to building that from source.

ArduPilot / ardupilot

Cortex-M4 loop alignment #24635

Feature request