Closed by marckleinebudde 2 years ago
Any idea where the bottleneck is at -Os?
The generated code is optimized for size, not for speed. This is exactly what the numbers show: bigger code, more throughput.
But where exactly, I don't know. I've never done performance profiling on a µC before.
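A toy way to see the trade-off (illustration only, not firmware code): compile one small function at both levels and diff the assembly. -Os picks the most compact encodings and uses stricter inlining limits, while -O2 spends extra flash bytes wherever it buys cycles.

```c
/* toy.c -- illustration only. Compare the output of:
 *   arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -Os -S toy.c
 *   arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -O2 -S toy.c */
#include <stdint.h>
#include <stddef.h>

void copy_words(uint32_t *dst, const uint32_t *src, size_t n)
{
	for (size_t i = 0; i < n; i++)
		dst[i] = src[i];
}
```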
Profiling on M0 cores is not trivial without ITM or SWO. At least openocd can do "random" sampling of the PC, which I've just tried for the first time with an -Og build:
```
Flat profile:

Each sample counts as 5.66027e-05 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 10.77      0.06     0.06                             PCD_EP_ISR_Handler
  9.40      0.11     0.05                             main
  7.86      0.16     0.04                             HAL_PCD_IRQHandler
  6.36      0.19     0.04                             USB_EPStartXfer
  5.38      0.22     0.03                             USB_ReadPMA
  5.19      0.25     0.03                             USB_WritePMA
  4.14      0.28     0.02                             led_update_normal_mode
  3.72      0.30     0.02                             can_parse_error_status
  3.46      0.32     0.02                             USB_ReadInterrupts
  2.93      0.33     0.02                             led_set
  2.77      0.35     0.02                             led_update
  2.44      0.36     0.01                             HAL_PCD_EP_DB_Receive
  2.26      0.38     0.01                             HAL_PCD_EP_Receive
  2.21      0.39     0.01                             can_send
  2.13      0.40     0.01                             HAL_PCD_EP_Transmit
  1.98      0.41     0.01                             USBD_GS_CAN_DataOut
  1.74      0.42     0.01                             send_to_host
  1.66      0.43     0.01                             USBD_GS_CAN_SendFrame
  1.65      0.44     0.01                             list_add_locked
  1.51      0.45     0.01                             HAL_GPIO_WritePin
  1.50      0.46     0.01                             USBD_GS_CAN_TxReady
  1.42      0.47     0.01                             list_add_tail_locked
  1.37      0.47     0.01                             USBD_LL_DataInStage
  1.23      0.48     0.01                             USBD_LL_DataOutStage
  1.15      0.49     0.01                             HAL_GetTick
  1.13      0.49     0.01                             USBD_LL_Start
  1.05      0.50     0.01                             USBD_GS_CAN_Transmit
  0.97      0.50     0.01                             USBD_GS_CAN_PrepareReceive
  0.96      0.51     0.01                             USB_Handler
  0.74      0.51     0.00                             timer_get
  0.70      0.52     0.00                             can_get_error_status
  0.68      0.52     0.00                             HAL_PCD_DataOutStageCallback
  0.67      0.53     0.00                             led_indicate_trx
  0.62      0.53     0.00                             HAL_PCD_DataInStageCallback
  0.61      0.53     0.00                             USBD_LL_PrepareReceive
  0.59      0.54     0.00                             USBD_LL_Transmit
  0.54      0.54     0.00                             can_find_free_mailbox
  0.48      0.54     0.00                             USBD_LL_GetRxDataSize
  0.47      0.54     0.00                             status_is_active
  0.45      0.55     0.00                             can_is_rx_pending
  0.36      0.55     0.00                             HAL_PCD_EP_GetRxCount
  0.35      0.55     0.00                             USBD_GS_CAN_DataIn
  0.32      0.55     0.00                             USBD_GS_CAN_DfuDetachRequested
  0.32      0.55     0.00                             HAL_PCD_SetupStageCallback
  0.24      0.56     0.00                             led_set_sequence_step
  0.22      0.56     0.00                             can_enable
  0.14      0.56     0.00                             can_receive
  0.13      0.56     0.00                             assert_failed
  0.13      0.56     0.00                             USBD_GS_CAN_GetStrDesc
  0.12      0.56     0.00                             HAL_GPIO_TogglePin
  0.12      0.56     0.00                             USBD_GS_CAN_Start
  0.09      0.56     0.00                             can_is_enabled
  0.09      0.56     0.00                             HAL_PCD_EP_DB_Transmit
  0.09      0.56     0.00                             led_run_sequence
  0.08      0.56     0.00                             HAL_PCDEx_ActivateLPM
  0.07      0.56     0.00                             HAL_PCD_SOFCallback
  0.07      0.56     0.00                             USBD_LL_SOF
  0.06      0.56     0.00                             NMI_Handler
  0.04      0.56     0.00                             HAL_PCDEx_LPM_Callback
  0.03      0.56     0.00                             HardFault_Handler
  0.02      0.56     0.00                             SysTick_Handler
  0.01      0.56     0.00                             USBD_GS_CAN_SOF
  0.01      0.56     0.00                             HAL_SYSTICK_Callback
  0.01      0.56     0.00                             HAL_SYSTICK_IRQHandler
```
(To get this, I ran openocd as a gdb remote, and from gdb entered `monitor profile 5 test.out 0x8000000 0x8100000` while the target was running a `cansequence -p 10` test; then `gprof cannette_fw test.out -xb`.)
I'm surprised to see how much time we spend updating LEDs!!
Grounding led_update():

```diff
index 5453e736681d..8904a98036a2 100644
--- a/src/led.c
+++ b/src/led.c
@@ -137,6 +137,7 @@ static void led_update_sequence(led_data_t *leds)
 void led_update(led_data_t *leds)
 {
+	return;
 	switch (leds->mode) {
 	case led_mode_off:
```

increases the TX CAN bus load from 80% to 90% (on the replace-queues branch).
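For illustration only (nothing like this is in the tree): rather than stubbing led_update() out entirely, a cheap mitigation could be to rate-limit it from the main loop. A minimal sketch, reusing the existing led_update()/led_data_t API and the HAL 1 ms tick; led_update_throttled() and the header paths are made-up:

```c
/* Hypothetical sketch: run the LED state machine at most once per
 * millisecond instead of on every main-loop iteration. */
#include <stdint.h>
#include "led.h"		/* led_data_t, led_update(); assumed path */
#include "stm32f0xx_hal.h"	/* HAL_GetTick() */

static uint32_t led_last_tick;

void led_update_throttled(led_data_t *leds)
{
	uint32_t now = HAL_GetTick();

	if (now == led_last_tick)
		return;		/* tick hasn't advanced: nothing to do yet */

	led_last_tick = now;
	led_update(leds);
}
```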
I'm in favour of moving to -O2. It shouldn't be much harder to debug than -Os, and meaningful profiling runs would benefit from a test build at -Og anyway. It may still be useful to occasionally compile at other optimization levels to see where the bottlenecks are (if any).
I still think it would be nice to keep (theoretical) support for 16 kB-flash targets, but there are none known at the moment... We could revisit later; possibly with selective optimization on the critical paths (gcc has per-function attributes and pragmas for this IIRC, see the sketch below), etc.
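A minimal sketch of that selective optimization, assuming a global -Os build; usb_hot_path() is a placeholder name, but the attribute and pragmas are standard GCC:

```c
/* Sketch: force -O2 on one hot function while the rest of the
 * translation unit keeps the global -Os. */
__attribute__((optimize("O2")))
void usb_hot_path(void)		/* placeholder name */
{
	/* speed-critical code */
}

/* Or switch the level for a whole region of a file: */
#pragma GCC push_options
#pragma GCC optimize ("O2")
/* ... speed-critical functions ... */
#pragma GCC pop_options
```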
Would be interesting to see what LTO does with/to different optimization levels 🤔
We've had LTO enabled since 1712fec (~2019); it makes a pretty big difference in size (16912 bytes without vs. 13880 with) and throughput (64% with -flto, 59% without). I don't see a reason to disable -flto! I think the days of -flto breaking builds (due to UB or occasional compiler bugs) are over.
I took the liberty of adding a comment to your commit message about the removal of -fno-move-loop-invariants.
Merged in 0612b5017f322d8675deb8a999312179ee437e48, thanks!
> We've had LTO enabled since 1712fec (~2019); it makes a pretty big difference in size (16912 bytes without vs. 13880 with) and throughput (64% with -flto, 59% without). I don't see a reason to disable -flto! I think the days of -flto breaking builds (due to UB or occasional compiler bugs) are over.
Yes, LTO brings a pretty impressive performance gain and size reduction. I sometimes turn it off to get proper entries in the linker map file.
I was wondering what LTO does during the re-compilation stage if some object files are compiled with -Os and some with -O2.
This increases the code size, but it still fits into the flash even on a STM32F042:
-Os
-O2
This optimization increases the max TX CAN bus load on a STM32F072 (1 MBit/s, DLC=1) from 77% to 84%.
For completeness:
-O3
The max TX CAN bus load is 88%.