Closed by marckleinebudde 2 years ago
Any idea where the bottleneck is at -Os?
The generated code is optimized for size, not for speed. This is exactly what the numbers show: bigger code, more throughput.
But where exactly, I don't know. I've never done performance profiling on a µC before.
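A toy way to see the trade-off (illustration only, not firmware code): compile one small function at both levels and diff the assembly. -Os picks the most compact encodings and uses stricter inlining limits, while -O2 spends extra flash bytes wherever it buys cycles.

```c
/* toy.c -- illustration only. Compare the output of:
 *   arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -Os -S toy.c
 *   arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -O2 -S toy.c */
#include <stdint.h>
#include <stddef.h>

void copy_words(uint32_t *dst, const uint32_t *src, size_t n)
{
	for (size_t i = 0; i < n; i++)
		dst[i] = src[i];
}
```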
Profiling on M0 cores is not trivial without ITM or SWO. At least openocd can do "random" sampling of the PC, which I've just tried for the first time with an -Og build:
```
Flat profile:

Each sample counts as 5.66027e-05 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 10.77      0.06     0.06                             PCD_EP_ISR_Handler
  9.40      0.11     0.05                             main
  7.86      0.16     0.04                             HAL_PCD_IRQHandler
  6.36      0.19     0.04                             USB_EPStartXfer
  5.38      0.22     0.03                             USB_ReadPMA
  5.19      0.25     0.03                             USB_WritePMA
  4.14      0.28     0.02                             led_update_normal_mode
  3.72      0.30     0.02                             can_parse_error_status
  3.46      0.32     0.02                             USB_ReadInterrupts
  2.93      0.33     0.02                             led_set
  2.77      0.35     0.02                             led_update
  2.44      0.36     0.01                             HAL_PCD_EP_DB_Receive
  2.26      0.38     0.01                             HAL_PCD_EP_Receive
  2.21      0.39     0.01                             can_send
  2.13      0.40     0.01                             HAL_PCD_EP_Transmit
  1.98      0.41     0.01                             USBD_GS_CAN_DataOut
  1.74      0.42     0.01                             send_to_host
  1.66      0.43     0.01                             USBD_GS_CAN_SendFrame
  1.65      0.44     0.01                             list_add_locked
  1.51      0.45     0.01                             HAL_GPIO_WritePin
  1.50      0.46     0.01                             USBD_GS_CAN_TxReady
  1.42      0.47     0.01                             list_add_tail_locked
  1.37      0.47     0.01                             USBD_LL_DataInStage
  1.23      0.48     0.01                             USBD_LL_DataOutStage
  1.15      0.49     0.01                             HAL_GetTick
  1.13      0.49     0.01                             USBD_LL_Start
  1.05      0.50     0.01                             USBD_GS_CAN_Transmit
  0.97      0.50     0.01                             USBD_GS_CAN_PrepareReceive
  0.96      0.51     0.01                             USB_Handler
  0.74      0.51     0.00                             timer_get
  0.70      0.52     0.00                             can_get_error_status
  0.68      0.52     0.00                             HAL_PCD_DataOutStageCallback
  0.67      0.53     0.00                             led_indicate_trx
  0.62      0.53     0.00                             HAL_PCD_DataInStageCallback
  0.61      0.53     0.00                             USBD_LL_PrepareReceive
  0.59      0.54     0.00                             USBD_LL_Transmit
  0.54      0.54     0.00                             can_find_free_mailbox
  0.48      0.54     0.00                             USBD_LL_GetRxDataSize
  0.47      0.54     0.00                             status_is_active
  0.45      0.55     0.00                             can_is_rx_pending
  0.36      0.55     0.00                             HAL_PCD_EP_GetRxCount
  0.35      0.55     0.00                             USBD_GS_CAN_DataIn
  0.32      0.55     0.00                             USBD_GS_CAN_DfuDetachRequested
  0.32      0.55     0.00                             HAL_PCD_SetupStageCallback
  0.24      0.56     0.00                             led_set_sequence_step
  0.22      0.56     0.00                             can_enable
  0.14      0.56     0.00                             can_receive
  0.13      0.56     0.00                             assert_failed
  0.13      0.56     0.00                             USBD_GS_CAN_GetStrDesc
  0.12      0.56     0.00                             HAL_GPIO_TogglePin
  0.12      0.56     0.00                             USBD_GS_CAN_Start
  0.09      0.56     0.00                             can_is_enabled
  0.09      0.56     0.00                             HAL_PCD_EP_DB_Transmit
  0.09      0.56     0.00                             led_run_sequence
  0.08      0.56     0.00                             HAL_PCDEx_ActivateLPM
  0.07      0.56     0.00                             HAL_PCD_SOFCallback
  0.07      0.56     0.00                             USBD_LL_SOF
  0.06      0.56     0.00                             NMI_Handler
  0.04      0.56     0.00                             HAL_PCDEx_LPM_Callback
  0.03      0.56     0.00                             HardFault_Handler
  0.02      0.56     0.00                             SysTick_Handler
  0.01      0.56     0.00                             USBD_GS_CAN_SOF
  0.01      0.56     0.00                             HAL_SYSTICK_Callback
  0.01      0.56     0.00                             HAL_SYSTICK_IRQHandler
```
(To get this, I ran openocd as a gdb remote, and from gdb entered `monitor profile 5 test.out 0x8000000 0x8100000` while the target was running a `cansequence -p 10` test; then `gprof cannette_fw test.out -xb`.)
I'm surprised to see how much time we spend updating LEDs!!
Grounding led_update():

```diff
index 5453e736681d..8904a98036a2 100644
--- a/src/led.c
+++ b/src/led.c
@@ -137,6 +137,7 @@ static void led_update_sequence(led_data_t *leds)
 void led_update(led_data_t *leds)
 {
+	return;
 	switch (leds->mode) {
 	case led_mode_off:
```

increases the TX CAN bus load from 80% to 90% (on the replace-queues branch).
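For illustration only (nothing like this is in the tree): rather than stubbing led_update() out entirely, a cheap mitigation could be to rate-limit it from the main loop. A minimal sketch, reusing the existing led_update()/led_data_t API and the HAL 1 ms tick; led_update_throttled() and the header paths are made-up:

```c
/* Hypothetical sketch: run the LED state machine at most once per
 * millisecond instead of on every main-loop iteration. */
#include <stdint.h>
#include "led.h"		/* led_data_t, led_update(); assumed path */
#include "stm32f0xx_hal.h"	/* HAL_GetTick() */

static uint32_t led_last_tick;

void led_update_throttled(led_data_t *leds)
{
	uint32_t now = HAL_GetTick();

	if (now == led_last_tick)
		return;		/* tick hasn't advanced: nothing to do yet */

	led_last_tick = now;
	led_update(leds);
}
```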
I'm in favour of moving to -O2. It shouldn't be much harder to debug than -Os, and meaningful profiling runs would benefit from a test build at -Og anyway. It may still be useful to occasionally compile at other optimization levels to see where the bottlenecks are (if any).
I still think it would be nice to keep (theoretical) support for 16 kB-flash targets, but there are none known at the moment... We could revisit later; possibly with selective optimization on the critical paths (gcc has per-function attributes and pragmas for this IIRC, see the sketch below), etc.
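A minimal sketch of that selective optimization, assuming a global -Os build; usb_hot_path() is a placeholder name, but the attribute and pragmas are standard GCC:

```c
/* Sketch: force -O2 on one hot function while the rest of the
 * translation unit keeps the global -Os. */
__attribute__((optimize("O2")))
void usb_hot_path(void)		/* placeholder name */
{
	/* speed-critical code */
}

/* Or switch the level for a whole region of a file: */
#pragma GCC push_options
#pragma GCC optimize ("O2")
/* ... speed-critical functions ... */
#pragma GCC pop_options
```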
Would be interesting to see what LTO does with/to different optimization levels 🤔
We've had LTO enabled since 1712fec (~2019); it makes a pretty big difference in size (16912 bytes without vs. 13880 with) and throughput (64% with -flto, 59% without). I don't see a reason to disable -flto! I think the days of -flto breaking builds (due to UB or occasional compiler bugs) are over.
I took the liberty of adding a comment to your commit message about the removal of -fno-move-loop-invariants.
Merged in 0612b5017f322d8675deb8a999312179ee437e48, thanks!
> We've had LTO enabled since 1712fec (~2019); it makes a pretty big difference in size (16912 bytes without vs. 13880 with) and throughput (64% with -flto, 59% without). I don't see a reason to disable -flto! I think the days of -flto breaking builds (due to UB or occasional compiler bugs) are over.
Yes, LTO brings a pretty impressive performance gain and size reduction. I sometimes turn it off to get proper entries in the linker map file.
I was wondering what LTO does during the re-compilation stage if some object files are compiled with -Os and some with -O2.
This increases the code size, but it still fits into the flash even on a STM32F042:
-Os
-O2
This optimization increases the max TX CAN bus load on a STM32F072 (1 MBit/s, DLC=1) from 77% to 84%.
For completeness:
-O3
The max TX CAN bus load is 88%.