xtensa: Size optimization regression between GCC 8.4.0 and 13.2.0

rojer commented 7 months ago

We are migrating from IDF 4.4.4 to 5.2.1 and among the many changes is the toolchgain update from GCC 8.4.0 (xtensa-esp32-elf-gcc8_4_0-esp-2021r2-patch5) to 13.2.0 (xtensa-esp-elf-13.2.0_20230928).

Unfortunately, it seems that it comes with an across the board regression in the output binary size - many functions gain size, resulting in overall binary size increase, most critically we are bumping into IRAM size limits on some of our apps.

By comparing code generated with the different toolchains, i identified the following problems:

movi / mov.n : generates movi for small numbers in some cases, where mov.n could be used. 8.4 is doing it too, but 13 does it more often, it seems - at least in this particular function.
the new toolchain replaces certain literals with bit operations, for example 0xffffff is generated as: movi.n a9, -1 | srli a9, a9, 8. it saves on memory access and literal but uses 5 instruction bytes instead of 3.
sometimes this code is duplicated where previous result could be reused.

On the other hand, there are positives too:

inlined loop-based memset: size-neutral and eliminates a call. neat!
better register allocation: using a10 avoids an extra move when it's used again as an argument to a function call.

This was after just a quick look at one particular function: spi_flash_chip_winbond_page_program, it gained 10 bytes. 4 bytes of those are accounted for by initialization of .flags, but even without that the function comes out 2 bytes bigger. Here are the notes from my analysis.

rojer commented 7 months ago

ok, some of the movi /movi.n difference can be attributed to jump address alignment and the denisty can be improved with -mno-target-align. this still leaves one movi where movi.ncould be used - in the struct initialzation block. however, size reduction is significant: 8K in our case.

Lapshin commented 5 months ago

@rojer , sorry for the long delay.

Recently binary size was fixed with linker script changes:

Could you please check your project with these changes?

rojer commented 5 months ago

well, those seem unrelated to the issues i reported here that are about assembly code generation. that said - thank you for letting me know, i will see if i can test with these changes soon.

Lapshin commented 5 months ago

@rojer , sure, these changes are unrelated but that also reduces binary size significantly after the toolchain upgrade. I didn't have time to take a look at your points yet, will do this a bit later.

@jcmvbkbc , maybe you can answer them quickly?

jcmvbkbc commented 5 months ago

movi / mov.n : generates movi for small numbers in some cases, where mov.n could be used. 8.4 is doing it too, but 13 does it more often, it seems - at least in this particular function.

This is not decided by the compiler alone, the assembler and linker will transform instructions between narrow and standard form to maintain alignment/fill the gaps created by relaxations. Meaningful comparison has to take a lot of factors into an account

the new toolchain replaces certain literals with bit operations, for example 0xffffff is generated as: movi.n a9, -1 | srli a9, a9, 8. it saves on memory access and literal but uses 5 instruction bytes instead of 3.

I believe that it was meant to be that way with the patch https://gcc.gnu.org/git/gcc.git?h=cd02f15f1aecc45b2c2feae16840503549508619 I don't see a way to affect this compiler decision by any command line option. @jjsuwa-sys3175 Suwa-san, what do you think?

jjsuwa-sys3175 commented 5 months ago

movi / mov.n : generates movi for small numbers in some cases, where mov.n could be used. 8.4 is doing it too, but 13 does it more often, it seems - at least in this particular function.

This is not decided by the compiler alone, the assembler and linker will transform instructions between narrow and standard form to maintain alignment/fill the gaps created by relaxations. Meaningful comparison has to take a lot of factors into an account

I agree so.

the new toolchain replaces certain literals with bit operations, for example 0xffffff is generated as: movi.n a9, -1 | srli a9, a9, 8. it saves on memory access and literal but uses 5 instruction bytes instead of 3.

and 4 bytes of litpool entry :)

I believe that it was meant to be that way with the patch https://gcc.gnu.org/git/gcc.git?h=cd02f15f1aecc45b2c2feae16840503549508619 I don't see a way to affect this compiler decision by any command line option. @jjsuwa-sys3175 Suwa-san, what do you think?

Maybe it would be better to have an option to control whether or not constant synthesis is performed individually.

rojer commented 5 months ago

right, most of the movi/movi.n stuff was rectified with -mno-target-align - but not all.

and 4 bytes of litpool entry :)

sure, but 0xffffffff aka -1 is widely used and is shared and deduplicated. now, the really clever thing to do would be to use literals for constants that are used multiple times and use synthesis for those that aren't, in which case this would be a net reduction in size. however, the current situation is that this resulted in net gain in size, i maybe best to not apply this optimization in -Os mode?

jjsuwa-sys3175 commented 5 months ago

This may not be an excuse, but for the ESP8266 Arduino core, these patches including the constant synthesis work well in terms of reducing size and clock cycles.

right, most of the movi/movi.n stuff was rectified with -mno-target-align - but not all.

and 4 bytes of litpool entry :)

sure, but 0xffffffff aka -1 is widely used and is shared and deduplicated. now, the really clever thing to do would be to use literals for constants that are used multiple times and use synthesis for those that aren't, in which case this would be a net reduction in size. however, the current situation is that this resulted in net gain in size, i maybe best to not apply this optimization in -Os mode?

As it happens, I have a WIP patch that does exactly what you describe :)

The reason why that patch has been left in WIP for so long is that gcc's built-in litpool mechanism provides duplicate elimination but does not provide recording the number of duplicates and referencing them later. So, I have to create them myself, which is so ugly :-(

rojer commented 5 months ago

@Lapshin

Could you please check your project with these changes?

we are using 5.2.1, and these were kind of difficult to apply after the big refactoring in https://github.com/espressif/esp-idf/commit/40be44f8274f4d160cd4bcd25e37a6d72141a929 but i tried to follow the spirit if not the letter

espressif/esp-idf@9375348

discarding eh_frame if exceptions are not enabled - nice, significant reduction of ~10K

espressif/esp-idf@2b36636

couldn't apply this to app, no visible effect on bootloader

espressif/esp-idf@afffbd9

~500 byte saved

Lapshin commented 5 months ago

we are using 5.2.1

@rojer , you may consider switching to bugfix version 5.2.2 which has these two commits https://github.com/espressif/esp-idf/commit/1f3f65b40ea01b93aa4b48c72671c7d84b504766 , https://github.com/espressif/esp-idf/commit/dcf6b54e94021ad2901ad5fdc29263154d32b3d5

Regarding the third change - it's already backported to release/v5.2 branch (but changes are not synced with github yet). They seem to appear soon

no visible effect on bootloader

This would be a breaking change, will be added in v6.0

Lapshin commented 5 months ago

@rojer , you could check https://github.com/espressif/esp-idf/commit/5320ec20f9302f260265b27d55ade1bc92b92ea9 in release/v5.2 branch

rojer commented 5 months ago

@Lapshin I did a quick test with current release/v5.2 (https://github.com/espressif/esp-idf/commit/e5ffb3c57d59d8c94a4c662a8d0d76afa096bc13), and results are interesting:

Xtensa (ESP32) target binary size went down by almost 50K: 1632032 -> 1581616
Similar RISC-V (ESP32-C3) target went up by almost 12K: 1623536 -> 1635792

needless to say, i'm quite happy with (1) and not so much with (2) :) do you have an explanation for why this may be the case?

Lapshin commented 5 months ago

@rojer , I also did not expect this T_T

You can analyze why it happens for your particular project using command idf_size.py --files --diff OLD_BUILD_MAP_FILE NEW_BUILD_MAP_FILE or with mapuche that has gui

jjsuwa-sys3175 commented 5 months ago

right, most of the movi/movi.n stuff was rectified with -mno-target-align - but not all.

and 4 bytes of litpool entry :)

sure, but 0xffffffff aka -1 is widely used and is shared and deduplicated. now, the really clever thing to do would be to use literals for constants that are used multiple times and use synthesis for those that aren't, in which case this would be a net reduction in size. however, the current situation is that this resulted in net gain in size, i maybe best to not apply this optimization in -Os mode?

Please see https://github.com/gcc-mirror/gcc/commit/23141088e8fb50bf916ac0b2e364b1eef9f3569d, I hope this is what you wanted :)

rojer commented 5 months ago

yep, that looks like a nice optimization. the only question is: when will we see it in IDF?

jjsuwa-sys3175 commented 5 months ago

yep, that looks like a nice optimization. the only question is: when will we see it in IDF?

All I can say is: I am not a party to that :)

Lapshin commented 5 months ago

yep, that looks like a nice optimization. the only question is: when will we see it in IDF?

@rojer , from the IDF side we are waiting for GCC 14.2 release with bugfixes to make the new release. It looks like this will be available from v5.4

@jjsuwa-sys3175 , thank you for giving the solution quickly!

jjsuwa-sys3175 commented 5 months ago

Sorry for this being completely off-topic, but I would like to mention the use of "-mextra-l32r-costs=" in situations where speed is more important than size.

Reading data from the instruction memory, such as L32R instruction, may require implementation-dependent additional clocks in addition to 1 clock execution for the instruction itself and 1 clock delay due to the pipeline until the read result is available.

For example, on the ESP8266, a delay of 4 or 5 clocks is observed for each L32R instructions that cannot be hidden by overlapping with other instructions.

Therefore, by compiling with "-O2 -mextra-l32r-costs=5", constant synthesis (avoiding the use of time-consuming L32Rs) can be performed more aggressively (the default value of "-mextra-l32r-costs=" is 0, so not specifying it means that there are no additional clocks for the execution of L32R).

rojer commented 5 months ago

@jjsuwa-sys3175 thanks for the additional context, it's good to know. however, at least in our environment, we almost exclusively operate under the memory pressure, both RAM and flash, while cycles are essentially free or near free.

rojer commented 5 months ago

@Lapshin

You can analyze why it happens for your particular project using command idf_size.py --files --diff OLD_BUILD_MAP_FILE NEW_BUILD_MAP_FILE

see attached diffs for ESP32 and ESP32-C3. what do you see? on C3 i see 10K+ of wireless bloat... sorry, innovation :) at the same time, ESP32 lost a bunch of fat.

diff_esp32.txt diff_esp32c3.txt

rojer commented 5 months ago

also, looks like idf_size.py has a bug computing .rodata size - no way pthread.c uses 150K+ of rodata. it seems that the first entry gets bogus size.

Lapshin commented 1 month ago

@rojer , sorry for the long delay. Latest news:

Toolchain with GCC 14.2.0 was released
- It has a change https://github.com/gcc-mirror/gcc/commit/23141088e8fb50bf916ac0b2e364b1eef9f3569d

Regarding .rodata size you can learn here (link):

The size of the .rodata section in the Flash Data memory type may appear very large for a single archive. This occurs due to linker relaxations. The linker may attempt to combine object file sections with MERGE and STRINGS flags from all archives into one to perform tail string optimization. Consequently, one archive may end up with a very large .rodata section, containing string literals from other archives. This is evident in the .rodata section of the libesp_app_format.a archive. The specific compiler behavior here can be turned off by enabling CONFIG_COMPILER_NO_MERGE_CONSTANTS option (only for GCC toolchain), please read help for more details.

I will take a look at your diff files soon

Lapshin commented 1 month ago

@rojer , as I can see you already created issue https://github.com/espressif/esp-idf/issues/14076 for the part that has grown the most.

Sorry, but I can't help you here because I'm not a member of bluetooth/wifi team. But from the IDF perspective you still have some options to free some space. For example:

Disable CONFIG_ESP_CONSOLE_SECONDARY_USB_SERIAL_JTAG if you don't need this (save about 3200 bytes).
Disable CONFIG_VFS_SUPPORT_TERMIOS (save ~2000 bytes)
The full list you can learn on this page.
this page in case there is lack of free IRAM

rojer commented 2 weeks ago

sorry for the delay, finally had some spare time to test the new toolchain.

GCC 13.2.0 -> 14.2.0: nice 14.6K size reduction on Xtensa, no change on RISC-V
CONFIG_ESP_CONSOLE_SECONDARY_USB_SERIAL_JTAG - n/a (we're using 5.2.2, will keep in mind for the future)
CONFIG_VFS_SUPPORT_TERMIOS=n - additional 5.5K saved on xtensa, and 6K saved on risc-v - great!

i will also take a look at the docs

ok, i'm going to close this one, i am satisfied with the improvements made to the toolchain and config, we're good. thanks @Lapshin and @jjsuwa-sys3175 for your help, much appreciated!

rojer commented 2 weeks ago

@Lapshin i also found some time to investigate another size-related issue, for which so far we've been using a workaround: during update from 4.4 to 5.2, our c++ code (and our app is primarily c++) ballooned significantly. i found a workaround: keeping cxx_std at gnu++17 instead of gnu++2b made it go away, but didn't investigate at the time. well, today i did, and found the reason: https://github.com/espressif/crosstool-NG/issues/67

while c++17 is ok, we'd very much like to move on to newer and shinier things but we can't, because of the size bloat. could you please have a look?

while looking, i found that actually, switching from c++17 to c++2b should be beneficial for the size as there are more size reductions than increases: for the test app i'm using, 869 symbols shrunk in size compared to 474 that grew. unfortunately, the std::string-related bloat seems to have drowned out all the improvements made elsewhere and net gain in size is about 16K for us, which is a lot.

espressif / crosstool-NG

xtensa: Size optimization regression between GCC 8.4.0 and 13.2.0 #52