espruino / Espruino

The Espruino JavaScript interpreter - Official Repo
http://www.espruino.com/
Other
2.76k stars 743 forks source link

using newer compiler for ARM (13.2.Rel1) #2455

Closed fanoush closed 7 months ago

fanoush commented 8 months ago

Creating this issue to track more issues and discuss rationale.

Current 8-2018-q4-major 8.2.1 compiler downloaded by provision script is quite old but produced shorter code than newer ones. However when testing latest one 13.2.rel1 gcc 13.2.1 from https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads I see that it produces smaller code both for bootloader and main firmware. However there are some issues:

fanoush commented 8 months ago

Forgot one more reason. The old compiler has very slow linker, in typical case this is longest part of whole build. Newer compilers are much faster in last LTO linking phase (like ~10 seconds instead of > 1 minute on slower computers).

linking speed rm bin/*elf ; time make BOARD=BANGLEJS RELEASE=1

linking 8.2.1 (on i9 12900H - pretty fast computer, typically I use much slower machines) real 0m24.178s user 0m23.537s sys 0m0.649s linking 13.2.1 with -flto (sigle threaded, warning: using serial compilation of 11 LTRANS jobs) real 0m10.796s user 0m10.348s sys 0m0.459s linking 13.2.1 with -flto=auto real 0m1.777s user 0m13.339s sys 0m0.495s

The size reduction is not that much but it is there, here is Bangle 1 build - saves about 1.1KB BANGLE1 gcc version 13.2.1 CODE: 0x1f000 -> 0x757e8 (354100 bytes) Code area Fits before FS Area (2072b free) gcc version 8.2.1 CODE: 0x1f000 -> 0x75c34 (355180 bytes) Code area Fits before FS Area (972b free)

also the bootloader

BANGLE1 bootloader with 8.2.1 - build with untested spi flashing - only 8 bytes free

0007DFE0 00 10 00 00 00 30 08 00 41 A5 07 00 00 00 00 00 |.....0..A.......| 0007DFF0 25 C2 07 00 D1 C0 07 00 -- -- -- -- -- -- -- -- |%....... | 0007E000 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- | |

with gcc version 13.2.1 there is 244 bytes saving, now 252 bytes free

0007DEE0 01 00 FF FF 00 90 D0 03 10 00 00 00 00 10 00 00 |................| 0007DEF0 00 30 08 00 71 A9 07 00 00 00 00 00 61 C1 07 00 |.0..q.......a...| 0007DF00 09 CD 07 00 -- -- -- -- -- -- -- -- -- -- -- -- |.... | 0007DF10 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- | |

gfwilliams commented 8 months ago

Newer compilers are much faster in last LTO linking phase

Where do I sign up? :) Even after a very expensive PC upgrade that is still frustrating!

Things like this make me think I really do need to finally make a proper test harness that will test out the firmwares automatically on real hardware. It would make merging PRs like this seem a lot less risky.

font in bootloader

Very strange - it's handled in https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L839-L840

It should see space, realise it's less than char 0 and just ignore it (stepping on X by 4)

Do you have a picture of it? It should only be capable of setting pixels, and ONLY on X+0/1/2 - so if you see pixels being set at 4*X+3 (3/7/11/etc) then something is going wrong maybe in lcd_pixel

fanoush commented 8 months ago

Added PR just to see what I did with the code in ram fix if you want to test it. Don't have SWD for BANGLE2 here, will make a photo with bootloader font later.

fanoush commented 8 months ago

here are photos https://ibb.co/album/hLbQ4M , just changing compiler back and rebuilding from same source fixes it

fanoush commented 8 months ago

Where do I sign up? :) Even after a very expensive PC upgrade that is still frustrating!

my second fastest machine (Dell Optiplex 3000 thin client - 120EUR refurbished fanless thing from ebay, 3.3GHz Pentium Silver N6005)

gcc version 8.2.1 8-2018-q4-major - linking takes 57s out of 1m28.715s

make BOARD=BANGLEJS2 clean ; time make -j BOARD=BANGLEJS2 RELEASE=1 real 1m28.715s user 2m47.380s sys 0m8.753s

rm bin/*.elf ; time make -j BOARD=BANGLEJS2 RELEASE=1 real 0m57.828s user 0m56.636s sys 0m1.198s

gcc version 13.2.1 linking takes 9s our of 23s

make BOARD=BANGLEJS2 clean ; time make -j BOARD=BANGLEJS2 RELEASE=1 real 0m23.845s user 1m14.575s sys 0m8.153s

rm bin/*.elf ; time make -j BOARD=BANGLEJS2 RELEASE=1 real 0m9.345s user 0m26.890s sys 0m1.446s

this is with -flto=auto (=use more threads, autodetect # of CPU cores like make -j)

diff --git a/make/common/ARM.make b/make/common/ARM.make
index 335c1834d..c27139760 100644
--- a/make/common/ARM.make
+++ b/make/common/ARM.make
@@ -13,7 +13,7 @@ OPTIMIZEFLAGS += -fno-common -fno-exceptions -fdata-sections -ffunction-sections
 # Given we've left 10k for it, there's no real reason to enable LTO anyway.
 ifndef BOOTLOADER
 # Enable link-time optimisations (inlining across files)
-OPTIMIZEFLAGS += -flto -fno-fat-lto-objects -Wl,--allow-multiple-definition
+OPTIMIZEFLAGS += -flto=auto -fno-fat-lto-objects -Wl,--allow-multiple-definition
 DEFINES += -DLINK_TIME_OPTIMISATION
 endif

and for bootloader possibly also

diff --git a/make/family/NRF52.make b/make/family/NRF52.make
index b6000d6ea..a3d8080fa 100644
--- a/make/family/NRF52.make
+++ b/make/family/NRF52.make
@@ -150,7 +150,7 @@ endif # USE_BOOTLOADER
 ifdef BOOTLOADER
 # we're trying to compile the bootloader itself
 LINKER_FILE = $(LINKER_BOOTLOADER)
-OPTIMIZEFLAGS=-Os -flto -fno-fat-lto-objects -Wl,--allow-multiple-definition # try to reduce bootloader size
+OPTIMIZEFLAGS=-Os -flto=auto -fno-fat-lto-objects -Wl,--allow-multiple-definition # try to reduce bootloader size
 else # not BOOTLOADER - compiling something to run under a bootloader
 LINKER_FILE = $(LINKER_ESPRUINO)
 INCLUDE += -I$(NRF5X_SDK_PATH)/nrf52_config
fanoush commented 8 months ago

Good news is that the bug is in the bootloader LCD code :-)

There is mix of LCD_DATA_HEIGHT and LCD_HEIGHT

LCD_HEIGHT https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L892 LCD_DATA_HEIGHT https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L863

I guess LCD_DATA_HEIGHT should be used everywhere as that is what is in the lcd.h file so it is wrong in lcd_clear() and also in lcd_flip() in several places. When fixing it in lcd_clear() there is no garbage on screen on bootloader startup.

EDIT: for bangle 2 it looks like sensible fix but when I tried to replace it everywhere I am not sure about these places https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L449-L450 https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L453 https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L485 hard to guess what is the idea here and why it is different from other cases

I would guess that when x,y is used to access data in lcd_data it should not be multiplied by 2 (or use LCD_WIDTH/HEIGHT) or it may go past the array(?)

gfwilliams commented 8 months ago

Great spot - thanks!

Just a bit of background - to make it readable on most devices, we have an offscreen buffer and pixel-double. ymin/max should be the y values in that offscreen buffer (at half the size of the main one).

Those other bits look fine - although https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L485 could change (but it doesn't matter really as all it wants to do is make ymin>ymax).

For the other places, it's setting ymin/ymax to double because then it scans out all the pixels and uses y>>1 when indexing into the data array. And after it resets ymin/ymax

Do you want to do a PR for that fix, or shall I stick it in?


Just to add, on GCC 8, I get:

make BOARD=BANGLEJS2 clean ; time make -j BOARD=BANGLEJS2 RELEASE=1
real    0m21.997s
user    0m53.521s
sys 0m4.007s

rm bin/*.elf ; time make -j BOARD=BANGLEJS2 RELEASE=1
real    0m19.008s
user    0m18.511s
sys 0m0.496s

So it's a lot faster, but compiling everything else takes 2 seconds, and the final step takes 19 seconds - so updating the compiler would still be a great help.

fanoush commented 8 months ago

Do you want to do a PR for that fix, or shall I stick it in?

Please do without PR if possible. If you think everything else is OK then at least this causes the garbage (too much data copied) https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L892 and this is confusing but does not hurt as it gets fixed when first pixel is drawn https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L810

gfwilliams commented 8 months ago

Just done :)

gfwilliams commented 8 months ago

I just added flto=auto and it doesn't appear to change the binary on 8.2.1

Am I right in thinking that there's actually not anything else that needs to change now before we swap compiler?

I'm tempted to make a 2v21 firmware release before we do the swap, just in case anything comes up from it that we didn't notice. Then there'll be several weeks to notice any issues before 2v22 :)

fanoush commented 8 months ago

Yes, nothing else. There are those stdio warnings but should be harmless. Maybe debug builds could have more of that linked in? I do run Bangle2 build now in my watch and also started to use it for other builds but so far only flashed it to nrf52840 dongle which does not test more than Bangle2. EDIT: I also tried nrf52832 board with SDK12 and bootloader and it works. Will try stm32f103 and nrf51 in next days.

fanoush commented 8 months ago

Tried STM32F103 build for STLINK V2 USB dongle and it works fine in device however in this case new compiler produces larger code 13.2.1 PASS - size of 120304 is under 120832 bytes 8.2.1 PASS - size of 119400 is under 120832 bytes This is interesting, maybe some Cortex M3 related thing or some libraries?

tried to build MICROBIT1 and that one is slightly smaller with newer compiler 13.2.1 CODE: 0x1b000 -> 0x3e934 (145716 bytes) 8.2.1 CODE: 0x1b000 -> 0x3ea0c (145932 bytes)

Tried also PICO_R1_3 build and that one is smaller with new compiler too 13.2.1 PASS - size of 326088 is under 327680 bytes 8.2.1 PASS - size of 327448 is under 327680 bytes

and ESPRUINOBOARD, slightly bigger 13.2.1 PASS - size of 217804 is under 231424 bytes 8.2.1 PASS - size of 217712 is under 231424 bytes

So looks like Cortex M3 thing(?)

gfwilliams commented 8 months ago

Interesting! Thanks for all that testing.

It does look like an M3 thing (maybe they're putting more emphasis on M4 code generation and the M3 output is just M4 code with unsupported instructions replaced). It's a shame really as the M3 devices tend to have less flash so would really benefit.

It feels like a 1.5k+ saving on all modern platforms should outweigh 1-200 bytes extra on old platforms though, so I'm still all for this

gfwilliams commented 7 months ago

Just looking into this a bit more - it's looking good. Interestingly the MICROBIT1 build doesn't have the _close warnings, and commenting:

     'DEFINES+=-DSAVE_ON_FLASH_EXTREME',
     'DEFINES+=-DESPR_NO_DAYLIGHT_SAVING',
     'DEFINES+=-DJSVAR_FORCE_NO_INLINE=1',
     'CFLAGS += -ffreestanding', # needed for SAVE_ON_FLASH_EXTREME (jswrap_math, __aeabi_dsub)
     'ASFLAGS += -D__STARTUP_CLEAR_BSS -D__START=main',
     'LDFLAGS += -nostartfiles',

Makes them come back.

Doing the builds the same way (using the compiler in a zip in GitHub) won't work as it's now too big (even when I remove all the targets we don't need), so I've done it direct from ARM for now.

Last time it made the builds slower and unreliable as the 'external' server would sometimes fail, but we'll have to see what happens.

gfwilliams commented 7 months ago

It's in - and looks good: https://github.com/espruino/Espruino/actions/runs/7833136232

fanoush commented 7 months ago

using the compiler in a zip in GitHub) won't work as it's now too big (even when I remove all the targets we don't need)

what is the limit? got it down from 170MB to 60MB by removing targets, stripping symbols from binaries and compressing into tar.zx (XZ_OPT=-8eT0 tar -Jcvf 13.2.tar.xz arm-gnu-toolchain-13.2.Rel1-x86_64-arm-none-eabi)

fanoush commented 7 months ago

I see it described as 100MB here https://github.com/espruino/EspruinoBuildTools/tree/master/arm For that you can just strip binaries

tar -xvf arm-gnu-toolchain-13.2.rel1-x86_64-arm-none-eabi.tar.xz
cd arm-gnu-toolchain-13.2.Rel1-x86_64-arm-none-eabi/
strip  bin/* arm-none-eabi/bin/* libexec/gcc/arm-none-eabi/13.2.1/{cc*,f*,lto1}
cd ..
XZ_OPT=-8eT0 tar -Jcf 13.2.1-stripped.tar.xz arm-gnu-toolchain-13.2.Rel1-x86_64-arm-none-eabi/

gives file size 83056216 EDIT: oh I see the trick is mainly in compression flags, just stripped binaries with default xz compression options is 141691464 vs 83054996 with -8e (and -9e gives 67400860).

gfwilliams commented 7 months ago

That's great! The limit is 100 so that would be perfect.

It must be the XZ_OPT - I tried here and I deleted everything apart from:

And it still was at 120 with the default options.

Is there any side-effect of stripping the files? Does it hurt debugability? I just made the mistake of trying to google the man page for strip...

fanoush commented 7 months ago

Does it hurt debugability?

Debuggability of gcc itself when it crashes or miscompiles probably yes. This strips just compiler executables, not target libraries.

gfwilliams commented 7 months ago

This strips just compiler executables, not target libraries.

Oh wow, ok - that's perfect then. Thanks!

I'm not going to get around to this before my break so I'll reopen for now, and close when I get the compiler in GitHub