Closed fanoush closed 7 months ago
Forgot one more reason. The old compiler has very slow linker, in typical case this is longest part of whole build. Newer compilers are much faster in last LTO linking phase (like ~10 seconds instead of > 1 minute on slower computers).
linking speed rm bin/*elf ; time make BOARD=BANGLEJS RELEASE=1
linking 8.2.1 (on i9 12900H - pretty fast computer, typically I use much slower machines) real 0m24.178s user 0m23.537s sys 0m0.649s linking 13.2.1 with -flto (sigle threaded, warning: using serial compilation of 11 LTRANS jobs) real 0m10.796s user 0m10.348s sys 0m0.459s linking 13.2.1 with -flto=auto real 0m1.777s user 0m13.339s sys 0m0.495s
The size reduction is not that much but it is there, here is Bangle 1 build - saves about 1.1KB BANGLE1 gcc version 13.2.1 CODE: 0x1f000 -> 0x757e8 (354100 bytes) Code area Fits before FS Area (2072b free) gcc version 8.2.1 CODE: 0x1f000 -> 0x75c34 (355180 bytes) Code area Fits before FS Area (972b free)
also the bootloader
BANGLE1 bootloader with 8.2.1 - build with untested spi flashing - only 8 bytes free
0007DFE0 00 10 00 00 00 30 08 00 41 A5 07 00 00 00 00 00 |.....0..A.......| 0007DFF0 25 C2 07 00 D1 C0 07 00 -- -- -- -- -- -- -- -- |%....... | 0007E000 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- | |
with gcc version 13.2.1 there is 244 bytes saving, now 252 bytes free
0007DEE0 01 00 FF FF 00 90 D0 03 10 00 00 00 00 10 00 00 |................| 0007DEF0 00 30 08 00 71 A9 07 00 00 00 00 00 61 C1 07 00 |.0..q.......a...| 0007DF00 09 CD 07 00 -- -- -- -- -- -- -- -- -- -- -- -- |.... | 0007DF10 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- | |
Newer compilers are much faster in last LTO linking phase
Where do I sign up? :) Even after a very expensive PC upgrade that is still frustrating!
Things like this make me think I really do need to finally make a proper test harness that will test out the firmwares automatically on real hardware. It would make merging PRs like this seem a lot less risky.
font in bootloader
Very strange - it's handled in https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L839-L840
It should see space, realise it's less than char 0
and just ignore it (stepping on X by 4)
Do you have a picture of it? It should only be capable of setting pixels, and ONLY on X+0/1/2 - so if you see pixels being set at 4*X+3
(3/7/11/etc) then something is going wrong maybe in lcd_pixel
Added PR just to see what I did with the code in ram fix if you want to test it. Don't have SWD for BANGLE2 here, will make a photo with bootloader font later.
here are photos https://ibb.co/album/hLbQ4M , just changing compiler back and rebuilding from same source fixes it
Where do I sign up? :) Even after a very expensive PC upgrade that is still frustrating!
my second fastest machine (Dell Optiplex 3000 thin client - 120EUR refurbished fanless thing from ebay, 3.3GHz Pentium Silver N6005)
gcc version 8.2.1 8-2018-q4-major - linking takes 57s out of 1m28.715s
make BOARD=BANGLEJS2 clean ; time make -j BOARD=BANGLEJS2 RELEASE=1 real 1m28.715s user 2m47.380s sys 0m8.753s
rm bin/*.elf ; time make -j BOARD=BANGLEJS2 RELEASE=1 real 0m57.828s user 0m56.636s sys 0m1.198s
gcc version 13.2.1 linking takes 9s our of 23s
make BOARD=BANGLEJS2 clean ; time make -j BOARD=BANGLEJS2 RELEASE=1 real 0m23.845s user 1m14.575s sys 0m8.153s
rm bin/*.elf ; time make -j BOARD=BANGLEJS2 RELEASE=1 real 0m9.345s user 0m26.890s sys 0m1.446s
this is with -flto=auto
(=use more threads, autodetect # of CPU cores like make -j)
diff --git a/make/common/ARM.make b/make/common/ARM.make
index 335c1834d..c27139760 100644
--- a/make/common/ARM.make
+++ b/make/common/ARM.make
@@ -13,7 +13,7 @@ OPTIMIZEFLAGS += -fno-common -fno-exceptions -fdata-sections -ffunction-sections
# Given we've left 10k for it, there's no real reason to enable LTO anyway.
ifndef BOOTLOADER
# Enable link-time optimisations (inlining across files)
-OPTIMIZEFLAGS += -flto -fno-fat-lto-objects -Wl,--allow-multiple-definition
+OPTIMIZEFLAGS += -flto=auto -fno-fat-lto-objects -Wl,--allow-multiple-definition
DEFINES += -DLINK_TIME_OPTIMISATION
endif
and for bootloader possibly also
diff --git a/make/family/NRF52.make b/make/family/NRF52.make
index b6000d6ea..a3d8080fa 100644
--- a/make/family/NRF52.make
+++ b/make/family/NRF52.make
@@ -150,7 +150,7 @@ endif # USE_BOOTLOADER
ifdef BOOTLOADER
# we're trying to compile the bootloader itself
LINKER_FILE = $(LINKER_BOOTLOADER)
-OPTIMIZEFLAGS=-Os -flto -fno-fat-lto-objects -Wl,--allow-multiple-definition # try to reduce bootloader size
+OPTIMIZEFLAGS=-Os -flto=auto -fno-fat-lto-objects -Wl,--allow-multiple-definition # try to reduce bootloader size
else # not BOOTLOADER - compiling something to run under a bootloader
LINKER_FILE = $(LINKER_ESPRUINO)
INCLUDE += -I$(NRF5X_SDK_PATH)/nrf52_config
Good news is that the bug is in the bootloader LCD code :-)
There is mix of LCD_DATA_HEIGHT
and LCD_HEIGHT
LCD_HEIGHT https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L892 LCD_DATA_HEIGHT https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L863
I guess LCD_DATA_HEIGHT should be used everywhere as that is what is in the lcd.h
file so it is wrong in lcd_clear()
and also in lcd_flip()
in several places. When fixing it in lcd_clear()
there is no garbage on screen on bootloader startup.
EDIT: for bangle 2 it looks like sensible fix but when I tried to replace it everywhere I am not sure about these places https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L449-L450 https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L453 https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L485 hard to guess what is the idea here and why it is different from other cases
I would guess that when x,y is used to access data in lcd_data
it should not be multiplied by 2 (or use LCD_WIDTH/HEIGHT) or it may go past the array(?)
Great spot - thanks!
Just a bit of background - to make it readable on most devices, we have an offscreen buffer and pixel-double. ymin/max
should be the y values in that offscreen buffer (at half the size of the main one).
Those other bits look fine - although https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L485 could change (but it doesn't matter really as all it wants to do is make ymin>ymax
).
For the other places, it's setting ymin/ymax to double because then it scans out all the pixels and uses y>>1
when indexing into the data array. And after it resets ymin/ymax
Do you want to do a PR for that fix, or shall I stick it in?
Just to add, on GCC 8, I get:
make BOARD=BANGLEJS2 clean ; time make -j BOARD=BANGLEJS2 RELEASE=1
real 0m21.997s
user 0m53.521s
sys 0m4.007s
rm bin/*.elf ; time make -j BOARD=BANGLEJS2 RELEASE=1
real 0m19.008s
user 0m18.511s
sys 0m0.496s
So it's a lot faster, but compiling everything else takes 2 seconds, and the final step takes 19 seconds - so updating the compiler would still be a great help.
Do you want to do a PR for that fix, or shall I stick it in?
Please do without PR if possible. If you think everything else is OK then at least this causes the garbage (too much data copied) https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L892 and this is confusing but does not hurt as it gets fixed when first pixel is drawn https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L810
Just done :)
I just added flto=auto
and it doesn't appear to change the binary on 8.2.1
Am I right in thinking that there's actually not anything else that needs to change now before we swap compiler?
I'm tempted to make a 2v21 firmware release before we do the swap, just in case anything comes up from it that we didn't notice. Then there'll be several weeks to notice any issues before 2v22 :)
Yes, nothing else. There are those stdio warnings but should be harmless. Maybe debug builds could have more of that linked in? I do run Bangle2 build now in my watch and also started to use it for other builds but so far only flashed it to nrf52840 dongle which does not test more than Bangle2. EDIT: I also tried nrf52832 board with SDK12 and bootloader and it works. Will try stm32f103 and nrf51 in next days.
Tried STM32F103 build for STLINK V2 USB dongle and it works fine in device however in this case new compiler produces larger code 13.2.1 PASS - size of 120304 is under 120832 bytes 8.2.1 PASS - size of 119400 is under 120832 bytes This is interesting, maybe some Cortex M3 related thing or some libraries?
tried to build MICROBIT1 and that one is slightly smaller with newer compiler 13.2.1 CODE: 0x1b000 -> 0x3e934 (145716 bytes) 8.2.1 CODE: 0x1b000 -> 0x3ea0c (145932 bytes)
Tried also PICO_R1_3 build and that one is smaller with new compiler too 13.2.1 PASS - size of 326088 is under 327680 bytes 8.2.1 PASS - size of 327448 is under 327680 bytes
and ESPRUINOBOARD, slightly bigger 13.2.1 PASS - size of 217804 is under 231424 bytes 8.2.1 PASS - size of 217712 is under 231424 bytes
So looks like Cortex M3 thing(?)
Interesting! Thanks for all that testing.
It does look like an M3 thing (maybe they're putting more emphasis on M4 code generation and the M3 output is just M4 code with unsupported instructions replaced). It's a shame really as the M3 devices tend to have less flash so would really benefit.
It feels like a 1.5k+ saving on all modern platforms should outweigh 1-200 bytes extra on old platforms though, so I'm still all for this
Just looking into this a bit more - it's looking good. Interestingly the MICROBIT1 build doesn't have the _close
warnings, and commenting:
'DEFINES+=-DSAVE_ON_FLASH_EXTREME',
'DEFINES+=-DESPR_NO_DAYLIGHT_SAVING',
'DEFINES+=-DJSVAR_FORCE_NO_INLINE=1',
'CFLAGS += -ffreestanding', # needed for SAVE_ON_FLASH_EXTREME (jswrap_math, __aeabi_dsub)
'ASFLAGS += -D__STARTUP_CLEAR_BSS -D__START=main',
'LDFLAGS += -nostartfiles',
Makes them come back.
Doing the builds the same way (using the compiler in a zip in GitHub) won't work as it's now too big (even when I remove all the targets we don't need), so I've done it direct from ARM for now.
Last time it made the builds slower and unreliable as the 'external' server would sometimes fail, but we'll have to see what happens.
It's in - and looks good: https://github.com/espruino/Espruino/actions/runs/7833136232
using the compiler in a zip in GitHub) won't work as it's now too big (even when I remove all the targets we don't need)
what is the limit? got it down from 170MB to 60MB by removing targets, stripping symbols from binaries and compressing into tar.zx (XZ_OPT=-8eT0 tar -Jcvf 13.2.tar.xz arm-gnu-toolchain-13.2.Rel1-x86_64-arm-none-eabi
)
I see it described as 100MB here https://github.com/espruino/EspruinoBuildTools/tree/master/arm For that you can just strip binaries
tar -xvf arm-gnu-toolchain-13.2.rel1-x86_64-arm-none-eabi.tar.xz
cd arm-gnu-toolchain-13.2.Rel1-x86_64-arm-none-eabi/
strip bin/* arm-none-eabi/bin/* libexec/gcc/arm-none-eabi/13.2.1/{cc*,f*,lto1}
cd ..
XZ_OPT=-8eT0 tar -Jcf 13.2.1-stripped.tar.xz arm-gnu-toolchain-13.2.Rel1-x86_64-arm-none-eabi/
gives file size 83056216 EDIT: oh I see the trick is mainly in compression flags, just stripped binaries with default xz compression options is 141691464 vs 83054996 with -8e (and -9e gives 67400860).
That's great! The limit is 100 so that would be perfect.
It must be the XZ_OPT - I tried here and I deleted everything apart from:
And it still was at 120 with the default options.
Is there any side-effect of stripping the files? Does it hurt debugability? I just made the mistake of trying to google the man page for strip...
Does it hurt debugability?
Debuggability of gcc itself when it crashes or miscompiles probably yes. This strips just compiler executables, not target libraries.
This strips just compiler executables, not target libraries.
Oh wow, ok - that's perfect then. Thanks!
I'm not going to get around to this before my break so I'll reopen for now, and close when I get the compiler in GitHub
Creating this issue to track more issues and discuss rationale.
Current 8-2018-q4-major 8.2.1 compiler downloaded by provision script is quite old but produced shorter code than newer ones. However when testing latest one 13.2.rel1 gcc 13.2.1 from https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads I see that it produces smaller code both for bootloader and main firmware. However there are some issues:
There is a breakage with newer gcc compilers in bootloader build - it cannot handle code in ".data" section, gives "Error: leb128 operand is an undefined symbol". https://www.mail-archive.com/gcc-bugs@gcc.gnu.org/msg718258.html I have a fix based on https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103698#c4 that seems to work, so I can create a PR, it works with old compiler too
When linking main firmware it produces set of (likely harmless) warnings like
Discussed here https://community.arm.com/support-forums/f/compilers-and-libraries-forum/53519/arm-gcc-11-3-rel1-newlib-nano-linker-warning-_read-is-not-implemented-and-will-always-fail Looks like it works fine and the code calling this is not present in final binary but it is a bit strange. I did not succeed to make it go away when messing with nosys specs as mentioned in various places. However I don't see this in bootloader build (maybe something from stdio is not pulled in there at all, in main FW it may be there but optimized out).
For BANGLE2 I have tested both bootloader and main firmware and in general it seems to work for me - Espruino works, I see no issue with any apps I use. Bootloader can start Nordic DFU, can also flash update from storage file however there is one issue - the screen output/font rendering in bootloader is somehow garbled most probably when printing space character (?), font letters can be read but there are random pixels set between that. I briefly checked font rendering but don't see how space is rendered, any tips? Can be different layout after linking or alignment issue or some timing issue however other font letters are OK and screen works perfectly fine in main firmware.